
Archived article from The Daily Perspective, published 1 March 2026.

Technology

AMD's MI355X Does More With Less Silicon — And It's Catching Nvidia

New engineering disclosures at ISSCC 2026 reveal how AMD squeezed vastly more AI compute out of fewer transistors, challenging Nvidia's grip on the data centre accelerator market.

Image: Tom's Hardware
Key Points
  • AMD engineer Ramasamy Adaikkalavan presented MI355X internals at ISSCC 2026 in San Francisco on February 16, revealing a major architectural overhaul.
  • Each compute die now has 32 active compute units, down from 38, but FP8 throughput per unit doubled from 4,096 to 8,192 FLOPs per clock cycle.
  • AMD's own benchmarks show the MI355X matching or edging Nvidia's GB200 on select AI training tasks, though independent tests show Nvidia still leads in some workloads.
  • The chip carries 288 GB of HBM3E memory at 8 TB/s bandwidth, significantly more capacity than Nvidia's competing B200 GPU.
  • Australian organisations investing in AI infrastructure should weigh AMD's memory capacity and open software advantages against Nvidia's more mature ecosystem and rack-scale integration.

Less is sometimes more. That is the counterintuitive engineering lesson AMD chose to put centre stage at the IEEE International Solid-State Circuits Conference (ISSCC) in San Francisco last month, when the company's fellow design engineer Ramasamy Adaikkalavan laid bare the architectural choices behind the Instinct MI355X accelerator. The headline finding: AMD deliberately reduced the number of processing units on each compute die, then redesigned what remained so thoroughly that the chip still delivers roughly double the useful throughput of its predecessor.

For Australian enterprises and research institutions evaluating AI infrastructure spending, the disclosure matters. The global accelerator market is dominated by two players, and for years that has meant one real choice. The MI355X is AMD's most serious attempt yet to offer a credible alternative.

Fewer Units, Twice the Output

As reported by Tom's Hardware, each Accelerator Complex Die (XCD) in the MI355X contains 32 active compute units, down from 38 in the MI300X. Yet AMD doubled per-CU FP8 throughput in the process, from 4,096 to 8,192 FLOPs per clock, by redesigning the matrix execution hardware rather than simply adding more of it. The drop to 32 units was not simply a cost-cutting measure. Adaikkalavan explained that the count is intentional: "It maintains a clean power-of-two structure, which simplifies tensor tiling and workload partitioning for the AI kernels." A power-of-two count lets AI kernels divide work evenly across the hardware, reducing the tail effect: the performance penalty incurred when the last batch of work does not fill the available compute resources.
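To make the tail effect concrete, here is a minimal, purely illustrative sketch; the tile counts are hypothetical and this is not AMD code. It shows how a power-of-two unit count keeps every scheduling wave full, while a 38-unit count can leave slots idle in the final wave.

```python
# Toy model of the "tail effect": work split into tiles is scheduled in waves
# across the available compute units; a partial final wave wastes capacity.
import math

def wave_utilisation(total_tiles: int, compute_units: int) -> float:
    """Fraction of CU slots doing useful work across all scheduling waves."""
    waves = math.ceil(total_tiles / compute_units)   # passes needed over the CUs
    return total_tiles / (waves * compute_units)     # idle slots in the last wave lower this

for tiles in (96, 256, 4096):                        # power-of-two-friendly tile counts
    print(tiles,
          round(wave_utilisation(tiles, 32), 3),     # 32 CUs: every wave is full
          round(wave_utilisation(tiles, 38), 3))     # 38 CUs: a partial final wave
```

With 96 tiles, for example, 32 units run three perfectly full waves, while 38 units also need three waves but leave 18 slots idle in the last one, dropping utilisation to roughly 84 per cent.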

The result is that the MI355X delivers five petaflops of FP8 compute, a 1.9x improvement over the MI300X, while fitting within the same 110 square millimetre die area per Accelerator Complex Die. That is a meaningful demonstration of what architectural discipline can achieve when raw transistor count is treated as a constraint rather than a solution.
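For readers who want to sanity-check the five-petaflop figure, the arithmetic below reproduces it from the per-unit numbers in the talk. The XCD count of eight and the roughly 2.4 GHz peak engine clock are assumptions drawn from AMD's published MI355X specifications rather than from the ISSCC presentation itself.

```python
# Back-of-the-envelope check of the quoted FP8 peak (assumed figures marked below).
xcds            = 8          # assumed XCDs per package (AMD's public MI355X specs)
cus_per_xcd     = 32         # active compute units per XCD (from the ISSCC talk)
fp8_ops_per_clk = 8192       # FP8 operations per CU per clock (from the ISSCC talk)
clock_hz        = 2.4e9      # assumed peak engine clock (AMD's public specs)

peak_fp8 = xcds * cus_per_xcd * fp8_ops_per_clk * clock_hz
print(f"{peak_fp8 / 1e15:.2f} PFLOPS")   # ~5.03 PFLOPS, matching the quoted figure
```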

The XCD also gained two additional metal routing layers with the jump from TSMC's N5 to N3P process node, increasing the metal stack from 15 to 17 layers. "Even though Moore's law is slowing down, we still get some good linear scaling in logic density," Adaikkalavan noted.

A Leaner I/O Subsystem

The MI300X used four separate I/O dies, but AMD has reduced this to two larger dies in the MI355X, which are directly connected to each other. AMD says this delivers meaningful efficiency gains beyond simply reducing die count. Fewer die-to-die crossings have enabled AMD to remove the circuitry previously required to handle domain crossings and protocol translations, and the freed-up area went into widening the Infinity Fabric data pipeline so that peak HBM bandwidth could be delivered at lower operating voltages and frequencies.

AMD claims 1.3x better HBM read bandwidth per watt compared to the MI300X as a result; the raw bandwidth figure increased 1.5x, from 5.3 to 8.0 TB/s, but the efficiency gain came from running the fabric at a less power-hungry operating point. The chip carries 288 GB of HBM3E memory delivering a peak bandwidth of 8 TB/s. On capacity, that compares favourably to Nvidia's B200, which carries 192 GB of HBM3E, a full 96 GB less, at a broadly comparable peak bandwidth of around 8 TB/s.
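Those two ratios also imply how much extra power the faster memory subsystem draws. The sketch below is simple arithmetic on the published figures, not a disclosed power number.

```python
# Implied power growth from AMD's two ratios (illustrative arithmetic only).
bandwidth_gain   = 8.0 / 5.3        # ~1.51x raw HBM read bandwidth vs the MI300X
efficiency_gain  = 1.3              # AMD's claimed bandwidth-per-watt improvement
implied_power_up = bandwidth_gain / efficiency_gain
print(f"{implied_power_up:.2f}x")   # ~1.16x the power for ~1.5x the bandwidth
```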

Where AMD Stands Against Nvidia

AMD's own performance claims are compelling, if not without caveats. The MI355X achieved 93,045 tokens per second on the Llama 2 70B benchmark in MLPerf Inference v5.1, representing a 2.7x improvement over the MI325X. On training, AMD's data shows the MI355X completing a Llama 2 70B LoRA fine-tuning run about 10 per cent faster than the GB200: 10.18 minutes versus 11.15. Adaikkalavan was candid about what that parity represents: "We are actually matching the performance of the more expensive and complex GB200. It tells you a couple of things. One, we have strong hardware, which we always knew. And second, the open software frameworks have made tremendous progress."

Those benchmarks deserve scrutiny, however. AMD's training result came from MLPerf Training v5.1 using FP4, while the Nvidia figure is the GB200's last published FP8 score from MLPerf Training v5.0; Nvidia has not submitted a comparable FP4 training result. The comparison is therefore not perfectly apples-to-apples, and independent analysts have flagged this. In internal throughput comparisons, AMD showed roughly a threefold improvement in token generation across several models, but those figures pit the MI355X's FP4 results against the MI300X's FP8 results, and the MI300X never supported FP4, so the data does not isolate hardware gains from software and data-format improvements.

The picture shifts considerably in Mixture-of-Experts (MoE) workloads, which are increasingly common in frontier AI models. Recent MoE inference benchmarks indicate that Nvidia's Blackwell-based GB200 NVL72 rack-scale platform has demonstrated up to 28 times higher throughput than AMD's Instinct MI355X in high-concurrency MoE scenarios such as DeepSeek-R1. That gap, analysts argue, is less about raw chip performance and more about AMD currently lacking an equivalent to Nvidia's NVLink Switch-based rack fabric, which limits MoE scaling efficiency at high concurrency.

The Software Question

Hardware specifications only tell part of the story. Nvidia has the more mature CUDA ecosystem, while AMD uses ROCm with native PyTorch and TensorFlow support. For many enterprise buyers, particularly those with existing workflows and trained teams, the software switching cost is as consequential as the price per GPU. Australian organisations running complex AI pipelines on existing CUDA-optimised toolchains face genuine friction in any migration, regardless of how compelling the hardware specifications appear on paper.

That said, AMD's software position has improved substantially. An open benchmarking initiative running vLLM workloads across multiple cloud providers concluded that the MI355X matches or beats competing GPUs on tokens per dollar and offers approximately a threefold improvement in tokens per megawatt compared with previous AMD generations. The open-source nature of AMD's ROCm platform also matters in environments where vendor lock-in is a concern, particularly in public research institutions and government-adjacent data centres where procurement rules favour open ecosystems.

Implications for Australian AI Infrastructure

Australia's research and enterprise sectors are deepening their AI compute investments, from university supercomputing nodes to commercial cloud deployments through local hyperscaler regions. The CSIRO and major universities increasingly rely on GPU-accelerated clusters for climate modelling, genomics, and large language model research. For those buyers, AMD's memory advantage is tangible: keeping a 400-billion-parameter model entirely within a single accelerator's memory removes the need for expensive model-parallelism sharding across multiple accelerators, which simplifies deployment and reduces latency.
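As a rough sizing sketch, counting weights only (real deployments also need headroom for the KV cache and activations), the arithmetic below shows why the 288 GB figure matters for that size class: a 400-billion-parameter model fits in a single MI355X at 4-bit precision, and does not fit in 192 GB at any of these precisions.

```python
# Weights-only memory footprint of a 400B-parameter model at common precisions.
def weights_gb(params_billions: float, bits_per_param: int) -> float:
    return params_billions * 1e9 * bits_per_param / 8 / 1e9   # parameters -> gigabytes

for bits in (16, 8, 4):
    size = weights_gb(400, bits)
    print(f"400B @ {bits}-bit: {size:.0f} GB  "
          f"fits in 288 GB: {size <= 288}  fits in 192 GB: {size <= 192}")
# 16-bit: 800 GB, 8-bit: 400 GB, 4-bit: 200 GB (only the last fits, and only in 288 GB)
```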

Cost is also a factor that procurement officers cannot ignore. Cloud pricing data tracked by independent platforms shows MI355X instances available at roughly $5.45 per GPU-hour on average, while GB200 instances sit considerably higher. That gap narrows, however, once software engineering time and the denser compute per rack that Nvidia's integrated systems deliver on MoE workloads are factored in. Neither platform is unambiguously cheaper at scale.

The ISSCC presentation confirms that AMD is no longer competing on volume alone. The MI355X's engineering represents a genuine architectural rethink, one that prioritises efficiency per unit of silicon over brute-force transistor counts. Whether that philosophy translates into market share gains against Nvidia's deeply entrenched position will depend as much on software maturity and system-level integration as on any benchmark result from a conference stage in San Francisco. For Australian IT decision-makers, the honest answer is that both platforms now have legitimate claims depending on the workload. The prudent approach is rigorous workload-specific testing before committing to either ecosystem at scale. That is less satisfying than a clear winner, but it is closer to the truth.

Zara Mitchell

Zara Mitchell is an AI editorial persona created by The Daily Perspective, covering global cyber threats, data breaches, and digital privacy issues with technical authority and accessible writing. Articles published under this persona are generated using artificial intelligence with editorial quality controls.