Nvidia has put its largest acquisition ever to work. At GTC 2026, CEO Jensen Huang unveiled how the $20 billion Groq acquisition translates into silicon: the Groq 3 language processing unit (LPU), now integrated into the Vera Rubin platform as a dedicated inference accelerator.
The speed case
Using this technology, Nvidia can now serve massive trillion-parameter large language models (LLMs) at hundreds or even thousands of tokens a second per user, according to Ian Buck, VP of Hyperscale and HPC at Nvidia. This matters because in today's AI infrastructure wars, speed costs money. By combining its GPUs with Groq's LPUs, Nvidia is wagering that inference providers will be able to charge as much as $45 per million tokens generated, compared with OpenAI's current $15 per million output tokens for its top model.
The real question is why you would want to pay three times as much. The answer: latency. While Nvidia's Rubin GPUs excel at processing prompts in bulk, they are less effective at the bandwidth-hungry task of generating individual tokens with minimal delay. Groq's latest chip tech achieves 150 TB/s of memory bandwidth compared to Nvidia's 22 TB/s, making it an ideal decode accelerator.
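A simplified roofline model shows why that bandwidth gap matters for decode. At batch size one, generating each new token means streaming the model's weights through the chip once, so tokens per second is capped at roughly memory bandwidth divided by weight bytes. The sketch below applies that rule of thumb to the figures above, assuming a 500 GB weight footprint (a trillion parameters at 4-bit precision) and treating the quoted bandwidths as fully usable; it ignores batching, KV-cache traffic, expert sparsity, and interconnect overhead, so the numbers are order-of-magnitude only.

```python
# Rough roofline for single-user decode: each new token requires streaming the
# model weights through the chip once, so generation speed is roughly capped at
# memory bandwidth / weight bytes. Ignores batching, KV-cache reads, expert
# sparsity, and interconnect overhead; order-of-magnitude only.

WEIGHT_BYTES = 1e12 * 0.5   # assumed: one trillion parameters at 4-bit precision (~500 GB)

def peak_decode_tokens_per_s(bandwidth_bytes_per_s: float) -> float:
    return bandwidth_bytes_per_s / WEIGHT_BYTES

lpu_bandwidth = 150e12      # 150 TB/s, the Groq 3 figure quoted above
gpu_bandwidth = 22e12       # 22 TB/s, the Rubin figure quoted above

print(f"LPU ceiling: ~{peak_decode_tokens_per_s(lpu_bandwidth):.0f} tokens/s per user")  # ~300
print(f"GPU ceiling: ~{peak_decode_tokens_per_s(gpu_bandwidth):.0f} tokens/s per user")  # ~44
```

Layer batching and sparsity back in and the absolute numbers move, but the roughly seven-fold gap between the two ceilings is the latency advantage Nvidia is selling.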
Nvidia plans to cram 256 of the chips into a new LPX rack system, connected via a custom Spectrum-X interconnect to a neighbouring Vera Rubin NVL72 rack. Used together, the two racks could give customers 35 times higher throughput per megawatt of power and 10 times more revenue opportunity.
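Back-of-envelope arithmetic shows how speed and price compound into a revenue claim of that shape. The serving rates in the sketch below are purely illustrative assumptions; only the $45 and $15 per-million-token prices come from the figures quoted above.

```python
# Illustrative only: how per-user revenue scales with serving speed and price.
# The token rates below are assumed for the sake of the sketch; the $45 and $15
# per million output tokens are the prices quoted earlier in this piece.

def revenue_per_user_hour(tokens_per_s: float, usd_per_million_tokens: float) -> float:
    """Revenue from one user streaming output continuously for an hour."""
    return tokens_per_s * 3600 / 1_000_000 * usd_per_million_tokens

premium = revenue_per_user_hour(1_000, 45.0)   # assumed Groq-accelerated serving rate
baseline = revenue_per_user_hour(200, 15.0)    # assumed conventional GPU serving rate

print(f"premium serving:  ${premium:.2f} per user-hour")    # $162.00
print(f"baseline serving: ${baseline:.2f} per user-hour")   # $10.80
print(f"revenue multiple: {premium / baseline:.0f}x")       # 15x
```

Under those assumed rates the premium tier earns roughly 15 times more per user-hour. The exact multiple depends entirely on the numbers chosen, but it shows why faster decode feeds straight into revenue per rack.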
The maths of specialisation
Here is where the trade-offs emerge. Each Groq 3 LPU contains 500 MB of on-board memory, about 1/500th of the capacity of Nvidia's Rubin GPU. This explains why so many chips are needed: even with 256 chips per rack, that is only 128 GB of ultra-fast memory, nowhere near enough for trillion-parameter models like Kimi K2. At 4-bit precision, Nvidia itself notes, roughly 1,000 LPUs would be needed to hold a single trillion-parameter model in memory. LPUs won't replace Nvidia's GPUs but rather augment them.
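The arithmetic behind those figures is easy to check, as the quick sketch below shows. It counts only the model weights, so KV cache and activations would push the chip count higher still.

```python
# Checking the capacity arithmetic: weights only, ignoring KV cache and activations.

LPU_MEMORY_GB = 0.5        # 500 MB of on-board memory per Groq 3 LPU
LPUS_PER_RACK = 256

rack_memory_gb = LPU_MEMORY_GB * LPUS_PER_RACK
print(f"Fast memory per LPX rack: {rack_memory_gb:.0f} GB")                 # 128 GB

params = 1e12              # a trillion-parameter model
bytes_per_param = 0.5      # 4-bit precision
model_gb = params * bytes_per_param / 1e9
print(f"Weights at 4-bit precision: {model_gb:.0f} GB")                     # 500 GB
print(f"LPUs needed to hold the weights: {model_gb / LPU_MEMORY_GB:.0f}")   # 1,000
```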
This creates a genuine design trade-off. Groq's LPUs sacrifice capacity for speed. They excel at the specific job of generating output token by token with minimal latency. They are terrible at everything else.
Ultra-low-latency inference has previously been dominated by a handful of boutique chip makers: Cerebras, SambaNova, and Groq itself, which Nvidia all but absorbed in last year's acquihire. That absorption takes Nvidia's chief specialist rival off the board. Interestingly, AWS announced a competing partnership with Cerebras at nearly the same time, suggesting the market has decided specialised inference chips are worth the engineering effort.
The real strategic implication is that Nvidia's era of the universal accelerator may be ending. For years, a single GPU did training, inference, simulation, and graphics work. The Vera Rubin platform with Groq integration abandons that philosophy. It says different workloads need different silicon. That is both a strength, because it allows for better optimisation, and a weakness, because it adds architectural complexity to data centre operations.
Nvidia hired Groq founder Jonathan Ross, president Sunny Madra, and other members of the Groq team as part of the December acquisition, and is now folding their technology into the Vera Rubin platform. The 256-LPU Groq 3 LPX rack will sit alongside the Vera Rubin rack-scale system shipping later this year, with Nvidia crediting the combined system for the 35-fold gain in tokens per watt cited above.
The Rubin platform overall harnesses extreme co-design across hardware and software to deliver up to a 10-fold reduction in inference token cost compared with Nvidia's Blackwell generation. That efficiency gain is real. Whether customers will pay the $45-per-million-token premium for Groq-accelerated speed remains an open question. What is clear is that the economics of inference are driving a new era of hardware specialisation.