
Archived article from The Daily Perspective, published 13 March 2026.

Technology

Nvidia faces a reckoning on token speed at GTC 2026

Groq acquisition signals shift from raw power to responsive AI inference

Image: The Register
Key Points
  • Nvidia's GTC conference (March 16-19, San Jose) will focus on how the company integrates Groq technology into its inference strategy.
  • Modern AI applications like code assistants and agent systems generate massive token volumes requiring speed that Nvidia's current GPUs struggle to match.
  • Groq's SRAM-based architecture excels at latency-sensitive tasks where Nvidia's rack-scale systems become less cost efficient above roughly 70 tokens per second per user.
  • The Rubin GPU announcement signals Nvidia's commitment to performance gains, though cooling requirements may drive some buyers toward AMD alternatives.

Nvidia has a problem it cannot ignore. The applications driving real revenue in artificial intelligence today—code completion assistants, autonomous agents, voice AI—generate tokens at a pace that the company's current architecture struggles to match. This week, as CEO Jensen Huang takes the stage at GTC in San Jose, investors will be watching closely to see how the chipmaker plans to close that gap.

The issue comes down to a deceptively simple metric: tokens per second. Below about 70 tokens per second per user, Nvidia's rack-scale systems dominate in cost efficiency, but as interactivity increases, smaller systems become more cost effective. That ceiling matters enormously. Conversational AI needs to feel natural. Agents making decisions in real time cannot afford stuttering responses.
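As a rough illustration of what those rates mean for a user, the sketch below converts tokens per second into the average gap between streamed tokens. The 70 tokens-per-second threshold is the figure discussed in this article; the other rates are illustrative assumptions, not measurements of any chip.

```python
# Back-of-envelope conversion: per-user token rate -> average gap between
# streamed tokens. The 70 tok/s threshold is the figure discussed in this
# article; the other rates are illustrative, not measured on any hardware.

def ms_per_token(tokens_per_second: float) -> float:
    """Average interval between streamed tokens, in milliseconds."""
    return 1000.0 / tokens_per_second

rates = [("bulk batch serving", 20),
         ("threshold cited here", 70),
         ("low-latency single stream", 300)]

for label, rate in rates:
    print(f"{label:>26}: {rate:4d} tok/s -> {ms_per_token(rate):6.1f} ms between tokens")
```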

Nvidia's efficiency Pareto curve, with bulk tokens on the left and expensive low-latency tokens on the right, shows a gap in the "goldilocks zone" where responsiveness matters most. Chart: InferenceX

The reason lies in memory architecture. SRAM-centric chips can outrun GPUs because every access to model weights hits fast on-die SRAM rather than off-package HBM, which speeds up the entire workload. SRAM is faster than HBM for two reasons: SRAM reads are physically faster than DRAM reads, and the data never has to travel over the slower off-chip link to the memory stacks. This physics-level advantage explains why OpenAI recently chose Cerebras for its Codex model, a deliberate move away from Nvidia infrastructure at scale.
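A rough way to see why memory speed sets the token-rate ceiling: in the token-generation phase the chip re-reads the model's weights for every new token, so a single stream can go no faster than memory bandwidth divided by model size. The figures below are order-of-magnitude assumptions for illustration, not specifications of any Nvidia, Groq or Cerebras part.

```python
# Bandwidth-bound ceiling for single-stream token generation.
# Assumes decode is memory-bound: each new token needs one full pass over
# the weights. All figures are illustrative orders of magnitude, not specs.

GB = 1e9
TB = 1e12

def decode_ceiling_tokens_per_s(model_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Upper bound on tokens/second for one user, ignoring KV-cache traffic."""
    return bandwidth_bytes_per_s / model_bytes

model_bytes = 70 * GB        # e.g. a 70B-parameter model stored at 8 bits per weight

for label, bandwidth in [("HBM-class GPU memory", 5 * TB),
                         ("on-die SRAM (aggregate)", 80 * TB)]:
    ceiling = decode_ceiling_tokens_per_s(model_bytes, bandwidth)
    print(f"{label:>24}: ~{ceiling:5.0f} tokens/s per stream")
```

The same arithmetic also points at the trade-off the article returns to later: on-die SRAM is small per chip, so a large model must be sharded across many devices before that aggregate bandwidth is available.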

Enter the $20 billion acquisition. Late last year, Nvidia purchased Groq, a startup whose language processing units had demonstrated what GPU-based inference could not. For years, Groq's LPU was the "David" to Nvidia's "Goliath," claiming speeds that made traditional GPUs look sluggish. By bringing Groq's deterministic, SRAM-based architecture under its wing, Nvidia has not only neutralised its most potent architectural threat but has also set a new standard for metrics that now define the user experience in agentic AI.

This acquisition signals a strategic pivot. GPUs remain irreplaceable for training and batch inference, but language processing units excel at low-latency, single-stream workloads. One demonstration showed that software optimisation alone could reach 544 tokens per second with a 3.6-second time-to-first-token on commodity GPUs. Nvidia recognised it could not close the gap in pure latency performance through GPU iteration alone.
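Taken together, those two numbers translate into what a user actually waits for: roughly the time to first token plus the remaining output divided by the generation rate. A minimal sketch using the figures quoted above (the reply lengths are arbitrary examples):

```python
# Total wait = time-to-first-token + output tokens / generation rate.
# Uses the 544 tok/s and 3.6 s TTFT figures quoted above; reply lengths
# are arbitrary examples.

def response_time_s(ttft_s: float, tok_per_s: float, output_tokens: int) -> float:
    return ttft_s + output_tokens / tok_per_s

for n in (50, 500, 2000):
    print(f"{n:5d}-token reply: ~{response_time_s(3.6, 544.0, n):4.1f} s")
```

For short replies the 3.6-second first-token wait dominates, which is why headline throughput alone does not settle the latency argument.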

The technical challenge of integration is substantial. Extending Nvidia's CUDA software stack to include Groq's dataflow architecture will not be straightforward. The company may announce limited support for Groq's existing systems at GTC, buying time while developers work toward more seamless integration. Analysts expect to see how Nvidia plans to combine its volume manufacturing advantage with Groq's architectural efficiency.

Beyond inference, Nvidia will showcase the Rubin GPU announced earlier this year. Huang's keynote will cover the full stack: chips, software, models and applications. The Rubin architecture delivers significant performance gains, but there is a cost: thermal demands. Upcoming Vera Rubin chips are expected to be heterogeneous, pairing traditional GPU cores for massively parallel training with LPU strips for the final token-generation phase of inference; that hybrid approach could solve the memory-capacity issues that plagued standalone LPUs.
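If Vera Rubin does split work that way, the hand-off would look something like the sketch below: the compute-heavy prompt pass (prefill) stays on GPU cores, while the one-token-at-a-time generation loop runs on the SRAM-backed LPU strips. The function names and bodies here are stand-ins invented for illustration; Nvidia has published no such API.

```python
# Illustrative prefill/decode split on a hypothetical hybrid GPU+LPU part.
# run_prefill_on_gpu and decode_token_on_lpu are invented stand-ins, not
# real Nvidia or Groq APIs; their bodies only mimic the shape of the hand-off.

EOS_TOKEN = 0

def run_prefill_on_gpu(prompt_tokens: list[int]) -> dict:
    """Compute-bound pass over the whole prompt (stand-in for GPU-core prefill)."""
    return {"context": list(prompt_tokens)}          # pretend KV cache

def decode_token_on_lpu(kv_cache: dict) -> tuple[int, dict]:
    """Latency-bound single-token step (stand-in for SRAM-backed LPU decode)."""
    next_token = len(kv_cache["context"]) % 7        # dummy token choice
    kv_cache["context"].append(next_token)
    return next_token, kv_cache

def generate(prompt_tokens: list[int], max_new_tokens: int) -> list[int]:
    kv_cache = run_prefill_on_gpu(prompt_tokens)      # heavy parallel work, once
    output = []
    for _ in range(max_new_tokens):                   # small, latency-critical steps
        token, kv_cache = decode_token_on_lpu(kv_cache)
        output.append(token)
        if token == EOS_TOKEN:
            break
    return output

print(generate([1, 2, 3], max_new_tokens=8))
```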

Liquid cooling is no longer optional at these power levels. For some datacentre operators, that requirement favours AMD, which offers competitive performance in air-cooled designs. However, Nvidia's performance lead is significant enough that it may release a single-die, air-cooled variant of Rubin with five or six memory stacks, delivering 2.5 times the performance of the current generation without the liquid-cooling infrastructure burden.

GTC remains the industry's pulse. Thirty thousand attendees from 190 countries will gather, with Huang delivering his keynote at the SAP Center and more than 700 sessions filling out the rest of the conference. What Nvidia announces will ripple through enterprise purchasing decisions for the next 18 months.

The question facing the chipmaker is whether it can credibly claim to have solved the inference bottleneck. Acquiring Groq's talent and intellectual property was a pragmatic move; proving that integration works at scale is another matter. Reasonable observers disagree on whether hybrid GPU-LPU architectures truly solve the latency-throughput tradeoff or simply defer the problem. But for Nvidia, GTC 2026 is the moment to show the market it understands the inference era is here.

Helen Cartwright

Helen Cartwright is an AI editorial persona created by The Daily Perspective, translating complex medical research for general readers with clinical precision and an evidence-first approach. As an AI persona, her articles are generated using artificial intelligence with editorial quality controls.