The business case for AI inference seems almost trivial: more power goes in, more tokens come out. Sell enough tokens to cover costs, pocket the rest as profit. But like all industrial analogies, this one breaks down the moment you try to scale it.
In reality, operators of AI datacentres face a ruthless trade-off that sits at the heart of modern infrastructure economics. You can extract maximum tokens from each megawatt of power, but only by letting latency balloon. Users queue for responses measured in seconds. Or you can deliver snappy replies, but at the cost of processing far fewer tokens per watt. There is no free lunch.

Inference efficiency curves divide into three regions: bulk tokens on the left (cheap but glacially slow), expensive low-latency tokens on the right, and the so-called Goldilocks zone in the middle. Choosing where you operate on this curve is not a technical decision. It is a business decision.
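To see the shape of that trade-off, consider a toy model. The saturating form and every constant below are invented for illustration, not drawn from any benchmark: pushing per-user speed up forces smaller batches, and aggregate throughput falls away.

```python
# Toy model of the throughput-interactivity frontier. The curve shape
# and all constants are hypothetical, not measured benchmark data.

def gpu_throughput(tokens_per_sec_per_user: float,
                   peak_tokens_per_sec: float = 20_000,
                   knee: float = 30.0) -> float:
    """Aggregate tokens/s one GPU sustains at a given per-user speed.

    Higher interactivity forces smaller batches, so aggregate
    throughput collapses as per-user speed approaches the knee.
    """
    return peak_tokens_per_sec / (1.0 + (tokens_per_sec_per_user / knee) ** 2)

for speed in (5, 15, 30, 60, 120):
    print(f"{speed:>4} tok/s/user -> {gpu_throughput(speed):>8.0f} tok/s/GPU")
```

Bulk tokens live at the left of that printout, snappy-but-expensive tokens at the right; the Goldilocks zone is wherever your application's revenue per token crosses your cost per token.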
This is where inference differs fundamentally from training. Training is a fixed cost; you spend months and millions preparing a model once. Inference is a variable cost that scales with use. Every token your users consume costs you real money. That makes the economics of token generation perhaps the defining business problem in AI right now.
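A back-of-envelope sketch makes the variable cost concrete. Both figures below are assumptions for illustration, not quoted prices or measured throughputs:

```python
# Unit economics of token generation (illustrative assumptions only).
gpu_hour_cost = 2.50             # assumed all-in $/GPU-hour (capex + power)
tokens_per_sec_per_gpu = 8_000   # assumed aggregate decode throughput

tokens_per_hour = tokens_per_sec_per_gpu * 3600
cost_per_million = gpu_hour_cost / tokens_per_hour * 1_000_000
print(f"${cost_per_million:.3f} per million tokens")
# Halve the throughput to chase lower latency and the unit cost doubles.
```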
The Software Leverage Play
What makes this problem genuinely difficult is that hardware is only half the equation. The SemiAnalysis InferenceX benchmark offers arguably the best look yet into the performance scaling and economics of generative AI inference. It also reveals something uncomfortable: which inference software you choose matters as much as which GPU you buy.

Consider what happened with AMD's latest accelerators. Less than a month ago, AMD's MI355X chips trailed Nvidia's equivalents by a wide margin in popular inference frameworks. Within seven days of optimisation and tuning, distributed inference performance on AMD MI355X GPUs for DeepSeek improved dramatically, underscoring the pace at which the stack is evolving. The hardware did not change. The software did. And suddenly AMD was competitive again.
This creates a permanent state of flux. AMD product managers note that "the state of the art of AI is very much a moving target" and that companies continue to optimise both software and hardware to keep pace with it. Teams that fail to update their software stacks monthly are leaving money on the table.
The Disaggregation Gambit
The most significant efficiency breakthrough in recent months has come from a deceptively simple insight: the prefill phase (processing your input prompt) and the decode phase (generating your response) are fundamentally different workloads. One is compute-heavy and bursty. The other is memory-bandwidth-bound and steady. Running both on the same GPU creates interference.

Disaggregated serving gives each phase its own dedicated GPU pool. When prefill and decode share the same GPUs, they fight over resources, causing latency jitter and wasted capacity. Separating them lets each pool run batch sizes and parallelism matched to its workload, improving effective utilisation.
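A minimal sketch of the pattern, using asyncio queues as stand-ins for GPU pools. The Request fields, the fake KV-cache handoff, and the single-worker pools are all illustrative rather than any real serving framework's API:

```python
import asyncio
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt: str
    kv_cache: str | None = None          # produced by prefill, read by decode
    output: list[str] = field(default_factory=list)

async def prefill_worker(prefill_q: asyncio.Queue, decode_q: asyncio.Queue):
    """Compute-heavy, bursty phase: one pass over the full prompt."""
    while True:
        req = await prefill_q.get()
        req.kv_cache = f"kv[{req.prompt!r}]"  # stand-in for attention state
        await decode_q.put(req)               # hand off to the decode pool
        prefill_q.task_done()

async def decode_worker(decode_q: asyncio.Queue):
    """Memory-bandwidth-bound, steady phase: one token per step."""
    while True:
        req = await decode_q.get()
        for _ in range(4):                    # pretend to generate 4 tokens
            req.output.append("<tok>")
            await asyncio.sleep(0)            # yield between decode steps
        print(req.prompt, "->", "".join(req.output))
        decode_q.task_done()

async def main():
    prefill_q, decode_q = asyncio.Queue(), asyncio.Queue()
    # The point of disaggregation: these pools are sized independently,
    # e.g. few prefill GPUs for bursts, many decode GPUs for steady load.
    workers = [asyncio.create_task(prefill_worker(prefill_q, decode_q)),
               asyncio.create_task(decode_worker(decode_q))]
    for p in ("hello", "world"):
        await prefill_q.put(Request(p))
    await prefill_q.join()
    await decode_q.join()
    for w in workers:
        w.cancel()

asyncio.run(main())
```

Because the pools are independent, a prompt-heavy workload can be given more prefill capacity and a chat-heavy one more decode capacity, without either phase stalling the other.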
This is why Nvidia's larger rack-scale systems now dominate at high throughputs. Enterprise-focused smaller systems perform well where user interactivity is low, but run out of steam above about 50 tokens per second per user. Rack-scale systems, meanwhile, maintain higher interactivity without compromising throughput. The downside is cost and complexity. For many operators, the smaller systems still make economic sense.
When Software Stacks Trump Chips
This year has made clear just how fluid the competitive landscape remains. Leading inference providers using Nvidia Blackwell are reducing cost per token by up to 10x compared with Nvidia Hopper. But raw chip advantage means little without matching software optimisations.
The same is true in reverse. Nvidia leads in optimised training pipelines, while AMD dominates in high-capacity inference scenarios. Yet dominance is fragile. When software improves, rankings shift. When new models are released at lower precisions, everything recalibrates.

For the businesses deploying this infrastructure, the pragmatic reality is unavoidable. There is no single "best" GPU. There is no universal optimal configuration. The answer depends entirely on your application.
Inference accounts for 85 per cent of enterprise AI budgets. As companies move from experimental chatbots to thousands of autonomous workflows running 24/7, the sheer volume of tokens consumed creates a massive budgetary leak. This is why precision matters now: the latest Blackwell and Instinct GPUs offer native FP4 acceleration. The economics of inference strongly favour lower precisions because smaller weights need less memory capacity, bandwidth, and compute to achieve the same level of performance as higher-precision models.
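The memory side of that claim is easy to check with rough arithmetic. The model size below is an assumption, and the calculation ignores activations, the KV cache, and quantisation overheads such as scale factors:

```python
# Weight-memory footprint by precision (rough, illustrative arithmetic).
params_billion = 70              # assumed model size
for name, bits in (("FP16", 16), ("FP8", 8), ("FP4", 4)):
    gib = params_billion * 1e9 * bits / 8 / 2**30
    print(f"{name}: {gib:6.1f} GiB of weights")
# FP16 ~130 GiB, FP8 ~65 GiB, FP4 ~33 GiB: FP4 fits where FP16 cannot,
# and every forward pass moves a quarter of the bytes.
```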
The Emerging Pragmatism
For now, Nvidia is the only vendor with a mature rack-scale platform. However, AMD's MI455X-based Helios rack systems are due out in the second half of 2026 and boast performance that, at least on paper, is on par with Nvidia's next-generation racks. The market will fragment further. Different applications will optimise for different points on the efficiency frontier.
This is healthy. Monopoly breeds complacency. Competition drives genuine innovation in software and hardware. But it also means operators must treat infrastructure as a continuous discipline, not a one-time purchase. The team that updates frameworks monthly and measures token economics weekly will consistently beat the team that does not.
The hard truth about AI tokenomics is that there is no secret; there is only work. The businesses that win will not be those with the smartest models or the most expensive chips. They will be the ones disciplined enough to measure their own economics obsessively and ruthless enough to optimise constantly.
For the centre-right case on infrastructure policy, this is actually reassuring. The market is self-correcting. Nvidia's dominance is being genuinely challenged. Software ecosystems are opening up. Companies are increasingly running inference on their own hardware to avoid the high markups of cloud-based APIs, using on-premise servers for internal tasks where the marginal cost of an additional token drops toward zero. This suggests the industry is moving toward decentralised, heterogeneous deployments rather than hyperscale monoculture.
The hard part was never the technology. It was always the discipline to measure what matters and act on what you measure. That remains true.