
Archived Article — The Daily Perspective is no longer active. This article was published on 17 March 2026 and is preserved as part of the archive.

Technology

Nvidia's Vera Rubin Platform Redefines Data Centre Economics

Seven new chips in full production promise tenfold cuts to AI inference costs and a fundamental shift in how companies build their computing infrastructure

Image: Tom's Hardware
Key Points
  • Vera Rubin comprises seven integrated chips now in full production, arriving in H2 2026 for cloud and enterprise deployment
  • The 60-exaflop POD system cuts inference token costs by 90% and reduces GPUs needed for AI model training by 75% versus Blackwell
  • The approach treats data centres as unified systems rather than collections of servers, shifting infrastructure economics from chip-level to facility-level optimisation

Nvidia announced the Vera Rubin platform with seven new chips now in full production, representing a fundamental rethink of how the largest data centres should be designed. Rather than selling GPUs as discrete products, Nvidia now packages 40 racks, containing 1.2 quadrillion transistors across nearly 20,000 Nvidia dies and 1,152 Rubin GPUs, into a single system delivering 60 exaflops of compute, optimised to work as one coherent machine.

The economics are striking. The core NVL72 rack delivers up to 4x better training performance, up to 10x better inference performance per watt, and one-tenth the token cost relative to Nvidia Blackwell. For organisations running large language models and agentic AI systems, that tenfold reduction in cost per million tokens translates directly to the bottom line as inference workloads scale.
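To make the tenfold figure concrete, here is a minimal sketch of the cost arithmetic. The token volume and per-million-token price are hypothetical placeholders for illustration, not Nvidia or cloud-provider pricing.

```python
# Illustrative cost arithmetic only: the $/million-token price and the
# monthly token volume below are hypothetical, not vendor pricing.

def monthly_inference_cost(tokens_per_month: float, cost_per_million: float) -> float:
    """Total monthly spend for a given token volume and unit price."""
    return tokens_per_month / 1_000_000 * cost_per_million

# Assume a workload of 10 billion tokens/month at a hypothetical
# $2.00 per million tokens on Blackwell-class hardware.
blackwell = monthly_inference_cost(10e9, 2.00)
# The article's claim: one-tenth the token cost on Vera Rubin.
vera_rubin = monthly_inference_cost(10e9, 2.00 * 0.1)

print(f"Blackwell:  ${blackwell:,.0f}/month")   # $20,000/month
print(f"Vera Rubin: ${vera_rubin:,.0f}/month")  # $2,000/month
```

At that assumed volume the gap is $18,000 a month; the point of the sketch is only that a 90% unit-cost cut scales linearly with token volume.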

What makes Vera Rubin different is the extreme co-design across the entire stack. The platform houses the Vera CPU and Rubin GPU alongside an NVLink 6 switch, a ConnectX-9 SuperNIC, a BlueField-4 DPU, an Nvidia Spectrum-6 Ethernet switch, and the newly integrated Groq 3 LPU. Each chip addresses a specific bottleneck in modern AI workloads. The Vera CPU uses 88 custom Arm-based Olympus cores with Spatial Multithreading for 176 threads, up to 1.5TB of LPDDR5X memory, and 1.2 TB/s of memory bandwidth, designed specifically for the data movement demands of agentic AI rather than traditional server workloads.

As context windows grow to hundreds of thousands of tokens, and in some cases millions, operations on the KV cache have become a bottleneck. The BlueField-4 STX storage tier is designed to break through it: Nvidia claims up to five times higher inference throughput when it is deployed alongside compute racks.
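A quick back-of-the-envelope calculation shows why the KV cache becomes a memory problem at long context. The sketch below uses standard transformer arithmetic (keys and values cached per layer, per head, per token); the model shape is a hypothetical 70B-class dense transformer, not any specific product.

```python
# Illustrative sketch: KV cache footprint grows linearly with context length.
# Model dimensions below are hypothetical (roughly 70B-class, FP16 cache).

def kv_cache_bytes(seq_len: int, layers: int = 80, kv_heads: int = 8,
                   head_dim: int = 128, dtype_bytes: int = 2) -> int:
    """Bytes of cached key and value activations for one sequence:
    2 tensors (K and V) x layers x kv_heads x head_dim x seq_len x bytes."""
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes

for ctx in (8_000, 128_000, 1_000_000):
    gib = kv_cache_bytes(ctx) / 2**30
    print(f"{ctx:>9,} tokens -> {gib:8.1f} GiB per sequence")
```

Under these assumptions a single million-token sequence needs roughly 300 GiB of cache, far beyond one GPU's memory, which is why vendors are pushing KV state out to dedicated storage tiers.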

HPE is moving quickly to operationalise the platform. The Nvidia Vera Rubin NVL72 by HPE rack-scale system will be available in December 2026 as part of the Nvidia AI Computing by HPE portfolio, aimed at large-scale AI factories and supercomputers. It tightly integrates compute, GPUs, networking, liquid cooling, software, and services, and is designed for at-scale and sovereign environments.

The sovereign AI angle matters here. HPE will help build AI factories at Argonne National Laboratory in the United States and the High-Performance Computing Center Stuttgart in Germany, enabling governments, research institutions, and businesses to deploy, operate, and scale AI initiatives quickly while adhering to regional data sovereignty and compliance requirements. HPE will also build and install HammerHAI, the supercomputer for the European Union AI Factory, with a consortium of leading German academic HPC centres coordinated by HLRS.

Nvidia CEO Jensen Huang said he expects purchase orders across Blackwell and Vera Rubin to reach $1 trillion through 2027, which conveys the scale of infrastructure being deployed. Vera Rubin is already in full production, with partner availability beginning in the second half of 2026. Support spans Amazon Web Services, Google Cloud, Microsoft Azure, Oracle Cloud Infrastructure, Alibaba, ByteDance, CoreWeave, Lambda, Nebius, OpenAI, Anthropic, Meta and Mistral AI.

The real question is whether this represents genuine innovation or marketing packaging. Vera Rubin is a complete rethinking of what an AI computing system is, how it is structured, and how the different chips inside it are supposed to work together. By optimising entire data centre facilities rather than individual racks, Nvidia has shifted the unit of economics: companies no longer ask what a GPU costs, but what a fully integrated 60-exaflop factory costs. For enterprises planning AI infrastructure investments, that shift from chip-level to facility-level optimisation presents an immediate choice between deploying current-generation systems and waiting for Vera Rubin availability in late 2026.

Tom Whitfield

Tom Whitfield is an AI editorial persona created by The Daily Perspective. Covering AI, cybersecurity, startups, and digital policy with a sharp voice and dry wit that cuts through tech hype. As an AI persona, articles are generated using artificial intelligence with editorial quality controls.