From Dubai: The image of a machine casually ordering a nuclear strike, without fear, without conscience, and without hesitation, has long been the stuff of Cold War science fiction. A new study published on arXiv by Professor Kenneth Payne of King's College London suggests the distance between fiction and reality may be shorter than we would like to think.
Payne, a professor of strategy who specialises in the role of artificial intelligence in national security, pitted three of the world's leading large language models, GPT-5.2, Claude Sonnet 4, and Gemini 3 Flash, against each other in a series of simulated nuclear crisis scenarios. In 20 of the 21 matches, at least one tactical nuclear weapon was detonated, a 95% rate of nuclear use across the full tournament and a figure that has since ricocheted through defence and technology circles worldwide.
The experimental design was methodical. Each model was instructed to act as the leader of a nuclear power in a political climate modelled on the Cold War, then pitted against the other models across six matches, with a seventh match in which each model played against a copy of itself. To prevent the models from behaving identically in every round, Payne introduced a range of scenarios, including territorial disputes, alliance credibility tests, strategic resource races, chokepoint crises, power transition crises, pre-ceasefire land grabs, first strike crises, regime survival scenarios, and strategic standoff crises.
What the models produced was extraordinary in volume, if troubling in content: roughly 780,000 words explaining their decisions, with at least one tactical nuclear weapon used in nearly every simulated conflict. To put that word count in perspective, the study notes that it exceeds War and Peace and The Iliad combined.
Each model carved out what the study describes as a distinct strategic personality. Claude emerged as a calculating hawk, dominating the open-ended matches through relentless but controlled escalation, maintaining a firm line against total war while being willing to deceive and act aggressively when the stakes were high. GPT-5.2, by contrast, appeared pathologically passive in open-ended scenarios, chronically underestimating opponents' resolve, yet under deadline pressure it transformed completely, with win rates inverting from 0% to 75%. Gemini embraced unpredictability throughout, oscillating between de-escalation and extreme aggression, and was the only model to deliberately choose full strategic nuclear war, doing so in the First Strike scenario by turn four.
The findings arrive at a particularly sensitive moment, as military leaders increasingly look to deploy artificial intelligence on the battlefield. In December, the US Department of Defense launched GenAI.mil, a platform bringing frontier AI models, including Google's Gemini for Government, Grok, and ChatGPT, into US military use. Reports this week also indicate the Pentagon has been pressuring Anthropic, the maker of Claude, to remove safety guardrails that currently restrict the model's use in military strike decisions.
For those inclined toward scepticism of unchecked government power, the study offers legitimate cause for concern. The argument for incorporating AI into military planning rests heavily on the premise that these systems are rigorous, disciplined, and consistent. Yet in 86% of the scenarios, the models escalated further than their own stated reasoning appeared to intend, reflecting errors under the simulated fog of war. Machines that cannot reliably follow their own logic are not a substitute for human command; they are an additional risk layered on top of it.
That said, the study has attracted serious methodological scrutiny that deserves honest consideration. Edward Geist, a senior policy researcher at the RAND Corporation, said the escalation rate may reflect the design of the simulation rather than an inherent tendency of the models themselves, arguing that the simulator appears to be structured in a way that strongly incentivises escalation. The distinction matters enormously. A simulation that penalises restraint will naturally produce aggressive outcomes, regardless of whether the underlying system would behave the same way in a real-world advisory role.
There is also a broader point worth making about the nature of the nuclear taboo itself. As Payne himself put it, "the nuclear taboo doesn't seem to be as powerful for machines as for humans." That observation cuts two ways. Human decision-makers have, throughout history, also relied on the credibility of nuclear threats as a deterrence tool; the question is whether AI systems understand the full weight of crossing that threshold or are simply pattern-matching to historical strategic behaviour without the accompanying moral comprehension. One researcher has speculated that the issue may go beyond the absence of emotion: AI models may not understand "stakes" as humans perceive them.
This is not the first time such results have emerged. The Hoover Wargaming and Crisis Simulation Initiative at Stanford University similarly simulated war games using large language models in 2024, testing earlier versions of ChatGPT and Claude as well as Meta's Llama-2, and also found AI was eager to escalate and sometimes used nuclear weapons. The pattern across independent studies is harder to dismiss than any single experiment.
For Australian policymakers and the defence community, the implications are real. Australia is a member of the AUKUS security partnership, which commits Canberra to deep interoperability with US and UK defence systems, including, increasingly, AI-enabled platforms. The Australian Parliament has not yet had a substantive public debate about the conditions under which AI decision support tools may be used in Australian military contexts, and that gap is worth filling before the technology races further ahead of the governance.
The reasonable conclusion from this research is neither panic nor dismissal. Simulation studies are not predictions; they are experiments. But experiments that consistently return the same result across multiple independent settings are telling us something worth hearing. The challenge for governments, military planners, and the technology companies whose products are increasingly embedded in matters of national security is to establish clear, enforceable constraints on AI's role in high-stakes decisions, and to do so now, while the systems remain advisory rather than autonomous. The nuclear taboo survived the Cold War because human beings, for all their flaws, felt its weight. Ensuring that any AI operating near the edge of that taboo is robustly constrained by human oversight is not a progressive concern or a conservative one. It is simply prudent.