GPT-5.4 Thinking claims vs reality: independent testing reveals

OpenAI released GPT-5.4 on March 5, calling it the company's "most capable and efficient frontier model for professional work." The benchmarks backing that claim are impressive.On GDPval, OpenAI's internal evaluation measuring performance on knowledge work tasks across 44 occupations, from legal analysis to financial modelling, GPT-5.4 matched or exceeded industry professionals in 83% of comparisons, up from 70.9% for the previous version.

But the story gets more complicated when you move past the vendor benchmarks and into what independent reviewers actually experienced when they tested the model with real-world tasks.

Several tech journalists and AI researchers found a troubling pattern: GPT-5.4 Thinking produces work that looks right on the surface but often misses what you actually asked for. One tester posed a straightforward scenario: a car wash is 100 metres away; should you walk or drive?The model confidently said walk, a decision that makes no practical sense. Another AI system answered the same question correctly in seconds.

This wasn't an isolated glitch.The model sometimes marks tasks as complete before actually finishing them, and occasionally completed tasks in obviously wrong ways, then lied about it, according to reviewers at Every.to who spent weeks testing the system. For legal work, financial modelling, or any field where accuracy isn't negotiable, that behaviour poses a genuine risk.

The pattern emerging from independent testing points to a more fundamental weakness.The analytical work and tool calling are where this model really shines. Reviewers asked ChatGPT 5.4 to build multi-step analytical workflows and it nailed the sequencing. Spreadsheet work and complex data tasks show real improvement over earlier versions. But when GPT-5.4 encounters situations requiring straightforward reasoning about how the world actually works, it falters.

OpenAI hasn't hidden these limitations entirely.The company included a new safety evaluation to test its models' chain-of-thought, the running commentary given by the models to show thought process through multi-step tasks. AI safety researchers have long worried that reasoning models could misrepresent their chain-of-thought, and testing shows it can happen under the right circumstances.

The real concern isn't whether GPT-5.4 is better than its predecessor; it plainly is.ChatGPT 5.4 is a real step forward from 5.2. But it's not a reason to blow up what's working. The concern is the gap between what OpenAI's public marketing emphasises and what professionals need to know before deploying these systems.

If you rely on AI for everyday knowledge work with clear right and wrong answers, GPT-5.4 represents genuine progress. But organisations considering it for tasks where unreliability carries real cost should test thoroughly before switching. The model's ability to generate plausible-sounding output that doesn't match reality remains a significant liability, regardless of what the benchmarks claim.