
Archived Article — The Daily Perspective is no longer active. This article was published on 17 March 2026 and is preserved as part of the archive.

Technology

Mistral's cheap proof-checker could reshape how we verify AI code

Open-source tool tackles the human bottleneck in AI-generated software verification

Image: The Register
Key Points
  • Mistral released Leanstral, an open-source AI agent for formal code verification using Lean 4, beating larger models while costing significantly less.
  • Formal verification mathematically proves code correctness rather than relying on testing; critical for high-stakes domains where bugs are unacceptable.
  • Leanstral achieves similar performance to Claude Sonnet at 93% lower cost, making verification accessible beyond enterprises with deep pockets.
  • The tool addresses a real bottleneck: human review of machine-generated code is labour-intensive and slows down development in critical systems.

When artificial intelligence generates code, someone still has to check whether that code actually works. This bottleneck is real, measurable, and increasingly expensive as organisations push AI agents into higher-stakes domains.

On 16 March 2026, Mistral AI released Leanstral, the first open-source AI agent built specifically for formal verification in Lean 4. The move signals a shift in how the industry thinks about AI safety: not through human eyeballs, but through mathematical proof.

As AI code agents push into high-stakes domains, human review becomes the scaling bottleneck, with time and specialised expertise required to manually verify machine-generated outputs. Traditional testing finds bugs that happen to trigger in specific cases. Formal verification proves something stronger: that entire categories of failures cannot occur, regardless of input.

The economics matter. Leanstral at pass@2 reaches a score of 26.3, beating Claude Sonnet by 2.6 points while costing only $36 to run, compared to Sonnet's $549. Claude Opus 4.6 still leads in absolute quality with a score of 39.6, but at $1,650 per task it is 46 times more expensive than Leanstral. The gap between best-in-class and affordable has historically been vast; Leanstral narrows it.

The 120B-parameter model runs with just 6B active parameters and ships under the Apache 2.0 licence, making production-grade theorem proving accessible without enterprise budgets. Open licensing matters more than it appears: proprietary verification systems create vendor dependency, while open tools can be audited and owned by their users.

How formal proof differs from testing

96% of developers distrust AI code, yet only 48% verify it, creating a "verification debt" bottleneck as 42% of production code is now AI-generated. The mistrust is warranted. Testing can prove that code works on the test cases you wrote. Formal verification proves that it cannot fail in ways that violate the specification.

Every theorem or program written in Lean 4 must pass strict type-checking by Lean's trusted kernel, yielding a binary verdict with no room for ambiguity: a property is proven true or it fails. No guessing. No "probably works".
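A minimal sketch of the contrast in Lean 4 (our own illustration, not from the Leanstral release; `append_length` is a hypothetical name):

```lean
-- Testing checks the cases you happened to write:
#eval ([1, 2] ++ [3]).length  -- 3

-- A theorem covers every possible input. Lean's kernel either accepts
-- the proof or rejects it; there is no middle ground.
theorem append_length (xs ys : List Nat) :
    (xs ++ ys).length = xs.length + ys.length := by
  simp
```

If the statement were false, no amount of tactic cleverness would get it past the kernel.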

The practical payoff is material. AWS has used formal methods for critical systems since 2011, including verifying its Cedar authorisation policy engine with Lean. The company has found that formally verified code is often more performant than unverified code, because bug fixes made during verification frequently improve runtime characteristics.

Real-world usage validates the approach. Fields Medalist Maryna Viazovska's proof that the E8 lattice is the optimal sphere packing in eight dimensions has been formally verified in Lean, with an AI agent completing the final steps. This is not academic theatre. This is frontier mathematics.

The cost-benefit trade-off

Nothing about formal verification is cheap in labour. Writing formal specifications takes time, often more time than writing the code itself. The payoff is eliminating whole classes of debugging and providing machine-checked guarantees; for high-stakes code such as security primitives, financial transactions, and safety-critical systems, that trade is worth making.

The honest limitation: formal verification will not scale to every line of code in every application. A CRUD business application does not need mathematical certainty; a cryptocurrency smart contract, a medical device, or a power grid does. The question is where the stakes justify the cost.

Formal verification has traditionally required either expensive auditing firms or deep in-house expertise, but an open-source agent that can prove code correctness at $36-290 per task could reshape how protocols approach security, assuming the proofs hold up under production conditions. That caveat matters; tooling is still maturing.

The wider ecosystem move

Mistral is not alone in betting on verified AI. Axiom Quant raised $200 million in Series A funding to scale formal verification and bring AI-generated code verification to every company using AI, valuing the company at $1.6 billion. Serious capital is flowing into this space.

AWS uses Lean and verification-guided development to formally verify Cedar, its authorisation policy language. The process involves creating executable models in Lean, proving security properties about them, and validating the models against production code through differential testing, achieving high assurance with minimal runtime overhead. This is enterprise-scale deployment, not research.

Leanstral is benchmarked on completing all formal proofs and correctly defining new mathematical concepts in the FLT project, rather than on isolated mathematical problems, signalling a shift toward realistic evaluation. Earlier benchmarks tested toy problems. This one tests production-like repositories.

The accountability question

As AI systems take on greater responsibility in code generation, the burden of verification cannot rest on human reviewers alone. Mistral envisions a generation of coding agents that both carry out their tasks and formally prove their implementations against strict specifications; instead of debugging machine-generated logic, humans specify what they want.

This shifts accountability. When AI generates code with a formal proof of correctness, failure means either the specification was wrong (a human decision) or the prover itself malfunctioned (rare, and catchable). Without formal proof, failure is ambiguous: the code was probably inadequate, but nobody is certain.
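A toy sketch of the "specification was wrong" failure mode in Lean 4 (our own illustration, with hypothetical names `sorted` and `fakeSort`, not from Mistral's materials):

```lean
-- An under-specified contract: "the result is sorted in non-decreasing
-- order", with no requirement that it contain the input's elements.
def sorted : List Nat → Bool
  | x :: y :: rest => Nat.ble x y && sorted (y :: rest)
  | _ => true

-- An implementation that games the weak spec by returning nothing.
def fakeSort (_ : List Nat) : List Nat := []

-- The proof checks. The prover did its job; the specification is at fault.
theorem fakeSort_sorted (xs : List Nat) : sorted (fakeSort xs) = true := rfl
```

The proof is genuinely valid, which is exactly the point: a passing formal verdict pins any remaining failure on the human-written specification, not on the machine.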

For institutions handling sensitive domains, ambiguity is a liability. Mistral's Leanstral offers a path away from it. Whether the market will adopt formal verification at scale remains uncertain; the economics now favour it.

James Callahan

James Callahan is an AI editorial persona created by The Daily Perspective. Reporting from conflict zones and diplomatic capitals with vivid, immersive storytelling that puts the reader on the ground. As an AI persona, articles are generated using artificial intelligence with editorial quality controls.