Water startup builds AI system after $200k loss to bad model advice

When Waterline Development tried to use large language models for materials science research, the models were "confidently wrong in ways that cost us months," according to founder Derek Bednarski. The water desalination startup discovered this the hard way while designing a breakthrough desalination cell that uses electrochemical technology to remove salt and other ions.

The company was building a product it called a "water battery." While choosing between carbon cloth and cast carbon electrodes, the team relied on academic papers and AI systems like Grok and ChatGPT to validate its findings. It selected carbon cloth partly because the material appeared frequently in academic papers, including a Stanford dissertation the team had used as the basis for its initial prototypes.

The choice proved catastrophic at scale. The material had conductivity issues, retained water in ways that affected ion removal, and degraded faster than the alternative. The company spent four months and $200,000 validating that carbon cloth would not work past pilot scale, only to discover that cast carbon electrodes would have been superior from the start.

[Chart: Rozum's performance on the Humanity's Last Exam benchmark shows improvement over leading frontier models across multiple scientific domains.]

The problem was that commercial AI models are ill-suited to multidisciplinary research. "No single AI model does this reliably," the company found. "Frontier language models hallucinate under extended multi-step reasoning. They produce plausible answers that silently break when a problem crosses domain boundaries. At best this wastes time; at worst, it poisons critical decision making."

Rather than abandon AI entirely, Waterline created Rozum, a multi-model reasoning system that operates various AI models in parallel and synthesizes their answers through a verification layer. It is a model orchestration system that operates at inference time. The name comes from the Slavic word for "reason."

The verification layer cross-checks claims across models before generating a final response. It flagged unsupported claims in 76.2 percent of individual frontier model responses and caught source errors in 21.3 percent. In testing on 1,000 PhD-level benchmark questions, only 5.5 percent produced clean consensus across all models.
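Waterline has not published how Rozum works internally, so the following is only a minimal illustrative sketch of the general idea described above: query several models in parallel, count how many of them support each claim, and flag claims that lack support. The three model functions, the `cross_check` helper, the `min_support` threshold, and the placeholder claims are all hypothetical and stand in for real model calls.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for real model calls. Each returns the set of
# claims it makes in answer to the question (placeholder strings here).
def model_a(question):
    return {"claim A", "claim B"}

def model_b(question):
    return {"claim A", "claim B"}

def model_c(question):
    return {"claim A", "claim C"}

def cross_check(question, models, min_support=2):
    """Run all models in parallel and sort claims by cross-model support."""
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        answers = list(pool.map(lambda model: model(question), models))

    # Count how many models made each claim.
    support = Counter(claim for answer in answers for claim in answer)

    verified = sorted(c for c, n in support.items() if n >= min_support)
    flagged = sorted(c for c, n in support.items() if n < min_support)

    # "Clean consensus" here means every model returned the same claims.
    consensus = all(answer == answers[0] for answer in answers)
    return {"verified": verified, "flagged": flagged, "consensus": consensus}

result = cross_check(
    "Which electrode material scales better?",
    [model_a, model_b, model_c],
)
```

In this toy run, two claims are supported by at least two models, one claim appears in a single response and is flagged, and the question does not count as clean consensus. A production system would need a far more careful notion of claim matching and source checking than exact string comparison.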

The system is not designed for speed. Rozum can spend minutes or even hours working on a response, far longer than commercial AI models require, so it is not well suited to real-time conversations, high-volume commodity queries, or tasks where current frontier models perform adequately. But Bednarski argues the trade-off is worthwhile for specific applications. Early customers are using Rozum for high-stakes questions and decisions, such as a $3 million solar investment or the allocation of months of engineering time to one R&D priority over another.

Rather than trying to integrate domain-specific tools or to make the work of human expert teams more efficient, Waterline created Rozum to let engineers, scientists, and analysts do their jobs better. Every query runs across multiple frontier models in parallel, with outputs evaluated, cross-checked, and verified before synthesis into a final response.

Rozum launched in March 2026 and is currently available through a limited early access programme. The company is headquartered in San Mateo, California.