Britannica sues OpenAI for copyright infringement in ChatGPT training

Encyclopedia Britannica and its Merriam-Webster subsidiary have filed a lawsuit against OpenAI in federal court in Manhattan, accusing the company of using their reference materials without permission to train artificial intelligence systems including ChatGPT. The complaint, filed Friday, marks the latest and perhaps most symbolically significant challenge to OpenAI's training practices. That the guardians of 250 years of accumulated reference material are now in court speaks to a fundamental tension in the AI industry: who owns knowledge, and who has the right to profit from it.

According to the filing, OpenAI reproduced nearly 100,000 Britannica articles during the training process. The publishers allege that ChatGPT has been trained on and continues to reproduce their copyrighted content without authorization, to the material detriment of both. The complaint further claims that by presenting AI-generated responses, which may contain inaccuracies or hallucinations, alongside Britannica's and Merriam-Webster's famous trademarks and brand identities, OpenAI misleads users into believing that Britannica or Merriam-Webster has endorsed or is the source of those responses.

The commercial injury alleged goes beyond copyright. Britannica's business today is primarily digital, built on subscriptions and advertising revenue that depend on web traffic. When ChatGPT answers a user's question about, say, the causes of the French Revolution or the properties of a chemical element using content sourced from Britannica's articles, that user has less reason to visit Britannica's website. This logic mirrors complaints from news publishers: if the AI can answer the question, why click through to the original source?

The case hinges on a legal question that US courts have not yet resolved. OpenAI and other AI developers have maintained that training models on large collections of publicly available text qualifies as fair use, arguing that the technology transforms existing material into new outputs rather than reproducing it directly. An OpenAI spokesperson on Monday said ChatGPT's language models "are trained on publicly available data and grounded in fair use." But Britannica disputes this framing fundamentally. Encyclopedia Britannica asserts in the complaint that OpenAI's "misuse of plaintiffs' copyrighted works is also not transformative," arguing that "ChatGPT copies the expression, meaning and message of copyrighted content, including that of plaintiffs, and repackages it to the consumer."

The Britannica case is part of a growing wave of copyright disputes between publishers and artificial intelligence developers over how training data is collected and used. Authors, news organizations, and other content owners have filed similar claims in recent months, arguing that AI companies built their systems on copyrighted material without obtaining permission. Last year Britannica filed a separate lawsuit against the startup Perplexity AI, which is still pending. More broadly, OpenAI is already the subject of a large multidistrict litigation in the Southern District of New York, currently overseen by Judge Sidney Stein, that consolidates more than a dozen copyright lawsuits brought by news publishers including the New York Times.

The licensing landscape complicates matters. Encyclopedia Britannica claims it reached out to OpenAI to discuss potential licensing opportunities, including an initial discussion in November 2024 that went nowhere. According to the complaint, an OpenAI representative rebuffed the publishers' licensing outreach after that discussion, and OpenAI never seriously pursued licensing their content. Instead, the publishers allege, OpenAI continued to copy their content without compensation even as it entered into licensing deals with similar publishers.

Reasonable people disagree on whether this represents theft of intellectual property or a necessary component of AI development. The tension reflects deeper questions: Should AI training count as fair use, or should creators and publishers have the right to demand payment? If AI cannot be trained without licensing vast libraries of content, does that make the technology economically unviable, or simply more honest? No strong legal precedent yet establishes whether using copyrighted content to train a large language model constitutes copyright infringement. The courts will decide, and the outcome will reshape how the AI industry operates.