In 2023, OpenAI told the British Parliament that it would be “impossible” to train leading AI models without using copyrighted material. It’s a popular view in the AI world, where OpenAI and other leading players have used material hoovered up online to train the models that power chatbots and image generators, sparking a wave of lawsuits alleging copyright infringement.
Two announcements on Wednesday provide evidence that large language models can actually be trained without the permissionless use of copyrighted material.
A group of researchers backed by the French government has released what is reportedly the largest AI training dataset composed entirely of public domain text. And the nonprofit Fairly Trained announced that it has awarded its first certification to a large language model built without copyright infringement, showing that technology like that behind ChatGPT can be built in a different way than the AI industry’s controversial standard approach.
“There is no fundamental reason why someone couldn’t train an LLM honestly,” says Ed Newton-Rex, CEO of Fairly Trained. He founded the nonprofit in January 2024 after leaving his leadership role at image generation startup Stability AI because he disagreed with its policy of scraping content without permission.
Fairly Trained offers a certification to companies that want to prove they have trained their AI models on data they own, license, or that is in the public domain. When the nonprofit launched, some critics pointed out that no large language model meeting these requirements had yet been identified.
Today, Fairly Trained announced that it has certified its first large language model. It’s called KL3M and was developed by Chicago-based legal technology consulting startup 273 Ventures, using a curated training dataset of legal, financial, and regulatory documents.
The company’s co-founder Jillian Bommarito says the decision to train KL3M this way came from the company’s “risk-averse” customers, such as law firms. “They are concerned about the provenance, and they need to know that the output is not based on contaminated data,” she says. “We don’t rely on fair use.” These customers were interested in using generative AI for tasks like summarizing legal documents and drafting contracts, but did not want to be dragged into intellectual property lawsuits, as OpenAI, Stability AI, and others have been.
Bommarito says 273 Ventures had not worked on a large language model before but decided to train one as an experiment. Building it was “our test to see if it was even possible,” she says. The company created its own training dataset, the Kelvin Legal DataPack, which contains thousands of legal documents that have been reviewed for compliance with copyright law.
Although the dataset is small (about 350 billion tokens, or units of data) compared to those compiled by OpenAI and others that have scraped the internet en masse, Bommarito says the KL3M model performed much better than expected, something she attributes to how carefully the data had been vetted beforehand. “Having clean, high-quality data can mean you don’t have to make the model as large,” she says. Careful curation can also help specialize a finished AI model for the task it was designed for. 273 Ventures is now offering waitlist spots to customers who want to purchase access to this data.
Clean slate
Companies looking to emulate KL3M may get more help in the future in the form of freely available, infringement-free datasets. On Wednesday, researchers released the largest available AI dataset for language models consisting exclusively of public domain content. Common Corpus, as it is called, is a collection of text roughly the same size as the data used to train OpenAI’s GPT-3 text generation model, and it has been posted to the open source AI platform Hugging Face.
The dataset is built from sources such as public domain newspapers digitized by the US Library of Congress and the National Library of France. Pierre-Carl Langlais, project coordinator for Common Corpus, calls it a “corpus large enough to train a state-of-the-art LLM.” In AI jargon, the dataset contains 500 billion tokens; OpenAI’s most capable model is widely believed to have been trained on several trillion.