Harvard University announced Thursday that it will release a high-quality dataset of nearly a million public domain books that anyone can use to train large language models and other artificial intelligence tools. The dataset was created by the newly formed Harvard Institutional Data Initiative with funding from Microsoft and OpenAI. It contains books scanned as part of the Google Books project that are no longer protected by copyright.
About five times the size of the famous Books3 dataset that was used to train AI models like Meta’s Llama, the Institutional Data Initiative database spans genres, decades and languages, with classics by Shakespeare, Charles Dickens and Dante included alongside obscure Czech mathematics textbooks and pocket Welsh dictionaries. Greg Leppert, executive director of the Institutional Data Initiative, says the project is an attempt to “level the playing field” by giving the general public, including small players in the AI industry and individual researchers, access to the kind of highly refined and curated content repositories that typically only established tech giants have the resources to assemble. “It’s gone through a rigorous review,” he says.
Leppert believes the new public domain database could be used alongside other licensed materials to build artificial intelligence models. “I think about it a little bit like how Linux has become a core operating system for much of the world,” he says, noting that companies would still need to use additional training data to differentiate their models from those of their competitors.
Burton Davis, vice president and deputy general counsel for intellectual property at Microsoft, emphasized that the company’s support for the project was in line with its broader beliefs about the value of creating “accessible data sets” that are “managed in the public interest” for use by AI startups. In other words, Microsoft doesn’t necessarily plan to replace all the AI training data it has used in its own models with public domain alternatives like the books in Harvard’s new database. “We use publicly available data to train our models,” says Davis.
As dozens of lawsuits filed over the use of copyrighted data to train AI make their way through the courts, the future of how AI tools are built is at stake. If AI companies win their cases, they will be able to continue scraping the internet without needing to enter into licensing agreements with copyright holders. But if they lose, AI companies could be forced to overhaul the way their models are made. A wave of projects like the Harvard database is advancing on the assumption that, no matter what, there will be an appetite for public domain datasets.
In addition to the trove of books, the Institutional Data Initiative is also working with the Boston Public Library to scan millions of newspaper articles that are now in the public domain, and says it is open to forming similar collaborations in the future. Exactly how the books dataset will be released has not been determined. The Institutional Data Initiative has asked Google to collaborate on public distribution, but the search giant has not yet publicly agreed to host it, although Harvard says it is optimistic it will do so. (Google did not respond to WIRED’s requests for comment.)