The first wave of major generative AI tools was largely trained on “publicly available” data: basically, anything that could be scraped from the internet. Now, sources of training data are increasingly restricting access and pushing for licensing deals. As the search for additional data sources has intensified, new licensing companies have emerged to keep the source material flowing.
The Dataset Providers Alliance (DPA), a trade group formed this summer, wants to make the AI industry more standardized and fair. To that end, it has just published a position paper outlining its stances on major AI-related issues. The alliance is made up of seven AI licensing companies, including the music copyright management firm Rightsify, the Japanese stock photo marketplace Pixta, and the generative AI copyright licensing startup Calliope Networks. (At least five new members will be announced in the fall.)
The DPA advocates for an opt-in system, meaning that data can be used only after creators and rights holders give explicit consent. This represents a significant departure from the way most major AI companies operate. Some have developed their own opt-out systems, which put the burden on data owners to pull their work on a case-by-case basis. Others offer no opt-outs at all.
The DPA, which expects members to respect its opt-in rule, sees this route as far more ethical. “Artists and creators should be involved,” says Alex Bestall, chief executive of Rightsify and of the music data licensing company World Copyright Exchange, who led the initiative. Bestall sees opt-in as both a pragmatic and a moral approach: “Selling publicly available data sets is a way to get sued and not have credibility.”
Ed Newton-Rex, a former AI executive who now runs the ethical AI nonprofit Fairly Trained, says opt-outs are “fundamentally unfair to creators,” adding that some may not even know when opt-outs are offered. “It’s particularly good to see the DPA requiring opt-ins,” he says.
Shayne Longpre, the leader of the Data Provenance Initiative, a volunteer collective that audits AI datasets, finds the DPA’s efforts to source data ethically admirable, though he suspects the opt-in standard could be a tough sell, given the sheer volume of data most current AI models require. “Under this regime, you’re either going to run out of data or you’re going to have to pay a lot,” he says. “It might be that only a few players, big tech companies, can afford to license all that data.”
In the position paper, the DPA opposes government-mandated compulsory licensing and advocates a “free market” approach in which data creators and AI companies negotiate directly. Other guidelines are more granular. For example, the alliance suggests five possible compensation structures to ensure that creators and rights holders are adequately paid for their data. These include a subscription-based model, a “usage-based license” (where fees are paid per use), and an “outcome-based” license, where royalties are tied to profits. “These could work for anything from music to images, movies, TV or books,” Bestall says.