Posted: 26/10/2023
The two-year partnership between the Associated Press (AP) and OpenAI, the developer of ChatGPT, has the potential to influence the way developers access and exploit valuable data sets to train AI. Whilst some of the key details of the deal, including the commercial terms, have not been disclosed, what has been shared is that the AP has agreed to license some of its text archive to OpenAI. Rights holders, particularly in the publishing industry, have previously expressed their frustration that developers are unfairly exploiting copyright protected works, but this deal could signal a move by the industry to take back some control over the use of such content.
The AP has agreed to grant OpenAI access to its text archive dating back to 1985, and OpenAI intends to use this licensed data set in training its AI systems. As previously discussed, developers require substantial quantities of high-quality data to train AI systems. Questions have been raised by rightsholders as to how such data is obtained and used in training AI, as it appears that developers are not following rightsholders’ traditional commercial model of fee-paying licences for commercial use.
Through this deal with OpenAI, the AP appears to be taking the matter into its own hands by re-examining its commercial approach and acknowledging that a balance must be struck between the rights of publishers, and other copyright holders, and the need for AI developers to access huge data sets.
Interestingly, in exchange, the AP will be granted access to OpenAI’s technology and product expertise. The AP does leverage some AI in its existing operations, such as automated story previewing and recapping at some sporting events, as well as using AI to assist in the transcription of audio and video from live events. The AP also established its Local News AI project, which helps local news providers integrate artificial intelligence. The company has not yet revealed how it will integrate OpenAI’s technology into its news operations, but the agreement suggests that it is considering not only how it can leverage the technology, but also how the copyright protected work it produces can be utilised in the necessary training of AI.
The AP has joined OpenAI’s growing list of partners. On 11 July 2023, OpenAI announced a six-year deal with Shutterstock whereby Shutterstock will license images, videos, music and metadata on its platform to OpenAI in order to train its text-to-image model, DALL-E.
The UK has not yet moved to regulate the specific use of copyright protected content in training AI solutions, preferring to rely on existing legislation. The government had suggested that the UK’s existing intellectual property regime was generally sufficient to regulate the use of copyright protected material, but then proposed that the existing text and data mining exception should be extended to allow text and data mining for commercial uses, such as in the training of AI systems for commercial applications.
This proposal caused concern within the publishing industry, with fears that copyright protected content could be used by AI developers without restriction, or any obligation to pay financial remuneration to the organisations who owned the copyright protected work. The proposal would have been welcomed by AI developers as it would have allowed their commercial AI solutions to be trained on information that was scraped from copyright protected source material without the need to agree licences with the rights holders.
However, the government has since pulled back from this proposal following opposition from publishers and other producers of such content. It is currently in discussions with stakeholders on each side with the aim of establishing a new voluntary regime which would assist AI system development, whilst also protecting the rights of copyright holders.
The EU’s proposed AI Act would impose disclosure requirements on AI developers, requiring them to improve transparency around the source and nature of the data used to train AI. This disclosure requirement is in turn likely to mean that AI developers need to enter into many more agreements like the one with the AP. Such agreements would give copyright owners the ability to control access to their works and the commercial terms on which that access is granted.
In contrast to the AP deal mentioned above, it has recently been reported that Immediate Media, which licenses the BBC trademark for use on the BBC Good Food website, has blocked ChatGPT from scraping recipes from its website in order to prevent ‘robot chef’ chatbots replacing traditional recipe and cookbook publishers. This protective approach is markedly different from the deal between the AP and OpenAI, but BBC Good Food is not alone in taking practical steps to prohibit such activity by AI developers, with the New York Times, CNN, Amazon and Disney reportedly also taking steps to block ChatGPT from being able to scrape data from their websites.
If OpenAI continues to access this data despite the rights holders prohibiting ChatGPT’s use of their content, it is likely to be infringing upon the intellectual property rights of the companies or individuals that produce such content, as the AI system will be accessing copyright protected works without permission and without a licence to do so.
As one of the first partnerships of this kind, the deal between OpenAI and the AP may have set a precedent for content producers and AI developers working together in this way. Businesses that produce the kinds of copyright protected works that are being used to train AI could consider whether a similar commercial model would be useful in leveraging their own intellectual property, and whether such an agreement may be appropriate in the (current) absence of any specific legislation to protect such rightsholders.
Alternatively, businesses that do not agree to offer licences to AI developers may, at least in the short term, restrict those developers from using their content by following a similar approach to that of Immediate Media and taking practical steps to prohibit such data harvesting.
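By way of illustration only, one commonly used practical step is a robots.txt file, which tells named web crawlers which parts of a site they may visit. The sketch below is not a description of how Immediate Media or any other publisher mentioned above has actually configured its site. It assumes that the crawler identifies itself with the user-agent token "GPTBot", and it uses a hypothetical site (example.com) and a hypothetical helper function (is_allowed) to check the rules with Python's standard library. Such rules rely on the crawler choosing to honour them, so they are a technical signal rather than an enforcement mechanism.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: refuse the crawler that identifies itself
# as "GPTBot" (an assumption for this sketch) and allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/recipes/1"  # hypothetical URL
    print("GPTBot allowed:", is_allowed("GPTBot", page))
    print("Other crawler allowed:", is_allowed("SomeOtherCrawler", page))
```

A publisher that later agrees a licence with an AI developer could remove or narrow the relevant rule, so a restriction of this kind need not rule out a commercial arrangement further down the line.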