OpenAI’s legal dispute with The New York Times over data used to train its AI models may still be brewing, but the company is moving ahead with deals with other publishers, including some of the largest news publishers in France and Spain.
OpenAI recently announced deals with Le Monde and Prisa Media to bring French and Spanish news content to its ChatGPT chatbot. According to an OpenAI blog post, the partnerships will put the publishers’ current events coverage, from brands including El País, Cinco Días, As, and El Huffpost, in front of ChatGPT users, and will also contribute to OpenAI’s training data.
OpenAI stated:
ChatGPT users will soon have access to relevant news content from these publishers through concise summaries, attributed links, and the option to explore additional information or related articles from the news sources. We are committed to improving ChatGPT and supporting the news industry’s role in delivering timely, reliable information to users.
So far, OpenAI has disclosed licensing agreements with several content providers, including:
- Stock media library Shutterstock (for images, videos, and music training data)
- The Associated Press
- Axel Springer (owner of Politico and Business Insider, among others)
- Le Monde
- Prisa Media
OpenAI has not publicly disclosed the financial terms of these deals, but based on earlier reporting from The Information, the company may be paying between $4 million and $20 million a year for news content.
For OpenAI, which has more than $11 billion in funds and annual revenue surpassing $2 billion (according to the Financial Times), sums of that size are easily affordable. For AI competitors seeking similar licensing agreements, however, they could be prohibitive, as Hunter Walk, a partner at Homebrew and co-founder of Screendoor, has pointed out.
Walk expressed his concerns on his blog:
If the cost of experimentation is constrained by significant licensing deals, it could impede innovation. The substantial payments made to data providers are creating entry barriers for potential competitors. By setting high costs, companies like Google and OpenAI may hinder future competition.
Whether or not these licensing agreements create entry barriers for new AI players, many AI vendors have chosen not to license the data they use to train their models, accepting the legal risk instead. For instance, the art-generating platform Midjourney reportedly trains on Disney movie stills without a formal agreement with Disney.
The deeper question is whether licensing should simply be a cost of doing business and experimenting in AI. Walk suggests a regulatory “safe harbor” that would protect AI vendors, startups, and researchers from legal liability, provided they meet certain ethical and transparency standards.
Interestingly, the U.K. recently proposed exempting text and data mining for AI training from copyright restrictions when it is done for research purposes, although the effort ultimately fell through.
Amid the ongoing debate, the challenge is to strike a balance between compensating publishers fairly and giving AI challengers and academics, not just major incumbents, access to training data. Possible solutions include grants or increased venture capital support in this space.
The legal landscape remains uncertain, particularly regarding how far fair use protects AI vendors against copyright claims. Until that is resolved, there is a risk that a few dominant companies will end up controlling vast pools of crucial training data, exacerbating the talent drain from academia and limiting opportunities for innovation.