'Impossible' to train AI without copyrighted content says OpenAI

ChatGPT developer OpenAI has told the UK parliament that it is impossible to train its generative artificial intelligence (GenAI) services without access to copyrighted work.

The company, along with backer Microsoft, is facing a lawsuit from the New York Times, which has accused the AI tech company of “unlawful use” of its work to create its products.

Now, in a submission to the House of Lords’ communications and digital select committee, the company appears to be angling for a relaxation of copyright laws.

The submission, first reported by the Telegraph, states: “Because copyright today covers virtually every sort of human expression – including blog posts, photographs, forum posts, scraps of software code, and government documents – it would be impossible to train today’s leading AI models without using copyrighted materials.

“Limiting training data to public domain books and drawings created more than a century ago might yield an interesting experiment, but would not provide AI systems that meet the needs of today’s citizens.”

In a separate blog post published to its website on Monday, OpenAI responded to the lawsuit, saying: “We support journalism, partner with news organisations, and believe the New York Times lawsuit is without merit.”

In addition to the NYT suit, a group of authors including Game of Thrones writer George RR Martin are suing OpenAI for what they describe as “systematic theft on a mass scale”.

OpenAI has previously argued that while it respects content creators and owners, it subscribes to the doctrine of “fair use” and believes that “legally, copyright law does not forbid training”.

The blurred lines between GenAI training, copyright and plagiarism are increasingly central to the conversation around the technology.

A spreadsheet naming thousands of artists whose work has allegedly been used to train the technology of image generation company Midjourney recently went viral. The list includes more than 4,700 artists whose works are said to have been ‘scraped’ to train the company’s tech, with thousands more listed under a ‘proposed additions’ tab.

The spreadsheet quickly spread across social media during the holiday period. One notable poster was Jon Lam, a senior storyboard artist at League of Legends owner Riot Games, who posted screenshots from Discord in which Midjourney developers, in his words, discuss “laundering” and creating a database from which they can train the software.

One of the messages reads: “All you have to do is just use those scraped datasets and then conveniently forget what you used to train the model. Boom legal problems solved forever.”


