'Impossible' to train AI without copyrighted content says OpenAI

ChatGPT developer OpenAI has told the UK parliament that it is impossible to train its generative artificial intelligence (GenAI) services without access to copyrighted work.

The company, along with backer Microsoft, is facing a lawsuit from the New York Times, which has accused the AI tech company of “unlawful use” of its work to create its products.

Now in a submission to the House of Lords’ communications and digital select committee, the company appears to be angling for a relaxation of copyright laws.

The submission, first reported by the Telegraph, states: “Because copyright today covers virtually every sort of human expression – including blog posts, photographs, forum posts, scraps of software code, and government documents – it would be impossible to train today’s leading AI models without using copyrighted materials.

“Limiting training data to public domain books and drawings created more than a century ago might yield an interesting experiment, but would not provide AI systems that meet the needs of today’s citizens.”

In a separate blog post published to its website on Monday, OpenAI responded to the lawsuit, saying: “We support journalism, partner with news organisations, and believe the New York Times lawsuit is without merit.”

In addition to the NYT suit, a group of authors including Game of Thrones writer George RR Martin are suing OpenAI for what they describe as “systematic theft on a mass scale”.

OpenAI has previously argued that while it respects content creators and owners, it subscribes to a doctrine of “fair use” and believes that “legally, copyright law does not forbid training”.

The blurred lines around copyright and plagiarism in GenAI training are increasingly central to the conversation around the technology.

A spreadsheet naming thousands of artists whose work has allegedly been used to train image generation company Midjourney’s tech recently went viral. The list includes the names of more than 4,700 artists whose works are said to have been ‘scraped’ to train the company’s tech, with thousands more listed under a ‘proposed additions’ tab.

The spreadsheet quickly spread across social media during the holiday period. One notable poster was Jon Lam, a senior storyboard artist at League of Legends-owner Riot Games, who posted screenshots from Discord where Midjourney developers, in his words, discuss “laundering” and creating a database from which they can train the software.

One of the messages reads: "All you have to do is just use those scraped datasets and then conveniently forget what you used to train the model. Boom legal problems solved forever."
