OpenAI: Copyrighted data ‘impossible’ to avoid for AI training

11.01.2024

OpenAI has drawn significant attention by telling a UK parliamentary committee that developing today’s leading AI systems, such as ChatGPT, would be “impossible” without using extensive copyrighted data. The company argues that training advanced AI tools requires such a breadth of data that avoiding copyrighted material is practically infeasible.

In its written testimony, OpenAI contends that because copyright law is so expansive and protected content so prevalent online, nearly every form of human expression would be off-limits as training data. Whether news articles, forum comments, or digital images, a substantial portion of online content cannot be used freely and legally.

OpenAI asserts that attempts to build effective AI while avoiding copyrighted material altogether would fail, stating that limiting training data to public domain books and century-old drawings would not yield systems that meet the needs of today’s citizens.

While OpenAI acknowledges the potential need for partnerships and compensation arrangements with publishers to support creators, there is no indication that the company intends to significantly restrict its collection of online data, including content behind paywalls, such as journalism and literature.

This stance has exposed OpenAI to multiple lawsuits, including copyright infringement claims from media outlets such as The New York Times. Despite these legal challenges, OpenAI seems reluctant to make fundamental changes to its data collection and training processes, arguing that self-imposed copyright limits would make its work “impossible.” Instead, the company aims to rely on a broad interpretation of fair use to legally draw on large amounts of copyrighted data.

As advanced AI continues to demonstrate remarkable abilities in emulating human expression, legal experts anticipate robust courtroom battles concerning infringement by systems inherently designed to absorb vast volumes of protected text, media, and other creative content. At present, OpenAI is placing its bet against copyright maximalists, favoring nearly boundless copying to propel ongoing AI development.