Two authors have filed a lawsuit against OpenAI, the company behind ChatGPT, alleging that their copyrighted books were used without permission to train the AI model. Mona Awad and Paul Tremblay claim that ChatGPT can generate 'very accurate summaries' of their works, indicating that the books were 'ingested' as training data.
The class action complaint, filed in a San Francisco federal court, argues that OpenAI unfairly profits from 'stolen writing and ideas'. The authors' lawyers, Joseph Saveri and Matthew Butterick, assert that books are ideal for training large language models due to their high-quality prose, but that OpenAI has acted as if copyright laws do not apply.
This is the first copyright lawsuit concerning ChatGPT, according to Andres Guadamuz, a reader in intellectual property law at the University of Sussex. He notes that the case will explore the uncertain legal boundaries of generative AI. However, proving financial losses from the use of copyrighted material may be challenging, as ChatGPT might perform similarly even if it had not been trained on these specific books.
The lawsuit also raises concerns about OpenAI's secrecy regarding its training data. The lawyers infer from the size of the dataset, estimated at 294,000 titles, that it was sourced from shadow libraries such as Library Genesis and Z-Library. The outcome may hinge on whether US courts consider training on copyrighted books to be 'fair use' or unauthorised copying.
In the UK, a similar case would be decided differently due to the absence of a 'fair use' defence. The UK government had proposed an exception for text and data mining, but it was abandoned after opposition from authors and publishers. The Society of Authors has welcomed the lawsuit, having long warned about the wholesale copying of works for AI training.



