Nearly 200,000 books are being used by some of the biggest companies in technology to train their generative AI models, according to a report by The Atlantic. Books by famous authors including J.K. Rowling, Amitav Ghosh, Rupi Kaur, and Neil Gaiman are part of a dataset of pirated books known as Books3. However, no one has told the authors.
The collection spans genres from erotic fiction to prose poetry. The report says that these books help generative AI systems learn how to communicate information.
A CNN report said that some AI training text can be pulled from articles posted on the internet. Books3 is already the subject of multiple lawsuits against Meta and other companies that have used the dataset to train AI.
Many authors took to social media to express their outrage and shared screenshots which showed that their copyrighted novels were part of the list.
‘Emergency Contact’ author Mary H.K. Choi took to social media after discovering her work was being used to train AI. “I’m completely gutted and whipsawed. I am outraged and at the same time feel utterly helpless,” she wrote. Ms Choi is a New York Times bestselling author.
Speaking to CNN, Ms Choi said, “A book encapsulates infinite choices, boundless permutations and even shortcomings of the author at the time. To think that all this life can be chucked into a vast churning pool to be extruded into a giant algorithmic, generative sausage machine reduces so much so swiftly.” She added, “Not just financially for the authors but it beggars booksellers, librarians, and readers from so many intimacies.”
Min Jin Lee, author of the novels “Pachinko” and “Free Food for Millionaires,” was also disappointed and called the use of her books “theft.”
“I spent three decades of my life to write my books,” she said. “The AI large language models did not ‘ingest’ or ‘scrape’ ‘data.’ AI companies stole my work, time, and creativity. They stole my stories. They stole a part of me.”
A spokesperson for Bloomberg told CNN that the company had used “a number of different data sources,” including Books3, to train its initial BloombergGPT model, an AI model for the financial industry. But, according to the spokesperson, Bloomberg will “not include the Books3 dataset among the data sources used to train future commercial versions of BloombergGPT.”
However, author James Chappel did not mind that his book was included in the dataset. “I want my book to (be) read!” he wrote. “I want it to educate!”