Accurate, Focused Research on Law, Technology and Knowledge Discovery Since 2002

These 183,000 Books Are Fueling the Biggest Fight in Publishing and Tech

The Atlantic – “Use our new search tool to see which authors have been used to train the machines. This summer, I acquired a data set of more than 191,000 books that were used without permission to train generative-AI systems by Meta, Bloomberg, and others. I wrote in The Atlantic about how the data set, known as “Books3,” was based on a collection of pirated ebooks, most of them published in the past 20 years. Since then, I’ve done a deep analysis of what’s actually in the data set, which is now at the center of several lawsuits brought against Meta by writers such as Sarah Silverman, Michael Chabon, and Paul Tremblay, who claim that its use in training generative AI amounts to copyright infringement.

Since my article appeared, I’ve heard from several authors wanting to know if their work is in Books3. In almost all cases, the answer has been yes. These authors spent years thinking, researching, imagining, and writing, and had no idea that their books were being used to train machines that could one day replace them. Meanwhile, the people building and training these machines stand to profit enormously.

Reached for comment, a spokesperson for Meta did not directly answer questions about the use of pirated books to train LLaMA, the company’s generative-AI product. Instead, she pointed me to a court filing from last week related to the Silverman lawsuit, in which lawyers for Meta argue that the case should be dismissed in part because neither the LLaMA model nor its outputs are “substantially similar” to the authors’ books. It may be beyond the scope of copyright law to address the harms being done to authors by generative AI, and the point remains that AI-training practices are secretive and fundamentally nonconsensual. Very few people understand exactly how these programs are developed, even as such initiatives threaten to upend the world as we know it.

Books are stored in Books3 as large, unlabeled blocks of text. To identify their authors and titles, I extracted ISBNs from these blocks of text and looked them up in a book database. Of the 191,000 titles I identified, 183,000 have associated author information. You can use the search tool below to look up authors in this subset and see which of their titles are included…”
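
For readers curious what "extracting ISBNs from blocks of text" might look like in practice, here is a minimal Python sketch. It is not the method used for the Atlantic analysis, which the excerpt does not describe in detail. The pattern `ISBN13_CANDIDATE` and the functions `is_valid_isbn13` and `extract_isbn13s` are hypothetical names introduced for illustration, and the sketch assumes ISBN-13s that begin with 978 or 979 and may contain hyphens or spaces. The lookup in a book database is left out, since the excerpt does not say which database was used.

```python
import re

# Assumption: an ISBN-13 candidate is "978" or "979" followed by ten more
# digits, each optionally preceded by a single hyphen or space.
ISBN13_CANDIDATE = re.compile(r"\b97[89](?:[- ]?\d){10}\b")


def is_valid_isbn13(digits: str) -> bool:
    """Check the ISBN-13 check digit (weights alternate 1, 3, 1, 3, ...)."""
    if len(digits) != 13 or not digits.isdigit():
        return False
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits))
    return total % 10 == 0


def extract_isbn13s(text: str) -> list[str]:
    """Return unique, checksum-valid ISBN-13s in order of first appearance."""
    found = []
    for match in ISBN13_CANDIDATE.finditer(text):
        # Strip separators so each ISBN is stored as 13 bare digits.
        digits = re.sub(r"[- ]", "", match.group())
        if is_valid_isbn13(digits) and digits not in found:
            found.append(digits)
    return found


# Usage: a made-up block of text with one valid and one invalid candidate.
sample = "Copyright page. ISBN 978-0-306-40615-7 (ebook). Bad: 978-0-306-40615-8."
print(extract_isbn13s(sample))
```

Validating the check digit filters out most 13-digit strings that merely resemble ISBNs. Each surviving value could then be used as a key when querying a book database for author and title.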
