Copyrighted books to train AI? Fair. Storing them? Not so much.

Simon Willison’s Weblog: “Anthropic wins a major fair use victory for AI — but it’s still in trouble for stealing books. Major USA legal news for the AI industry today. Judge William Alsup released a “summary judgement” (a legal decision that results in some parts of a case skipping a trial) in a lawsuit between five authors and Anthropic concerning the use of their books in training data. The judgement itself is a very readable 32 page PDF, and contains all sorts of interesting behind-the-scenes details about how Anthropic trained their models. The facts of the complaint go back to the very beginning of the company. Anthropic was founded by a group of ex-OpenAI researchers in February 2021. According to the judgement:

So, in January or February 2021, another Anthropic cofounder, Ben Mann, downloaded Books3, an online library of 196,640 books that he knew had been assembled from unauthorized copies of copyrighted books — that is, pirated. Anthropic’s next pirated acquisitions involved downloading distributed, reshared copies of other pirate libraries. In June 2021, Mann downloaded in this way at least five million copies of books from Library Genesis, or LibGen, which he knew had been pirated. And, in July 2022, Anthropic likewise downloaded at least two million copies of books from the Pirate Library Mirror, or PiLiMi, which Anthropic knew had been pirated.

Books3 was also listed as part of the training data for Meta’s LLaMA training data! Anthropic apparently used these sources of data to help build an internal “research library” of content that they then filtered and annotated and used in training runs. Books turned out to be a very valuable component of the “data mix” to train strong models. By 2024 Anthropic had a new approach to collecting them: purchase and scan millions of print books! …”

  • Authors Alliance: “Yesterday, Judge Alsup released his decision on Anthropic’s motion for summary judgment in the fast-moving lawsuit it is defending, brought by three book authors on behalf of a class of millions objecting to Anthropic’s use of books for training its LLMs. We’ve recently posted about other aspects of the case related to the class action aspects, which are still pending, and the potential for settlement in this suit. The decision represents a major win for Anthropic in that the decision found that its training AI on lawfully acquired copyrighted works was a fair use. Anthropic lost, however, on the issue of downloading pirated books to create a “central library” and more is still to come on the issue of Anthropic using those works for AI training.”
  • The Atlantic, September 25, 2023 [no paywall] – “These 183,000 Books Are Fueling the Biggest Fight in Publishing and Tech. Use our new search tool to see which authors have been used to train the machines. This searchable database is part of The Atlantic’s series on Books3. You can read about the origins of the database here, and an analysis of what’s in it here. This summer, I acquired a data set of more than 191,000 books that were used without permission to train generative-AI systems by Meta, Bloomberg, and others. I wrote in The Atlantic about how the data set, known as “Books3,” was based on a collection of pirated ebooks, most of them published in the past 20 years. Since then, I’ve done a deep analysis of what’s actually in the data set, which is now at the center of several lawsuits brought against Meta by writers such as Sarah Silverman, Michael Chabon, and Paul Tremblay, who claim that its use in training generative AI amounts to copyright infringement…”
Posted in: AI, Copyright, Courts, Internet, Knowledge Management, Legal Research, Libraries