Accurate, Focused Research on Law, Technology and Knowledge Discovery Since 2002

A.I. brings shadow libraries into the spotlight

The New York Times [free link; scroll down the page to see this text]: “Large language models, or L.L.M.s, the artificial intelligence systems that power tools like ChatGPT, are developed using enormous libraries of text. Books are considered especially useful training material, because they’re lengthy and (hopefully) well-written. But authors are starting to push back against their work being used this way. This week, more than 9,000 authors, including Margaret Atwood and James Patterson, called on tech executives to stop training their tools on writers’ work without compensation. That campaign has cast a spotlight on an arcane part of the internet: so-called shadow libraries, like Library Genesis, Z-Library or Bibliotik, that are obscure repositories storing millions of titles, in many cases without permission — and are often used as A.I. training data. A.I. companies have acknowledged in research papers that they rely on shadow libraries. OpenAI’s GPT-1 was trained on BookCorpus, which has over 7,000 unpublished titles scraped from the self-publishing platform Smashwords. To train GPT-3, OpenAI said that about 16 percent of the data it used came from two “internet-based books corpora” that it called “Books1” and “Books2.” According to a lawsuit by the comedian Sarah Silverman and two other authors against OpenAI, Books2 is most likely a “flagrantly illegal” shadow library. These sites have been under scrutiny for some time. The Authors Guild, which organized the authors’ open letter to tech executives, cited studies in 2016 and 2017 that suggested text piracy depressed legitimate book sales by as much as 14 percent. Efforts to shut down these sites have floundered. Last year, the F.B.I., with help from the Authors Guild, charged two people accused of running Z-Library with copyright infringement, fraud and money laundering. But afterward, some of these sites were moved to the dark web and torrent sites, making it harder to trace them.
And because many of these sites are run outside the United States and anonymously, actually punishing the operators is a tall task.”
