Accurate, Focused Research on Law, Technology and Knowledge Discovery Since 2002

Generative AI and intellectual property

Benedict Evans: “If you put all the world’s knowledge into an AI model and use it to make something new, who owns that and who gets paid? This is a completely new problem that we’ve been arguing about for 500 years. OpenAI is no longer open about exactly what it uses, but even if it isn’t training on pirated books, it certainly uses some of the ‘Common Crawl’, which is a sampling of a double-digit percentage of the entire web. So, your website might be in there. But the training data is not the model. LLMs are not databases. They deduce or infer patterns in language by seeing vast quantities of text created by people – we write things that contain logic and structure, and LLMs look at that and infer patterns from it, but they don’t keep it. So ChatGPT might have looked at a thousand stories from the New York Times, but it hasn’t kept them. Moreover, those thousand stories themselves are just a fraction of a fraction of a percent of all the training data. The purpose is not for the LLM to know the content of any given story or any given novel – the purpose is for it to see the patterns in the output of collective human intelligence.

That is, this is not Napster. OpenAI hasn’t ‘pirated’ your book or your story in the sense that we normally use that word, and it isn’t handing it out for free. Indeed, it doesn’t need that one novel in particular at all. In Tim O’Reilly’s great phrase, data isn’t oil; data is sand. It’s only valuable in the aggregate of billions, and your novel or song or article is just one grain of dust in the Great Pyramid. OpenAI could retrain ChatGPT without any newspapers, if it had to, and it might not matter – it might be less able to answer detailed questions about the best new coffee shops on the Upper East Side of Manhattan, but again, that was never the aim. This isn’t supposed to be an oracle or a database. Rather, it’s supposed to be inferring ‘intelligence’ (a placeholder word) from seeing as much as possible of how people talk, as a proxy for how they think…”
