OpenAI won’t say whose content trained its video tool. We found some clues.

Washington Post and free via MSN: “OpenAI’s video generation tool, Sora, can create high-definition clips of just about anything you could ask for, a breakthrough in artificial intelligence expected to transform the entertainment industry. But whose data OpenAI used to create its groundbreaking system is a mystery. With ChatGPT, OpenAI helped popularize the now-standard industry practice of building more capable AI tools by scraping vast quantities of text from the web without consent. With Sora, launched in December, OpenAI staff said they built a pioneering video generator by taking a similar approach. They developed ways to feed the system more online video — in more varied formats — including vertical videos and longer, higher-resolution clips. “You want to use all the data in its native format that exists,” Tim Brooks, the project’s then co-lead, said at an AI hackathon in April 2024. But OpenAI has not specified which videos it grabbed to make Sora, saying only that it combined “publicly available and licensed data.”

Posted in: AI, Copyright, E-Records, Internet, Knowledge Management