The Millions of Songs Mashed Into AI-Generated Music

The Atlantic [no paywall]: “…AI music generators can simulate human performances with surprising fidelity, but first they have to be trained on enormous quantities of those human performances. The actual recordings that go into any model are a closely guarded secret—AI companies have claimed they are proprietary—but the number of songs is almost certainly huge, spanning genres and time periods. As part of my series of investigations into AI training data, I recently discovered four giant datasets of songs that are being shared within the AI-development community. One has 12 million tracks. Another has 9 million. The two smaller datasets each have more than 100,000. They include hits from major pop artists such as Bad Bunny, Nirvana, Taylor Swift, Billie Eilish, Pearl Jam, Elvis Costello, Sheryl Crow, and the Beatles. (The New Radicals’ “You Get What You Give” is in two of the datasets.) Jazz artists such as Miles Davis, John Zorn, and Vijay Iyer are featured, as are classical composers and tens of thousands of minor artists across genres. The 12-million-track dataset, on its own, would take 91 years to listen to. You can search for an artist in the datasets [within the article]. These datasets are only four examples of the many sources available to AI developers. I found them by reading research papers published by developers and scouring AI data-sharing sites. The datasets have been downloaded thousands of times. Google has written about using one of them—more than 100,000 songs downloaded from the Free Music Archive, a site that allows free streaming for personal listening but requires payments for commercial use—to train AI models, and Stability has used some songs from the same dataset. But because of the industry’s secrecy around training data, we don’t currently know who has used the others…”
