Source: http://www.mashable.com
Mashable: A new investigation claims that tech companies used subtitles from more than 48,000 YouTube channels — including from top creators like MrBeast and Marques Brownlee and higher learning institutions like MIT and Harvard — to train their AI models, even though YouTube prohibits the harvesting of platform content without permission.
The investigation, conducted by Proof News and published in conjunction with Wired, found that companies like Anthropic, Nvidia, Apple, and Salesforce used a dataset of 173,536 YouTube videos, including those from Khan Academy, MIT, Harvard, The Wall Street Journal, NPR, the BBC and late-night shows like The Late Show With Stephen Colbert, Last Week Tonight With John Oliver, and Jimmy Kimmel Live.
Marques Brownlee posted an Instagram Reel noting that, in his opinion, "the real story is Apple and a whole bunch of other tech companies are training their AI models using data that they buy from third party data scraping companies some of which get their data in slightly illegal ways... Apple can technically say they're not at fault for this."
Wired says that representatives for EleutherAI, the nonprofit AI research lab that scraped and disseminated the YouTube dataset, did not respond to the publication's requests for comment. The dataset is part of a compilation the nonprofit calls The Pile, which also includes material from the European Parliament, English Wikipedia, and emails from employees of the Enron Corporation that were released during the federal investigation into the company in the early 2000s.
Wired reports that most of the collections that make up The Pile are accessible to "anyone on the internet with enough space and computing power to access them." Companies that have publicly acknowledged using The Pile to train AI models include Apple, Nvidia, Salesforce, Bloomberg and Databricks.
Jennifer Martinez, a spokesperson for AI startup Anthropic, said in a statement that while the company had used The Pile to train its generative AI assistant, "YouTube’s terms cover direct use of its platform, which is distinct from use of the Pile dataset. On the point about potential violations of YouTube’s terms of service, we’d have to refer you to the Pile authors."
In his Instagram Reel, Brownlee added, "The double whammy is that I actually pay for more accurate manual transcriptions on every video that we put out... so that means the stolen transcriptions specifically are paid content that's being stolen more than once."
His concerns echo those of creators around the world who worry that their work will be consumed or exploited by AI without compensation or permission. Many are currently suing tech companies over the unapproved use of their work.
Wired reports that The Pile is still available on file-sharing services but has been removed from its official download site. Proof News has created a tool to search for creators in the YouTube AI training dataset.