AI companies used YouTube videos without permission to train models

AI models trained on YouTube subtitles? No wonder they sometimes sound like they’re giving TED Talks while reviewing the latest video game.

An investigation by Proof News has found that several top AI companies, including Apple, Nvidia, and Anthropic, have used transcripts from thousands of YouTube videos to train their AI models without the creators’ knowledge or permission.

The investigation reveals that subtitles from 173,536 YouTube videos, spanning more than 48,000 channels, were used to build AI training datasets. Beyond the companies named above, Salesforce and Bloomberg also drew on these datasets, according to the investigation. Notable educational channels such as Khan Academy, MIT, and Harvard, along with major media outlets like The Wall Street Journal and NPR, were among the sources tapped.

YouTube giants were not spared either. Popular creators like MrBeast, Marques Brownlee, Jacksepticeye, and PewDiePie saw hundreds of their videos included in the dataset. “No one came to me and said, ‘We would like to use this,’” says David Pakman, host of “The David Pakman Show,” whose channel had nearly 160 videos swept into the dataset. Pakman, whose channel posts multiple videos daily, emphasizes, “This is my livelihood, and I put time, resources, money, and staff time into creating this content.”

The dataset in question, named YouTube Subtitles, was created by EleutherAI, an organization focused on lowering barriers to AI development. According to the research paper EleutherAI published in December 2020, YouTube Subtitles is part of a larger compilation called the Pile, which also draws on sources such as European Parliament proceedings and English Wikipedia. The dataset consists of the plain text of video subtitles, often accompanied by translations into multiple languages.
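For context on what that means in practice: the public Pile release is distributed as JSON Lines, where each record carries the raw text plus a meta field naming the source set it came from. Here is a minimal sketch of filtering one shard for YouTube Subtitles records; the shard filename and the exact set-name string are assumptions based on the public release, not details from the investigation.

```python
import json

# Minimal sketch: scan one Pile shard (JSON Lines format) and yield the
# text of records that came from the YouTube Subtitles set. The shard
# filename ("00.jsonl") and the set-name string "YoutubeSubtitles" are
# assumptions based on the public Pile release.

def youtube_subtitle_texts(path):
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            if record.get("meta", {}).get("pile_set_name") == "YoutubeSubtitles":
                yield record["text"]

# Print the opening of the first matching transcript.
for text in youtube_subtitle_texts("00.jsonl"):
    print(text[:200])
    break
```

Each record is simply a block of plain text, which is part of why creators' transcripts could be folded into training corpora wholesale, with no attribution attached.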

When questioned by Proof News about the ethical implications of using this data without permission, EleutherAI did not respond. Other companies implicated in the investigation have issued varied responses. Jennifer Martinez, a spokesperson for Anthropic, confirmed using the Pile in their AI models, stating, “YouTube’s terms cover direct use of its platform, which is distinct from the use of the Pile dataset.”

Salesforce also acknowledged using the Pile dataset for research purposes, with Caiming Xiong, their vice president of AI research, asserting that the dataset was "publicly available." However, public availability does not shield companies from ethical scrutiny. Many creators feel that using their content without permission is fundamentally unfair, especially when it is used to develop AI products that could compete with or replace their work.

Creators like Dave Wiskus, CEO of Nebula, a creator-owned streaming service, are vocal about their discontent. "It's theft," Wiskus states bluntly. He argues that using creators' work without their consent is disrespectful and harmful, especially if studios end up using generative AI to replace the artists themselves.

Earlier this year, The New York Times reported that Google, which owns YouTube, used videos from the platform to train its models. Similarly, OpenAI faced allegations of using YouTube videos without authorization to train its AI. These incidents indicate a broader trend in the AI industry, where the rush for high-quality training data often leads to ethically questionable practices.

While AI companies argue that their actions fall under “fair use” and aim to democratize access to AI technology, creators call for compensation and regulation. “If you’re profiting off of work that I’ve done [to build a product] that will put me out of work or people like me out of work, then there needs to be a conversation on the table about compensation or some kind of regulation,” says Dave Farina, host of “Professor Dave Explains.”

OpenAI CTO Mira Murati recently addressed the potential impact of AI on employment, predicting that AI could eliminate jobs, particularly those involving repetitive tasks. Murati emphasized the need for robust risk management and societal restructuring to ensure that the economic benefits of AI are widely distributed, floating ideas such as universal basic income. AI can boost productivity and innovation, but it also risks job displacement and widening inequality, which will demand careful planning and collaboration among AI developers, regulators, and society at large.

This investigation by Proof News is yet another call for more transparent regulations and better practices in the AI industry. As AI continues to evolve, it is crucial to address these ethical dilemmas so that the technology develops responsibly and benefits everyone involved, especially the people who create the content it learns from. Creators whose work has been used without consent deserve recognition and compensation, and AI companies must be held accountable for their data practices.

Posted by Alex Ivanovs

Alex is the lead editor at Stack Diary and covers stories on tech, artificial intelligence, security, privacy, and web development. He previously worked as a lead contributor to the Huffington Post's Code column.