
YouTube videos used to train AI models? Why creators should be concerned

Ever since reports surfaced suggesting that large tech companies are using YouTube content to train their AI models, the creator ecosystem has been on edge.

Interestingly, none of the tech giants like Google and OpenAI has acknowledged using YouTube content, but they have also stopped short of denying it outright. (Illustration created using DALL-E)

In June, Mustafa Suleyman, CEO of Microsoft’s new AI division, made a startling claim. He told CNBC’s Andrew Ross Sorkin that anything one publishes on the open internet becomes ‘freeware’ that can be copied and used to train AI models. In recent weeks, there has been significant scrutiny of, and reporting on, how generative AI companies may be pulling videos and transcripts from YouTube and using the work of independent creators to train their models. In July, 404 Media, an online publication, revealed that generative AI video company Runway had trained its models on thousands of YouTube videos without consent.

In the last few months, the use of YouTube content to train generative AI models has become a hotly debated matter within the creator community. It is a complex issue that touches on consent, compensation, and the rights of creators. In this article, we examine the controversy, what big tech companies have said, and how training AI models on YouTube content impacts creators.

Why is it a hotly contested issue among creators?

The realm of generative AI is evolving at a brisk pace, and to build more capable and efficient models, corporations need access to massive amounts of data. The concern rippling through the creator community is that their videos are being used to train these large AI models without their explicit permission.

Several investigative reports in recent times have suggested that AI companies have been harvesting large amounts of content from YouTube, including audio, visuals, and transcripts, to develop their proprietary models. Although none of the big tech companies has openly acknowledged this, the practice raises serious ethical, legal, and financial questions; many creators are uneasy and, in some cases, feel exploited. This month, YouTuber David Millette filed a lawsuit against chipmaker Nvidia, alleging that the company built a video model by scraping content from YouTube without any kind of authorisation from creators.

Similarly, an investigation by Proof News, a data-driven reporting and analysis portal, revealed in July that subtitles from 1,73,536 YouTube videos across more than 48,000 channels were used by tech giants like Nvidia, Apple, Anthropic, and Salesforce to train their models. According to the report, the dataset includes transcripts of educational videos from institutions and platforms such as Harvard, MIT, and Khan Academy. The portal has created a tool that lets content creators check if their work has been included in the YouTube AI training dataset. According to the report, videos of popular creators like Marques Brownlee, MrBeast, and PewDiePie were also used to train AI models.
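For readers curious about what such a check involves, the sketch below shows one minimal way a lookup like this could work. It assumes, purely for illustration, that the dataset’s channel list has been exported to a local CSV file named yt_dataset_channels.csv with a channel_name column; the file name, the column name, and the matching logic are all hypothetical and are not Proof News’s actual implementation.

# Hypothetical sketch, not Proof News's tool: check whether a channel name
# appears in a local CSV export of a training dataset's channel list.
import csv

def channel_in_dataset(channel_name, csv_path="yt_dataset_channels.csv"):
    """Return True if channel_name matches a row in the exported list."""
    query = channel_name.strip().lower()
    try:
        with open(csv_path, newline="", encoding="utf-8") as f:
            # Assumes the export has a header row with a "channel_name" column.
            for row in csv.DictReader(f):
                if row.get("channel_name", "").strip().lower() == query:
                    return True
    except FileNotFoundError:
        print(f"No export found at {csv_path}; obtain a copy first.")
    return False

if __name__ == "__main__":
    for name in ("Marques Brownlee", "MrBeast", "PewDiePie"):
        verdict = "appears" if channel_in_dataset(name) else "was not found"
        print(f"{name} {verdict} in the channel list.")

A real tool would need fuzzy matching and per-video granularity, since channel names are not unique; the point here is only that the underlying check is a straightforward dataset lookup.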

What is the main issue?

For many YouTubers, the primary concern is that their content is being used to train AI models without their explicit permission. In simple words, when a creator uploads a video to YouTube, they agree to the platform’s terms of service, which grant YouTube a broad licence to use the content: YouTube can reproduce, distribute, and even create derivative works from it. However, the terms nowhere mention that the content can also be used to train AI models, a use case that did not exist when they were originally drafted.

“By providing Content to the Service, you grant to YouTube a worldwide, non-exclusive, royalty-free, transferable, sublicensable license to use that Content (including to reproduce, distribute, prepare derivative works, display and perform it). YouTube may only use that Content in connection with the Service and YouTube’s (and its successors’ and Affiliates’) business, including for the purpose of promoting and redistributing part or all of the Service,” reads an excerpt from YouTube’s Terms of Service at present.


While the terms appear broad, they remain vague on this specific use, and the lack of clarity has unsettled many creators. Based on news reports and social media posts, many creators feel that if their content holds enough value to train AI models that cost billions, they should be paid accordingly. At a time when companies are signing mega deals for training data, small-time creators seem left out, receiving neither recognition nor reward for their content.

What do tech leaders say?

When asked whether YouTube content was being used to train OpenAI’s Sora and whether that would violate the platform’s policies, Neal Mohan, CEO of YouTube, responded that some creators’ contracts with the platform mean their content could be used.

“When a creator uploads their hard work to our platform, they have certain expectations. One of those expectations is that the terms of service are going to be abided by. Our terms of service does allow for YouTube content, some YouTube content like the title of a video or the channel name or the creator’s name, to be scraped because that’s how you enable the open web…But it does not allow for things like transcripts or video bits to be downloaded, and that is a clear violation of our terms of service,” Mohan told Emily Chang in an interview in May.

Similarly, Suleyman, in an interview with CNBC, said, “I think that with respect to content that is already on the open web, the social contract of that content since the ’90s has been that it is fair use. Anyone can copy it, recreate with it, reproduce with it—that has been free, as you like.” On the other hand, when OpenAI CTO Mira Murati was asked the same question in a Wall Street Journal interview in March, she responded with a confused expression. When pressed, she concluded by saying, “I’m not going to go into the details of the data that was used, but it was publicly available or licensed data.”


What is the legal standpoint?

When it comes to training AI models, the legal landscape is murky. Companies like Google may argue that their broad licences allow them to use YouTube content for AI training, but this remains legally unclear and debatable. Numerous lawsuits have already been filed challenging the legality of using copyrighted content to train AI without explicit permission from creators.

Beyond the legal issues, there are also ethical concerns: creators hold their work dear, and most may not be comfortable with their content being used in ways they never fathomed. To many, the idea of AI generating new content from original work without consent feels like a violation of their creativity and craft.

The rapid pace at which AI is advancing means an ever-growing need for massive datasets to power these models, which is bound to put creators in a tricky situation. If video-sharing platforms like YouTube allow content to be used for AI training without consent, individual creators may end up losing control over their work. This points to a broader issue: the lopsided nature of power between large corporations and individuals. Large tech companies can navigate legal complexities with ease; independent creators, by contrast, have fewer resources to protect their rights.

As the issue gains momentum, YouTube creators will have to stay informed and raise their concerns. They should collectively push for more transparency from platforms on how their content is being used, especially for training AI models. At present, Elon Musk’s Grok lets users opt out of having their interactions with the chatbot used for AI training. Similar opt-out options for YouTube creators would be a meaningful step towards transparency.

Bijin Jose, an Assistant Editor at Indian Express Online in New Delhi, is a technology journalist with a portfolio spanning various prestigious publications. Starting as a citizen journalist with The Times of India in 2013, he transitioned through roles at India Today Digital and The Economic Times, before finding his niche at The Indian Express. With a BA in English from Maharaja Sayajirao University, Vadodara, and an MA in English Literature, Bijin’s expertise extends from crime reporting to cultural features. With a keen interest in closely covering developments in artificial intelligence, Bijin provides nuanced perspectives on its implications for society and beyond.

Tags: artificial intelligence, YouTube