At Least 15 Million YouTube Videos Have Been Snatched by AI Companies

Editor’s note: This analysis is part of The Atlantic’s investigation into how YouTube videos are taken to train AI tools. You can use the search tool directly here to see whether videos you’ve created or watched are included in the data sets. This work is part of AI Watchdog, The Atlantic’s ongoing investigation into the generative-AI industry.

When Jon Peters uploaded his first video to YouTube in 2010, he had no idea where it would lead. He was a professional woodworker running a small business who decided to film himself making a dining table with some old legs he had found in a barn. It turned out that people liked his candid style, and as he posted more videos, a fan base began to grow. “All of a sudden there’s people who appreciate the work I’m doing,” he told me. “The comments were a motivator.” Fifteen years later, his channel has more than 1 million subscribers. Sometimes he gets photos of people in their shops, following his guidance from a big TV on the wall—most of his viewers, Peters told me, are woodworkers looking to him for instruction.

But Peters’s channel could soon be obsolete, along with millions of other videos created by people who share their expertise and advice on YouTube. Over the past few months, I’ve discovered more than 15.8 million videos from more than 2 million channels that tech companies have, without permission, downloaded to train AI products. Nearly 1 million of them, by my count, are how-to videos. You can find these videos in at least 13 different data sets distributed by AI developers at tech companies, universities, and research organizations, through websites such as Hugging Face, an online AI-development hub.

In most cases the videos are anonymized, meaning that titles and creator names are not included. I was able to identify the videos by extracting unique identifiers from the data sets and looking them up on YouTube—similar to the process I followed when I revealed the contents of the Books3, OpenSubtitles, and LibGen data sets. (A rough sketch of this lookup appears below.) You can search the data sets using the tool below, typing in channel names such as “MrBeast” or “James Charles.”

(A note for users: Just because a video appears in these data sets does not mean it was used for training by AI companies, which could choose to omit certain videos when developing their products.)
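For readers curious about the mechanics of that lookup: every YouTube video has an 11-character ID, and those IDs survive in the data sets even when titles and creator names are stripped out. What follows is a minimal sketch of the approach in Python. The record format shown is invented for illustration—each data set stores its identifiers differently—and any match is only a candidate until it is verified against YouTube itself.

import re

# YouTube video IDs are 11 characters drawn from letters, digits, "-", and "_".
VIDEO_ID_PATTERN = re.compile(r"\b[A-Za-z0-9_-]{11}\b")

def extract_watch_urls(record: str) -> list[str]:
    """Pull candidate video IDs from one data-set record and build watch URLs."""
    candidates = VIDEO_ID_PATTERN.findall(record)
    return [f"https://www.youtube.com/watch?v={vid}" for vid in candidates]

# A made-up record, loosely modeled on clip-level data sets:
# a clip label, a video ID, and start/end timestamps.
print(extract_watch_urls("clip_0042 aBcDeFgHiJk 00:01:15 00:01:22"))

Because the pattern can also match ordinary 11-character strings, each hit is only a candidate; the real test of an identifier is whether YouTube returns a video for it.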
To create AI products capable of generating video, developers need huge quantities of videos, and YouTube has become a common source. Although YouTube does offer paying subscribers the ability to download videos and watch them through the company’s app whenever they’d like, this is something different: Video files are being ripped from YouTube en masse and saved in data sets that are then fed to AI algorithms. This kind of downloading violates the platform’s terms of service, but many tools allow AI developers to download videos in this way. YouTube appears to have done little, if anything, to stop the mass downloading, and the company did not respond to my request for comment.

Not all YouTube videos are copyrighted (and some are uploaded by people who don’t own the copyrights), but many are. Unauthorized copying or distribution of those videos is illegal, but whether AI training constitutes a form of copying or distribution is still a question being debated in many ongoing lawsuits. Tech companies have argued that training is a “fair use” of copyrighted work, and some judges have disagreed in their rulings. How the courts ultimately apply the law to this novel technology could have massive consequences for creators’ motivations to post their work on YouTube and similar platforms—if tech companies are able to continue taking creators’ work to build AI products that compete with them, then creators may have little choice but to stop sharing.

Generative-AI tools are already producing videos that compete with human-made work on YouTube. AI-generated history videos with hundreds of thousands of views and many inaccuracies are drowning out fact-checked, expert-produced content. Popular music-remix videos are frequently created using this technology, and many of them perform better than human-made videos.

The problem extends far beyond YouTube, however. Most modern chatbots are “multimodal,” meaning they can respond to a question by creating relevant media. Google’s Gemini chatbot, for instance, will produce short clips for paying users. Soon, you may be able to ask ChatGPT or another generative-AI tool how to build a table from found legs and get a custom how-to video in response. Even if that response isn’t as good as any video Peters would make, it will be immediate, and it will be tailor-made to your specifications. The online-publishing business has already been decimated by text-generation tools; video creators should expect similar challenges from generative-AI tools in the near future.

Many major tech companies have used these data sets to train AI, according to research papers I’ve read and AI developers I’ve spoken with. The group includes Microsoft, Meta, Amazon, Nvidia, Runway, ByteDance, Snap, and Tencent. I reached out to each of these companies to ask about their use of these data sets. Only Meta, Amazon, and Nvidia responded. All three said they “respect” content creators and believe that their use of the work is legal under existing copyright law. Amazon also shared that, where video is concerned, it is currently focused on developing ways to generate “compelling, high-quality advertisements from simple prompts.”

We can’t be certain whether all of these companies will use the videos to create for-profit video-generating tools. Some of the work they’ve done may be simply experimental. But a few of these companies have an obvious interest in pursuing commercial products: Meta, for instance, is developing a suite of tools called Movie Gen that creates videos from text prompts, and Snap offers “AI Video Lenses” that allow users to augment their videos with generative AI. Videos such as the ones in these data sets are the raw material for products like these; much as ChatGPT couldn’t write like Shakespeare without first “reading” Shakespeare, a video generator couldn’t construct a fake newscast without “watching” tons of recorded broadcasts. In fact, a large number of the videos in these data sets are from news and educational channels, such as the BBC (which has at least 33,000 videos in the data sets, across its various brands) and TED (nearly 50,000). Hundreds of thousands of others—if not more—are from individual creators, such as Peters.
AI companies are more interested in some videos than others. A spreadsheet leaked to 404 Media by a former employee at Runway, which builds AI video-generation tools, shows what the company valued about certain channels: “high camera movement,” “beautiful cinematic landscapes,” “high quality scenes from movies,” “super high quality sci-fi short films.” One channel was labeled “THE HOLY GRAIL OF CAR CINEMATICS SO FAR”; another was labeled “only 4 videos but they are really well done.”

Developers seek out high-quality videos in a variety of ways. Curators of two of the data sets collected here—HowTo100M and HD-VILA-100M—prioritized videos with high view counts on YouTube, equating popularity with quality. The creators of another data set, HD-VG-130M, noted that “high view count does not guarantee video quality,” and used an AI model to select videos of high “aesthetic quality.” Data-set creators often try to avoid videos that contain overlaid text, such as subtitles and logos, so that these identifying features don’t appear in videos generated by their models. So, some advice for YouTubers: Putting a watermark or logo on your videos, even a small one, makes them less desirable for training.

To prepare the videos for training, developers split the footage into short clips, in many cases cutting wherever there is a scene or camera change. Each clip is then given an English-language description of the visual scene so the model can be trained to correlate words with moving images, and to generate videos from text prompts. AI developers have a few methods of writing these captions. One is to pay workers to do it. Another is to use separate AI models to generate descriptions automatically. The latter is more common, because of its lower cost.
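To make that splitting step concrete, here is a minimal sketch of one common approach: summarize each frame as a color histogram and cut wherever the similarity between consecutive frames drops sharply. This illustrates the general technique, not any particular company’s pipeline; the threshold value and the file name are assumptions.

import cv2  # OpenCV

def find_scene_cuts(path: str, threshold: float = 0.5) -> list[int]:
    """Return frame indices where the color histogram shifts sharply."""
    capture = cv2.VideoCapture(path)
    cuts = []
    prev_hist = None
    frame_index = 0
    while True:
        ok, frame = capture.read()
        if not ok:  # end of the video
            break
        # Summarize the frame as a coarse 8x8x8 color histogram.
        hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8],
                            [0, 256, 0, 256, 0, 256])
        hist = cv2.normalize(hist, hist).flatten()
        if prev_hist is not None:
            # Correlation near 1.0 means consecutive frames look alike;
            # a sharp drop suggests a scene or camera change.
            if cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL) < threshold:
                cuts.append(frame_index)
        prev_hist = hist
        frame_index += 1
    capture.release()
    return cuts

print(find_scene_cuts("example_video.mp4"))  # hypothetical input file

In a real pipeline, the footage between cuts would be saved as individual clips and handed off for captioning; production systems use dedicated scene-detection tools and smarter thresholds, but the underlying idea is the same.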
AI video tools aren’t yet as mainstream as chatbots or image generators, but they are already in wide use. You may already have seen AI-manipulated video without realizing it. For example, TED has been using AI to dub speakers’ talks into different languages. This includes the video as well as the audio: Speakers’ mouths are lip-synched with the new words so it looks like they’re speaking Japanese, French, or Russian. Nishat Ruiter, TED’s general counsel, told me this is done with the speakers’ knowledge and consent.

There are also consumer-facing products for tweaking videos with AI. If your face doesn’t look right, for example, you can try a face-enhancer such as Facetune, or ditch your mug entirely with a face-swapper such as Facewow. With Runway’s Aleph, you can change the colors of objects, or turn sunshine into a snowstorm.

Then there are tools that generate new videos based on an image you provide. Google encourages Gemini users to animate their “favorite photos.” The result is a clip that extrapolates eight seconds of movement from an initial image, making a person dance, cook, or swing a golf club. These are often both amazing and creepy. “Talking head generation”—for employee-orientation videos, for example—is also advancing. Vidnoz AI promises to generate “Realistic AI Spokespersons of Any Style.” A company called Arcads will generate a complete advertisement, with actors and voiceover. ByteDance, the company that operates TikTok, offers a similar product called Symphony Creative Studio. Other applications of AI video generation include virtual try-on of clothes, generating custom video games, and animating cartoon characters and people.

Some companies are working with AI while simultaneously fighting to defend their content from being pilfered by AI companies. This reflects the Wild West mentality in AI right now—companies exploiting legal gray areas to see how they can profit. As I investigated these data sets, I learned about an incident involving TED—again, one of the most-pilfered organizations in the data sets captured here, and one that is attempting to employ AI to advance its own business. In June, the Cannes Lions international advertising festival gave one of its Grand Prix awards to an ad that included deepfaked footage from a TED talk by DeAndrea Salvador, currently a state senator in North Carolina. The ad agency, DM9, “used AI cloning to change her talk and repurposed it for a commercial ad campaign,” Ruiter told me on a video call recently. When the manipulation was discovered, the Cannes Lions festival withdrew the award. Last month, Salvador sued DM9 along with its clients—Whirlpool and Consul—for misappropriation of her likeness, among other things. DM9 apologized for the incident and cited “a series of failures in the production and sending” of the ad. A spokesperson from Whirlpool told me the company was unaware the senator’s remarks had been altered.

Others in the film industry have filed lawsuits against AI companies for training with their content. In June, Disney and Universal sued Midjourney, the maker of an image-generating tool that can produce images containing recognizable characters (Warner Brothers joined the lawsuit last week). The lawsuit called Midjourney a “bottomless pit of plagiarism.” The following month, two adult-film companies sued Meta for downloading (and distributing through BitTorrent) more than 2,000 of their videos. Neither Midjourney nor Meta has responded to the allegations, and neither responded to my request for comment. One YouTuber filed their own lawsuit: In August of last year, David Millette sued Nvidia for unjust enrichment and unfair competition with regard to the training of its Cosmos AI, but the case was voluntarily dismissed months later.

The Disney characters and the deepfaked Salvador ad are just two instances of how these tools can be damaging. The floodgates may soon open further. Thanks to the enormous amount of investment in the technology, generated videos are beginning to appear everywhere. One company, DeepBrain AI, pays “creators” to post AI-generated videos made with its tools on YouTube. It currently offers $500 for a video that gets 10,000 views, a relatively low threshold. Companies that run social-media platforms, such as Google and Meta, also pay users for content, through ad-revenue sharing, and many directly encourage the posting of AI-generated content. Not surprisingly, a coterie of gurus has arrived to teach the secrets of making money with AI-generated content.

Google and Meta have also trained AI tools on large quantities of videos from their own platforms: Google has taken at least 70 million clips from YouTube, and Meta has taken more than 65 million clips from Instagram. If these companies succeed in flooding their platforms with synthetic videos, human creators could be left with the unenviable task of competing with machines that churn out endless content based on their original work. And social media will become even less social than it is.

I asked Peters if he knew his videos had been taken from YouTube to train AI. He said he didn’t, but he wasn’t surprised. “I think everything’s gonna get stolen,” he told me. But he didn’t know what to do about it. “Do I quit, or do I just keep making videos and hope people want to connect with a person?”