AI today has seemingly found its way into every single aspect of life, its applications ranging from obvious areas like coding or image processing to less apparent ones like disease diagnostics and legal work. AI is absolutely everywhere. And even if you know very little about how it works, you have probably at least heard that every AI requires troves of data to learn from before it can be put to use. This data has to come from somewhere, which brings us to the first question of today’s TechTok:

Are apps and websites using my data to train AI without me knowing?

There is no short, definitive answer to this question. The best we can come up with is: “Yes, they do, but not necessarily in the way you might think.” We are aware that you probably didn’t come here for a broad answer like that. But before we dive any deeper, let’s get one thing clear: “training AI” and “collecting data” are not synonyms, although they are related. Simply put, training AI requires data, so finding ways to obtain that data is one of the biggest challenges in building an AI system. However, there are countless other reasons why someone might want to get their hands on your information.

The thing is, online data collection has existed for decades, long before AI appeared on the digital horizon, and for many years the main driving force behind gathering user data has been advertising. Insanely complex systems have been built to create user profiles and track users across various apps and websites, all with the goal of knowing exactly which ad to show to which person at what time, maximizing the probability that the person clicks the banner.
The digital ad market is estimated at about USD 600–700 billion per year, and user data is the foundation of this market — this should give you an idea of why data is so often called the new oil.

Of course, there have always been other reasons why companies seek digital data: personalization, recommendations, fraud detection, billing, retention, product analytics — often important in sectors like finance, retail, telecom, and marketplaces. The exact reasons are beside the point. What we want to highlight here is that global, rampant data collection was not spawned by the emergence and subsequent spread of AI. In fact, in many cases, the collection methods used today to gather data for AI training are the same ones that have been used for years for other purposes, so AI companies didn’t have to reinvent the wheel — or at the very least, they had a solid foundation to stand on.

The types of data required for ad tracking and AI training overlap heavily too — which might come as a surprise to some. In the minds of many people, the terms ‘AI’ and ‘LLM’ (large language model) are synonyms. Indeed, chatbots (which are basically user-facing shells with an LLM underneath) are perhaps the type of AI the average user interacts with most. Common sense dictates that training the generative AI behind a chatbot requires datasets with tons of user-generated text — posts and comments on platforms like Reddit or X, chat inputs, reviews, and so on. This is correct: these LLMs need to learn how people actually talk, how to answer questions, how real-life conversations flow — things like humor, slang, and tone. But what many people do not realize is how many types of AI other than generative there are, built for many different purposes — recommendation systems, search ranking, ad targeting, to name just a few. For these AI systems, behavioral data is king, while content itself matters much less.
And many modern platforms combine both approaches: they need raw content, but they also want to know what you click and when.

So, circling back to the initial question: yes, some AI companies take advantage of your data to train their systems, but they largely do it the same way they (and other companies) were collecting your data before AI, for other purposes. And here comes the tricky part — technically, most companies do not collect data behind your back, for AI training or otherwise, as doing so is illegal in many jurisdictions. Some go as far as publicly announcing their intent to use your data for AI training, although some sugarcoat it more than others. At the same time, it’s a fairly common practice to hide ongoing data collection behind lengthy privacy policies, tedious terms of service, and other long and boring legal documents. Those with a darker sense of humor might even find it funny that privacy policies covering data collection for AI training often use the same vague language and broad wording you would find in similar documents about gathering information for ad tracking.

But even if you do your due diligence and power through all the legalese to confirm that the app you want to install doesn’t use your data to feed the proverbial machine, the sad reality is that you are still not in the clear. Sometimes the developers ‘forget’ to mention it, as in the recent case where OkCupid, a popular dating app, shared 3 million user photos with an AI company to train on — all without telling its users. This is nothing new; the same shady practices existed long before AI. Unfortunately, where there is profit to be gained, there will always be those willing to bend the law to their advantage.

How does your data end up training AI?

Let’s now take a step back.
We’ve touched a little on which data is being used to train AI and mentioned that anything goes: both raw content, like texts and photos, and behavioral data, like clicks and other interactions. But many readers would probably like us to be more specific: “What exactly of my data could end up being used for AI, and how?” Well, not all data is used in the same way. Some data is more sensitive, and data from different sources may feed AI differently. If your goal is to train AI, there are countless potential sources of training data. For the purposes of this article, we will identify four categories, depending on how the data is collected:

- Social media (publicly available data)
- Chatbot conversations (direct input)
- Platform interactions (behavioral data)
- Third-party apps and websites

First off, if you post or comment publicly — on Reddit, YouTube, X, Facebook, etc. — that does not automatically mean anyone can use it for AI training, but you also usually lack any real means to stop the platform from training AI on your content or sharing your data with third parties. Of course, things vary greatly from platform to platform, but the rule of thumb remains: if it’s public, you probably don’t control it. Platforms that don’t make use of users’ data themselves often sell or share it with others, in some form or fashion. Users in the EU are generally better protected than others, thanks to the EU’s advanced privacy legislation. Regulations like the GDPR and the EU AI Act give EU citizens the right to be informed, to object to certain processing, to request access to or deletion of their data in some cases, and to challenge or restrict the use of their personal data for AI training.

But what if you talk to a chatbot directly? What are the chances that your input will be used for AI training?
That depends on the service, of course, but more often than not, with consumer-facing AI tools, anything you type in or upload may be used to improve that service. Even if you are on a paid plan, unless it’s a corporate/enterprise (not an individual) plan, your data is still mostly treated as fair game. That said, many AI chatbots at least offer an opt-out, even if it’s often buried somewhere deep in the settings. We imagine that for many readers this is one of the key questions: “How do I opt out of data collection when talking to my chatbot?” So it seems important to provide some practical advice here rather than settle for generalities. There are hundreds, even thousands, of chatbots, so let’s focus on some of the most common ones (we assume personal use everywhere, not enterprise plans or their analogs):

ChatGPT. Open ChatGPT, go to your profile, then Settings → Data Controls, and turn off “Improve the model for everyone.” OpenAI says this stops your chats from being used to train ChatGPT going forward, though some retention may still apply. OpenAI also used to grant opt-out status upon receiving a message to support. If you did that at some point in the past, OpenAI claims to still honor that request, but this path is no longer available to newer users.

Perplexity. Open Account settings → Preferences and switch off “AI data retention.” Note that this opt-out only affects future data: anything collected before the opt-out date may be used by Perplexity for AI training and cannot be deleted or removed.

Gemini. In your Google account, go to Data & privacy and find “Gemini Apps Activity,” then select “Turn off” or “Turn off and delete activity.” This only prevents future sampling and does not affect any past interactions. Mind that with multiple Google products using Gemini, the exact training/privacy behavior depends on the product.

Claude.
Claude doesn’t train its models on your conversations by default; you can only opt in manually if you’d like to. If you delete a conversation, Anthropic removes it from their systems within approximately 30 days.

As for behavioral data collection, a simple (but mostly accurate) way of thinking about it is: the larger the platform, the more it relies on your behavioral data; smaller, narrowly functional apps and services rarely track your behavior. Big content platforms like YouTube, TikTok, or Netflix, search engines, e-commerce platforms like Amazon or eBay — with these, you can be sure. They will collect as much data about your activities as they can to hone their recommendation and ranking algorithms. That doesn’t mean smaller apps don’t do it at all, but for them this kind of tracking is much less relevant.

But what about the ‘regular,’ smaller apps and websites that we use every day? Not everything is a chatbot or a huge platform: what if you just install a random app or a game, or visit a smaller website? Again, it is impossible to give a single answer for all of them, as there are literally millions. In general, such smaller apps and websites are not interested in your data for training AIs of their own, and they rarely sell users’ data directly to someone who might be. However, it is extremely common for the developers of such apps and websites to include analytics SDKs, ad networks, and other tracking tools for monetization purposes. These tools can, and very much do, collect things like behavioral data, device info, usage patterns, and so on.
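To make this concrete, here’s a rough sketch, in Python, of the kind of payload an embedded analytics SDK might send home. Every field name and value below is made up for illustration; real SDKs each define their own schemas:

```python
import json

# A hypothetical behavioral-tracking event, illustrating the kinds of
# fields analytics SDKs commonly collect. All names and values here are
# invented for this example, not taken from any real SDK.
def build_event(action: str, screen: str) -> dict:
    return {
        "event": action,                # what the user did
        "screen": screen,               # where in the app it happened
        "device": {                     # device info
            "os": "Android 14",
            "model": "Pixel 8",
            "language": "en-US",
        },
        "session_length_sec": 312,      # usage pattern: time spent in app
    }

payload = build_event("button_tap", "checkout")
print(json.dumps(payload, indent=2))
```

Individually, fields like these look harmless; aggregated across millions of sessions, they are exactly the behavioral data that feeds recommendation and targeting models.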
And when this data reaches ad networks, data brokers, and analytics firms, it gets aggregated and can easily be used for modeling, sold, or otherwise indirectly contribute to AI training (among many other things, of course).

When you look at all the ways your data can end up in some AI’s training dataset, you might think: “That’s a lot to worry about!” That is true, to an extent, but keep in mind that not every single bit of information you provide gets used, and not all companies behave the same way. And, last but not least, there are ways to minimize the amount of data collected about you. Which brings us to the second question of today’s TechTok:

Can using an ad blocker and/or a VPN stop AI tracking and data collection?

As you just saw, AI-related tracking takes so many different forms that it is impossible to give a yes-or-no answer to this question. Both an ad blocker and a VPN can help, each in its own way, but not against everything.

First of all, neither of them will help if you actively provide data: talk to a chatbot, post on social media, leave comments. Ad blockers and VPNs can’t magically prohibit a platform from using something you have already given it, directly or indirectly. Against that type of data collection, your best bet is privacy settings, opt-out toggles, and privacy-protection laws. Check the privacy policies and available privacy settings of the platforms and apps you engage with, and if you don’t like what you see, consider picking a different option.

What ad blockers can help with is third-party trackers that collect data about you for future use and, to some extent, behavioral tracking. Stopping third-party analytics is, without question, the strongest suit of ad blockers when it comes to preventing your data from leaking. Ad blockers like AdGuard can deal with most, if not all, third-party trackers on websites.
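To give a sense of how this works under the hood: ad blockers match every network request against filter lists written in a simple rule syntax. The rules below are an illustrative sketch — the domains are made up, though the syntax itself is real and widely supported; community-maintained lists like EasyPrivacy contain tens of thousands of such entries:

```adblock
! Block all requests to a hypothetical analytics domain
||tracker-metrics.example^
! Block a domain only when it loads as a third party on other sites
||ads-telemetry.example^$third-party
! Strip a common tracking parameter from all URLs (AdGuard syntax)
$removeparam=utm_source
```

A request that matches a rule is simply never sent, so the tracker on the other end has nothing to collect.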
Inside apps, things get trickier, but this is true in general: Android and iOS impose rather strict limitations on interfering with the traffic of other apps.

Ad blockers can also help stop the collection of behavioral data, but not entirely. Unfortunately, most major platforms rely heavily on first-party tracking and don’t need third parties to build recommendations, train models, and analyze behavior. And blocking first-party tracking, especially on large platforms, often interferes with useful functionality — imagine blocking first-party tracking on YouTube, only for videos to suddenly stop loading. Yet again, these problems are more pronounced in mobile apps than on websites.

Still, an ad blocker is one of the best tools available to you if your goal is to starve the AI training algorithms. But what about VPNs?

VPNs are great — some may even say essential — for privacy protection. But when it comes specifically to stopping your data from being used for AI training, their usefulness is limited. They can still be helpful, just not in a direct way. VPNs hide your IP address and mask your location, making it harder for websites and third-party trackers to link your activity across different sites or build a profile based on your network identity. However, a VPN does not stop the platforms you use from seeing what you do on them. If you are logged into an account, or even just interacting with a website or app, your clicks, searches, and inputs are still recorded directly by that service. A VPN also will not stop third-party trackers from gathering information about you — leave that job to ad blockers (although a VPN may make tracking less precise).

Let’s recap: ad blockers and VPNs are great tools in your privacy protection arsenal, and they certainly will not hurt if you want to keep your data from becoming AI training fodder — especially ad blockers.
But in the end, your data’s safety depends first and foremost on your own attentiveness and diligence. If you study privacy policies before using apps and services, and if you are mindful about what you post online and what you share with a chatbot, the chances of your personal details becoming part of some future AI’s training dataset go down significantly. It’s good to have strong tools on your side, but nothing beats good old caution.