
Antoine Vastel at DataDome explains how businesses can avoid becoming ChatGPT’s next prompt
An often-overlooked aspect of the immense power and accuracy of generative AI is the data it is built on, scraped from millions of websites, representing hundreds of gigabytes of text alone.
While ChatGPT does not reveal the list of sites it scrapes from, public estimates are staggering. A Washington Post analysis found that similar AI models such as Google’s T5 and Facebook’s Llama used over 15 million sites, of which business and commercial websites made up nearly 16%.
Scraping on this scale is achieved by deploying bots that can copy and process existing text from publicly available sites at an alarming speed. While this technology is nothing new, the development of multi-billion-dollar Large Language Models (LLMs) means it is likely that scraping will become more common over larger stretches of the internet.
Typically, data scrapers use automated programs known as web crawlers and scraper bots, which absorb massive amounts of original content across anywhere between thousands and millions of different websites.
A scraper can use these bots to analyse links, web pages, and the HTML structure of target websites, identifying the kinds of content they wish to extract. These programs can duplicate the entirety of a website’s content within seconds, making them unusually hard to detect via manual monitoring.
Once this content has been scraped, it can be used for a variety of purposes, but in the case of AI, it is used to make the model’s answers more accurate. While this can result in a near-infinite number of different AI answers, it is not uncommon for a program like ChatGPT to reproduce copyrighted or protected content.
It has been shown repeatedly that AI companies do not ask for businesses’ consent before taking their data, which is a growing problem given that companies like OpenAI are so secretive about their datasets.
Data scraping often carries negative connotations, but it is not always harmful. Retrieval Augmented Generation (RAG), for example, can help AI users get access to real-time information while directing traffic to the original site.
Unfortunately, though, much of the data scraped for LLMs presents cause for concern. The fact that huge volumes of data are being taken from businesses without their permission represents a serious risk to many companies.
Recently, companies including Apple, NVIDIA, Anthropic and Salesforce were exposed for using a dataset including copyrighted content from YouTube videos to train their AI models. Worryingly, the original creators of the dataset, EleutherAI, revealed they had scraped their data using “a very popular unofficial API that is both widely used and easily obtainable.”
It’s easy to see how this can affect a wide range of businesses, particularly those that rely on insight and analysis to draw in customers. If potential customers can access your knowledge for free through ChatGPT, why bother even coming to your website? If copyrighted literature is available at the click of a button, why pay authors?
As well as threatening the value of a business’s original content, this kind of data scraping can negatively affect the SEO ranking of a website and reduce traffic. There’s also a risk that scrapers can lift and then distribute potentially sensitive content that malicious actors can use to threaten a business’s reputation and cybersecurity.
From a business perspective, this reality is incredibly damaging to companies that strive to create value from their intellectual property as well as those that rely on high traffic, such as e-commerce or pay-per-click advertising. Short of a substantive legal change that would require AI companies to pay for the content they scrape, businesses will have to take precautions to prevent their data being used as ChatGPT’s next prompt.
To prevent data being scraped from their sites, businesses need to invest in the kind of bot and fraud detection software that can accurately distinguish a real user from an unauthorised actor or bot within milliseconds. Advanced programs like these use behavioural and IP analysis to pick up abnormal behaviour with a high degree of accuracy, and can present a CAPTCHA to stop suspected bots from scraping the site.
In addition to these automated defences, businesses can use monitoring software to detect unusual patterns, such as user accounts with a high volume of activity and no purchases. Closely monitoring traffic is also important, given that high volumes of product or page views could be a sign of non-human activity.
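As a rough illustration of the "high activity, no purchases" pattern, the check might look something like the minimal Python sketch below. The event format (pairs of account ID and action), the function name and the threshold of 500 views are all assumptions made for the example, not a description of any particular monitoring product.

```python
from collections import Counter


def flag_suspicious_accounts(events, min_views=500):
    """Return account IDs with at least `min_views` views and no purchases.

    `events` is an iterable of (account_id, action) pairs where action is
    "view" or "purchase". The event shape and the threshold are
    illustrative assumptions.
    """
    views = Counter()
    buyers = set()
    for account_id, action in events:
        if action == "view":
            views[account_id] += 1
        elif action == "purchase":
            buyers.add(account_id)
    # Heavy browsing with zero purchases is worth a closer look,
    # though it is a signal for review rather than proof of a bot.
    return sorted(
        account for account, count in views.items()
        if count >= min_views and account not in buyers
    )
```

In practice the threshold would need tuning against real traffic, since some genuine customers browse heavily without buying.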
Another key to protecting sites against web scrapers is to limit the amount of data that can be transferred from your website by restricting the number of requests an IP address can make. An API with set rate limits and usage policies can control access to your data and ensure it is only used for legitimate purposes.
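One common way to restrict requests per IP address is a sliding-window rate limiter. The Python sketch below is a simplified, in-memory version for illustration only: the class name, the default of 100 requests per 60 seconds and the injectable clock are assumptions, and a production deployment would more likely rely on a web server, CDN or dedicated bot protection service.

```python
import time
from collections import defaultdict, deque


class SlidingWindowRateLimiter:
    """Allow at most `max_requests` per IP within `window_seconds`."""

    def __init__(self, max_requests=100, window_seconds=60.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested
        self._hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def allow(self, ip):
        now = self.clock()
        hits = self._hits[ip]
        # Discard timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the limit: reject or challenge the request
        hits.append(now)
        return True
```

A request handler would call `allow(ip)` before serving a page and return an error such as HTTP 429 when it comes back `False`. Scrapers that rotate IP addresses can sidestep a limit like this, which is why it works best alongside the behavioural detection described above.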
Businesses should have a terms of service (TOS) document visible on the site that specifically limits request rates and prohibits data extraction. This sets clear boundaries on whether companies are allowed to extract data from your website, a crucial factor in litigation.
The site’s robots.txt file should also contain specific information about data collection, since site owners commonly use this file to communicate their intentions when it comes to scraping. While this won’t stop malicious actors, a fair number of web scrapers are instructed to skip over websites with explicit instructions against scraping.
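To make this concrete, the Python sketch below shows a sample robots.txt that asks two AI-related crawlers to stay away while leaving the site open to everyone else, and uses the standard library's `urllib.robotparser` to check how a well-behaved crawler would read it. GPTBot (OpenAI) and CCBot (Common Crawl) are used as examples of published crawler user agents; the rules themselves, the helper name and the example.com URL are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: block two AI-related crawlers, allow the rest.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if `user_agent` may fetch `url` under `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With these rules, `is_allowed(ROBOTS_TXT, "GPTBot", "https://example.com/articles/1")` is `False`, while an ordinary browser user agent is allowed. The file is only a request, so it has no effect on scrapers that choose to ignore it.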
The disregard many AI companies have for companies’ web data is concerning, and ultimately a conversation should be had among policymakers to improve the balance between the needs of LLMs and the legitimate concerns businesses have over their data.
While we wait, businesses should take responsibility for implementing straightforward strategies to protect themselves against data scrapers.
Antoine Vastel is VP of research at AI-powered cyber-fraud platform DataDome
© 2025, Lyonsdown Limited. teiss® is a registered trademark of Lyonsdown Ltd. VAT registration number: 830519543