Erez Hasson, Strategist, Application Security at Imperva, explains why organisations need to defend themselves against web scraping bots
Web scraping – using automation (bots) to extract data or content from websites – is essential to how the internet functions today. For example, search engine crawlers such as Googlebot are good and necessary bots that index and rank websites, maintaining a searchable inventory of online sites so we can find what we need.
But “bad” bots that copy website data with more nefarious motives in mind exist in a legal and moral grey area. The existence of these “bad” bots raises a host of questions, such as:
- Who owns data?
- How much right do other people have to use that data?
- Whose responsibility is it to protect that data?
- How far can you go to stop web scraping bots?
- And can businesses do anything about bad web scraping bots?
Lost revenue and scams
While web scraping is still being litigated, it poses real, immediate problems for many organisations. Data, including the personal data of customers or users, could be harvested from websites and then sold on to other organisations – resulting in unhappy users and breaches of privacy.
As Imperva’s own research shows, industries such as e-commerce, travel and gambling are especially attractive targets. For instance, organisations can use bots to scrape travel sites for prices and other data – not only collating the information, but also aggressively positioning and advertising their own offerings, ultimately driving customers away from the actual providers and into the arms of competitors who may well offer an inferior service.
Naturally, this can turn into a race to the bottom – any organisation that sees a competitor adopt these tactics without consequences will be tempted to do the same.
While the above might seem part and parcel of doing business to some, in the most nefarious web scraping incidents cybercriminals carbon-copy entire websites. The stolen content is then used to scam potential customers, or to earn ad revenue from another organisation’s work. At the very least, duplicated content can seriously damage SEO rankings and harm innocent businesses.
Embracing a defensive position
Bad bots represent a problem for businesses, but how can organisations protect their data? In short, they must embrace a defensive position with the goal of protecting proprietary and customer data, while maintaining the legitimate flow of traffic to their websites.
A comprehensive, dedicated bot management solution is essential – one that enables legitimate bots to perform their task while spotting and mitigating hostile ones. For instance, organisations still want their websites to show up on search engines, but don’t want to fall victim to bots impersonating Googlebot or other search crawlers, which are often used to launch DDoS attacks.
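One way to distinguish a genuine search crawler from an impersonator is the reverse-DNS check that Google itself documents for Googlebot: resolve the client IP to a hostname, check the hostname belongs to a Google crawler domain, then forward-resolve the hostname and confirm it points back to the same IP. A minimal sketch of that check (function names are illustrative, not from any particular product):

```python
import socket

# Hostname suffixes Google documents for its crawlers.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def has_google_suffix(hostname: str) -> bool:
    """Check whether a resolved hostname sits under a Google crawler domain."""
    return hostname.endswith(GOOGLE_SUFFIXES)

def is_genuine_googlebot(ip: str) -> bool:
    """Verify a claimed Googlebot IP with a reverse lookup, a domain
    suffix check, and a forward lookup that must match the original IP."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)    # reverse DNS
        if not has_google_suffix(hostname):
            return False
        forward_ip = socket.gethostbyname(hostname)  # forward-confirm
        return forward_ip == ip
    except (socket.herror, socket.gaierror):
        return False                                 # unresolvable = untrusted
```

The suffix check alone is not enough – an attacker can register a lookalike domain or set a misleading reverse record – which is why the forward-confirming lookup matters.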
In addition to bot protection, businesses can use other tools and techniques to keep their data safe and ensure that APIs aren’t exposed. This includes limiting the number of searches carried out by specific IP addresses, using CAPTCHA verification to ensure traffic comes from actual humans, and requiring registration and logins.
This makes it harder for web scrapers to exploit a business’s systems by reverse-engineering an organisation’s APIs and endpoints and using them in scraper programs.
Here to stay – for now
Although web scraping is not yet unlawful, businesses need to ensure they are doing all they can to keep their data safe from scrapers. Losing out on potential revenue – or worse, losing content to criminals and having your data duplicated and sold on – isn’t a good prospect. Whatever the legal consensus on web scraping turns out to be, cybercriminals will almost certainly continue to target websites as part of their criminal activity.
Taking up a defensive position to mitigate web scraping bots and keep data safe is always going to be a wise investment.
Erez Hasson is Strategist, Application Security at Imperva
Main image courtesy of iStockPhoto.com