What is scraping? | Protection from web scraping & data scraping

Defining web scraping

Web scraping, or content scraping, is the practice of using automated bots and web crawlers to extract content or data from third-party websites. The scraper can then replicate this data on another website or application.

Web scraping can be a confusing issue from a security perspective, as it is a widespread practice in many digital businesses and has legitimate uses. Online businesses might scrape websites for purposes such as powering search engines, delivering price comparisons to consumers, and aggregating news or weather content.

Unfortunately, bad actors can also use scraping bots for more nefarious purposes, such as:

Scraping pricing data and using the intel to undercut competitors
Scraping and reposting marketing content to lure users away from competitors
Scraping restricted data post-login to resell on secondary markets

In the most malicious scenarios, cybercriminals deploy bots to scrape user data and resell it or use it for a broader attack. In April and July of 2021, LinkedIn fell victim when data from over one billion user accounts was scraped and offered for sale on the dark web. In May of 2021, Business Insider reported that Facebook had been similarly targeted: scrapers gained information on over 500 million users.

In addition to reselling for a quick profit, attackers scrape a site to identify employee names and deduce username and email formats to launch targeted phishing and account takeover (ATO) attacks.

How it works

Web scraping attacks can be very broad, copying an entire site to see if there is any data which can be exploited, or very targeted, seeking specific data on specific pages. Regardless, every attack starts with a plan.

The attacker may begin by deploying web crawlers or spiders to map a targeted site, identifying URLs, page metadata, and access gates such as account log-ins and CAPTCHAs on specific pages. With this map, the attacker develops a script for the scraper bots to follow. The script tells the scrapers which URLs to visit and what to do on each page. The attacker may also create fake user accounts to register bots as legitimate users on the website and enable them to access paid content.

A web scraper bot will typically send a series of HTTP GET requests to the targeted website to access the HTML source code. Data locators within the scraper are programmed to look for specific data types and save all of the relevant information from the targeted web server’s response to a CSV or JSON file.
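To make the "data locator" idea concrete, the following is a minimal sketch using only the Python standard library. The HTML snippet, the class names `product`, `name`, and `price`, and the helper names are all hypothetical, chosen for illustration; the sketch parses an in-memory string rather than a live HTTP response.

```python
import csv
import io
import json
from html.parser import HTMLParser

# Hypothetical page markup; a real scraper would receive this in an HTTP response.
SAMPLE_HTML = """
<ul>
  <li class="product"><span class="name">Widget</span><span class="price">9.99</span></li>
  <li class="product"><span class="name">Gadget</span><span class="price">24.50</span></li>
</ul>
"""


class ProductLocator(HTMLParser):
    """Collects the text inside <span class="name"> and <span class="price">."""

    FIELDS = {"name", "price"}

    def __init__(self):
        super().__init__()
        self.rows = []
        self._field = None  # field currently being read, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "li" and css_class == "product":
            self.rows.append({})  # start a new record for each product
        elif tag == "span" and css_class in self.FIELDS:
            self._field = css_class

    def handle_data(self, data):
        if self._field and self.rows:
            self.rows[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def extract_products(html):
    """Run the locator over an HTML string and return a list of dicts."""
    locator = ProductLocator()
    locator.feed(html)
    return locator.rows


def to_csv(rows):
    """Serialize the extracted rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["name", "price"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


products = extract_products(SAMPLE_HTML)
print(json.dumps(products))
print(to_csv(products))
```

The point for defenders is how little is required: any data that appears in predictable markup can be located and exported with a few dozen lines of code.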

In database scraping attacks, a more advanced type of scraping, the scraper interacts with an application on the site to retrieve content from its database. For example, sophisticated bots can be programmed to make thousands of requests to internal application programming interfaces (APIs) for associated data, such as product prices or contact details, that is stored in a database and delivered to a browser over HTTP.

Collecting and copying these large data repositories requires an enormous amount of processing power. While businesses engaged in legitimate scraping activity invest in vast server arrays to process the data, criminals are more likely to employ a network of hundreds or thousands of hijacked computers, known as a botnet, which spreads the processing load and helps to mask the malicious activity by distributing the data requests.

Rise in incidents

Web scraping is a rapidly growing threat for many industries, with travel and hospitality, e-commerce, and media being top targets. Across all industries, the more successful your business is, the more likely you are to be scraped by competitors, which fuels more targeted attacks.

Scraping bots are increasingly sophisticated and difficult to detect because they can imitate normal human interactions. Bots mimic every part of user behavior, including mouse movement, clicking, and typing, but these actions have no purpose other than collecting data.

Scraping bot attacks have also become more widely distributed, mounting low and slow attacks that use thousands of geographically distributed IP addresses, each only requesting a few pages of content, and rotating browser user-agents to avoid being detected by web security tools.
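A short simulation shows why a low and slow attack slips past a simple per-IP request threshold. This is a sketch under stated assumptions: the limit of 100 requests per window, the request volumes, and the IP addresses are all hypothetical and generated in memory.

```python
from collections import Counter

PER_IP_LIMIT = 100  # hypothetical: max requests allowed per IP per time window


def flagged_ips(request_ips, limit=PER_IP_LIMIT):
    """Return the set of IPs whose request count exceeds the limit."""
    counts = Counter(request_ips)
    return {ip for ip, n in counts.items() if n > limit}


# One noisy bot: 5,000 requests from a single address.
noisy = ["203.0.113.7"] * 5000

# Low and slow: the same 5,000 requests spread over 1,000 simulated
# addresses, 5 requests each.
distributed = [
    f"10.0.{i // 250}.{i % 250}" for i in range(1000) for _ in range(5)
]

print(flagged_ips(noisy))        # the single noisy address is flagged
print(flagged_ips(distributed))  # nothing is flagged
```

Both traffic patterns fetch the same number of pages, but only the single-source one crosses the threshold. Lowering the limit far enough to catch 5 requests per address would also block ordinary visitors.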

Business impact

Scraping can negatively impact businesses in several ways:

Loss of competitive advantage
Damaged SEO rank if search engines detect duplicate content
Stolen proprietary data and restricted content
Infrastructure wasted on bot traffic
Slower website performance

In its report, The Business Impact of Website Scraping, Aberdeen Research found that, “The median annual business impact of website scraping is as much as 80% of overall e-commerce website profitability.” For the Media sector, the research estimates “the annual business impact of website scraping is between 3.0% and 14.8% of annual website revenue, with a median of 7.9%.”

Scraping is often not a prosecutable offense. In fact, in April 2022, the U.S. Court of Appeals for the Ninth Circuit ruled that scraping publicly accessible data was not covered under the Computer Fraud and Abuse Act (CFAA). Web scraping isn’t illegal in many cases, especially if the scraped content is publicly available on a website and nothing proprietary is reposted as one’s own.

Integrating with DevSecOps

The most common method used to protect a website from scraping relies on tracking the activity of old attacks coming from suspicious IP addresses and domains. But bad bots find new ways in, so basic detection tools that are based on signatures or volumetric sensors are unable to keep up with changes, leaving site owners with thousands of obsolete threat profiles and an ongoing problem.
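The limitation of signature-based detection can be illustrated with a minimal sketch. The blocklist entries, the `Request` type, and the `is_blocked` function below are hypothetical and do not represent any particular product; they only model a filter built from the IP addresses and user-agent strings seen in earlier attacks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    ip: str
    user_agent: str


# Hypothetical signatures recorded from earlier attacks.
BLOCKED_IPS = {"198.51.100.23"}
BLOCKED_AGENT_SUBSTRINGS = ("python-requests", "curl")


def is_blocked(req):
    """Block a request only if it matches a previously recorded signature."""
    if req.ip in BLOCKED_IPS:
        return True
    agent = req.user_agent.lower()
    return any(sig in agent for sig in BLOCKED_AGENT_SUBSTRINGS)


# A bot seen before: known IP and a default scripting user-agent.
old_bot = Request("198.51.100.23", "python-requests/2.31")

# The same operator after rotating to a new IP and a browser-like user-agent.
new_bot = Request("192.0.2.44", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)")

print(is_blocked(old_bot))  # True
print(is_blocked(new_bot))  # False
```

Because the filter only knows what it has already seen, every rotation of address or user-agent requires a new entry, which is how the list of obsolete threat profiles grows.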

Web application firewalls (WAFs) are also commonly used, but are largely ineffective in stopping bot attacks because modern bots are capable of evading detection by mimicking human behavior. Hyper-distributed bot attacks that use many different user-agents, IPs and ASNs easily bypass WAFs and homegrown bot solutions. Homegrown bot management and CAPTCHA challenges are typically no match for advanced scraping bots and only succeed in frustrating site visitors.

HUMAN stops web scraping

HUMAN Scraping Defense protects your web and mobile applications from web scraping bots. It provides the highest level of bot detection accuracy for even the most sophisticated scraping bot attacks. The solution executes different modes of attack responses, including hard blocks, honeypots, misdirection, and serving deceptive content.

Scraping Defense incorporates behavioral profiles, machine learning, and real-time sensor data to detect automated bot attacks. The solution recognizes legitimate search engine crawlers, while blocking malicious bots that intend to harvest your data. And because it mitigates bad bots at the edge, Scraping Defense offers complete protection without impacting site performance and user experience.
