Web scraping attacks can be very broad, copying an entire site to see whether any of its data can be exploited, or very targeted, seeking specific data on specific pages. Either way, every attack starts with a plan.
The attacker may begin by deploying web crawlers, or spiders, to map the targeted site: cataloguing URLs and page metadata, and noting access gates, such as account log-ins and CAPTCHAs, on specific pages. From this map, the attacker develops a script for the scraper bots to follow, telling them which URLs to visit and what to do on each page. The attacker may also create fake user accounts to register the bots as legitimate users of the website, giving them access to paid content.
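This reconnaissance stage needs very little code. The sketch below is a minimal illustration of a site-mapping crawler: it walks same-site links breadth-first and records each page's URL, title and whether it appears to contain an access gate. The start URL and the log-in and CAPTCHA heuristics are hypothetical placeholders for illustration, not taken from any real attack tooling.

```python
# Minimal site-mapping crawler sketch (illustrative only).
# Assumes the third-party 'requests' and 'beautifulsoup4' packages;
# the start URL and access-gate heuristics are hypothetical.
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

START_URL = "https://www.example.com/"   # hypothetical target
MAX_PAGES = 50                           # keep the sketch bounded

def crawl(start_url, max_pages):
    site = urlparse(start_url).netloc
    queue, seen, pages = deque([start_url]), {start_url}, []
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue
        soup = BeautifulSoup(resp.text, "html.parser")
        pages.append({
            "url": url,
            "title": soup.title.get_text(strip=True) if soup.title else "",
            # crude heuristics for access gates on this page
            "has_login": bool(soup.find("input", {"type": "password"})),
            "has_captcha": "captcha" in resp.text.lower(),
        })
        # enqueue unseen same-site links only
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"]).split("#")[0]
            if urlparse(link).netloc == site and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages

if __name__ == "__main__":
    for page in crawl(START_URL, MAX_PAGES):
        print(page)
```

The output of a crawl like this, a list of URLs annotated with what stands between the bot and the data, is effectively the script the scraper bots then follow.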
A web scraper bot will typically send a series of HTTP GET requests to the targeted website to retrieve its HTML source code. Data locators within the scraper are programmed to find specific data types and save the relevant information from the web server's responses to a CSV or JSON file.
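In code, this step amounts to little more than a GET request, a few selectors acting as data locators, and a CSV writer. The following sketch assumes a hypothetical product-listing page; the URLs and the .product, .product-name and .product-price selectors are illustrative inventions, not selectors from any real site.

```python
# Minimal scraper-bot sketch: GET the HTML source, apply data
# locators, save matches to CSV. URLs and selectors are hypothetical.
import csv

import requests
from bs4 import BeautifulSoup

TARGET_URLS = [
    "https://www.example.com/products?page=1",  # hypothetical pages
    "https://www.example.com/products?page=2",
]

with open("scraped.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["url", "name", "price"])
    for url in TARGET_URLS:
        resp = requests.get(url, timeout=10)  # HTTP GET for the HTML
        soup = BeautifulSoup(resp.text, "html.parser")
        # "data locators": selectors aimed at the wanted data types
        for item in soup.select(".product"):
            name = item.select_one(".product-name")
            price = item.select_one(".product-price")
            if name and price:
                writer.writerow([url, name.get_text(strip=True),
                                 price.get_text(strip=True)])
```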
In database scraping attacks, a more advanced form of scraping, the scraper interacts with an application on the site to retrieve content from its database. For example, sophisticated bots can be programmed to make thousands of requests to the internal application programming interfaces (APIs) that serve associated data, such as product prices or contact details, stored in a database and delivered to the browser via HTTP requests.
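Because such internal APIs usually return structured JSON, harvesting them needs no HTML parsing at all. The sketch below assumes a hypothetical paginated endpoint at /api/products whose path, parameters and response shape are invented for illustration; the pattern of iterating through pages until the API runs dry is the essence of a database scraping run.

```python
# Database-scraping sketch: walk a hypothetical paginated JSON API
# and dump every record. Endpoint, parameters and response shape
# (a JSON list per page) are assumptions for illustration.
import json

import requests

API_URL = "https://www.example.com/api/products"  # hypothetical internal API
records, page = [], 1

while True:
    resp = requests.get(API_URL,
                        params={"page": page, "per_page": 100},
                        timeout=10)
    if resp.status_code != 200:
        break
    batch = resp.json()
    if not batch:          # empty page: the database is exhausted
        break
    records.extend(batch)  # each item might hold a price, contact, etc.
    page += 1

with open("dump.json", "w", encoding="utf-8") as f:
    json.dump(records, f, indent=2)

print(f"Collected {len(records)} records across {page - 1} pages")
```

Run at scale, this loop is what turns a site's back-end database into the attacker's local copy, one page of records at a time.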
Collecting and copying these large data repositories requires an enormous amount of processing power. While businesses engaged in legitimate scraping activity invest in vast server arrays to process the data, criminals are more likely to employ a network of hundreds or thousands of hijacked computers, known as a botnet, which spreads the processing load and helps to mask the malicious activity by distributing the data requests.