Web scraping is a technique for extracting information from web sites that often uses automated programs, or bots (short for web robots), opening many sessions, or initiating many transactions. You can configure Application Security Manager™ (ASM) to detect and prevent various web scraping activities on the web sites that it is protecting.
ASM™ provides the following methods to address web scraping attacks. These methods can work independently of each other, or they can work together to detect and prevent web scraping attacks.
The BIG-IP® system can accurately detect web scraping anomalies only when response caching is turned off.
For web scraping detection to work properly, you should understand the following prerequisites:
By default, Application Security Manager™ allows requests from these well-known search engines and legitimate web robots:
You can add other search engines to the allowed search engine list; for example, if your web application uses an additional search engine. The list applies globally to all security policies on the system for which web scraping detection is enabled.
|Rate Limiting||When enabled, the system drops random sessions exhibiting suspicious behavior until the session opening rate is the same as the historical legitimate value. If you select this option, the screen displays an option for dropping requests from IP addresses with a bad reputation.|
|Drop IP Addresses with bad reputation||This option is available only if you have enabled rate limiting. When enabled, the system drops requests originating from IP addresses that are in the system’s IP address intelligence database when the attack is detected; no rate limiting will occur. (Attacking IP addresses that do not have a bad reputation undergo rate limiting, as usual.) You also need to set up IP address intelligence, and at least one of the IP intelligence categories must have its Alarm or Block flag enabled.|
|Sessions opened per second increased by||The system considers traffic to be an attack if the number of sessions opened per second increased by this percentage. The default value is 500%.|
|Sessions opened per second reached||The system considers traffic to be an attack if the number of sessions opened per second is equal to or greater than this number. The default value is 50 sessions opened per second.|
|Minimum sessions opened per second threshold for detection||The system only considers traffic to be an attack if this value plus one of the sessions opened values is exceeded. The default value is 25 sessions opened per second.|
|Session transactions above normal by||The system considers traffic in a session to be an attack if the number of transactions in the session is more than normal by this percentage (and minimum session value is met). Normal refers to the average number of transactions per session for the whole site during the last hour. The default value is 500%.|
|Sessions transactions reached||The system considers traffic to be an attack if the number of transactions per sessions is equal to or greater than this number (and minimum session value is met). The default value is 400 transactions.|
|Minimum session transactions threshold for detection||The system considers traffic to be an attack only if the number of transactions per session is equal to or greater than this number, and at least one of the sessions transactions numbers was exceeded. The default value is 200 transactions.|
This figure shows a Web Scraping Statistics event log on an Application Security Manager™ (ASM) system where several web scraping attacks, with different attack types, have occurred.
Web scraping statistics event log
The next figure shows details on a web scraping attack started on November 17 at 7:14PM. The attack type was Session Resets by Persistent Client Identification, and it occurred when the number of cookie deletions detected through the use of fingerprinting exceeded the configured threshold.
Example cookie deletion attack (fingerprinting)
The next figure shows details on a web scraping attack started on November 17 at 7:20PM. The attack type was Session Resets by Persistent Client Identification. It occurred when the number of cookie deletions detected through the use of persistent client identification exceeded the configured threshold (more than 2 in 4 seconds).
Example cookie deletion attack (persistent client ID)
The next figure shows details on a web scraping attack started on November 17 at 7:24PM. The attack type was Session Resets by Persistent Client Identification. It occurred when the number of integrity fault events detected through the use of persistent client identification exceeded the configured threshold (more than 3 in 25 seconds).
Example device ID integrity attack
The next figure shows details on a suspicious clients attack that occurred when a client installed the disallowed Scraper browser plug-in.
Example disallowed plug-in attack
Web scraping statistics specify the attack type so you have more information about why the attack occurred. This shows the web scraping attack types that can display in the web scraping event log.
Click the attack type link to display the detected injection ratio and the injection
Note: You cannot configure the Bot activity detected ratio values. This attack type can occur only when the security policy is in Transparent mode.
|Bot Detected||Indicates that the system suspects that the web scraping attack was caused by a web robot.|
|Session Opening Anomaly by IP||Indicates that the web scraping attack was caused by too many sessions being opened from one IP address. Click the attack type link to display the number of sessions opened per second from the IP address, the number of legitimate sessions, and the attack prevention state.|
|Session Resets by Persistent Client Identification||Indicates that the web scraping attack was caused by too many session resets or inconsistencies occurring within a specified time. Click the attack type link to display the number of resets or inconsistencies that occurred within a number of seconds.|
|Suspicious Clients||Indicates that the web scraping attack was caused by web scraping extensions on the browser. Click the attack type link to display the scraping extensions found in the browser.|
|Transactions per session anomaly||Indicates that the web scraping attack was caused by too many transactions being opened during one session. Click the attack type link to display the number of transactions detected on the session.|
This figure shows a Web Scraping Statistics chart on an Application Security Manager™ (ASM) test system where many web scraping attacks occurred during a short period of time.
Web scraping statistics chart
You can use this chart to see the number of rejected requests, web scraping attacks, and total violations that occurred on the web applications protected using the five security policies listed at the bottom.
When you have completed the steps in this implementation, you have configured the Application Security Manager™ to protect against web scraping. The system examines mouse and keyboard activity for non-human actions. Depending on your configuration, the system detects web scraping attacks based on bot detection, session opening violations, session transaction violations, and fingerprinting.
After traffic is flowing to the system, you can check whether web scraping attacks are being logged or prevented, and investigate them by viewing web scraping event logs and statistics.
If fingerprinting is enabled, the system uses browser attributes to help with detecting web scraping. If using fingerprinting with suspicious clients set to alarm and block, the system collects browser attributes and blocks suspicious requests using information obtained by fingerprinting. If you enabled event sequencing, the system looks for irregular event sequences to detect bots.
If you chose alarm and block for the web scraping configuration and the security policy is in the blocking operation mode, the system drops requests that cause the Web scraping detected violation. If you chose alarm only (or the policy is in the transparent mode), web scraping attacks are logged only but not blocked.