Manual Chapter: Detecting and Preventing Web Scraping

Overview: Detecting and preventing web scraping

Web scraping is a technique for extracting information from web sites, often using automated programs, or bots (short for web robots), that open many sessions or initiate many transactions. You can configure Application Security Manager™ (ASM) to detect and prevent various web scraping activities on the web sites that it is protecting.

ASM™ provides the following methods to address web scraping attacks. These methods can work independently of each other, or work together to detect and prevent web scraping attacks.

  • Bot detection investigates whether a web client source is human by limiting the number of page changes allowed within a specified time.
  • Session opening detects an anomaly when the number of sessions opened from an IP address increases at an abnormal rate, or when the number of sessions opened from an IP address exceeds a threshold. Session opening can also detect an attack when the number of inconsistencies or session resets exceeds the configured threshold within the defined time period. This method also identifies as an attack an open session that sends requests that do not include an ASM cookie.
  • Session transactions anomaly captures sessions that request too much traffic compared to the average observed in the web application. Detection is based on counting the transactions per session and comparing that count to the average.

The BIG-IP® system can accurately detect such anomalies only when response caching is turned off.

Task Summary

Prerequisites for configuring web scraping

For web scraping detection to work properly, make sure that your system meets the following prerequisites:

  • The web scraping mitigation feature requires that a DNS server be on the DNS lookup server list. Go to System > Configuration > Device > DNS to see whether a DNS lookup server is on the list. If not, add one and restart the system.
  • Client browsers must have JavaScript enabled and support cookies for anomaly detection to work.
  • Consider disabling response caching. If response caching is enabled, the system does not protect cached content against web scraping.
  • Application Security Manager™ does not perform web scraping detection on legitimate search engine traffic. If your web application has its own search engine, we recommend that you add it to the system: go to Security > Options > Application Security > Advanced Configuration > Search Engines, and add it to the list.

Detecting web scraping based on bot detection

You can mitigate web scraping on the web sites that Application Security Manager™ defends by attempting to determine whether a web client source is human or a web robot. The bot detection method also protects web applications against rapid surfing by limiting how many page changes are allowed within a specified time before the system suspects a bot.
  1. On the Main tab, click Security > Application Security > Anomaly Detection > Web Scraping. The Web Scraping screen opens.
  2. In the Current edited policy list near the top of the screen, verify that the edited security policy is the one you want to work on.
  3. For the Bot Detection setting, select either Alarm or Alarm and Block to indicate how you want the system to react when it detects that a bot is sending requests to the web application. If you choose Alarm and Block, the security policy enforcement mode needs to be set to Blocking before the system blocks web scraping attacks.
    Note: The system can accurately detect a human user only if clients have JavaScript enabled and support cookies in their browsers.
    The screen displays the Bot Detection tab and more settings.
  4. If you want to protect client identification data (when using Bot Detection or Session Opening detection), specify the persistence settings.
    1. Select the Persistent Client Identification check box.
    2. For Persistent Data Validity Period, type how long, in minutes, you want the client data to persist. The default value is 120 minutes.
    Note: This setting enforces persistent storage on the client and prevents easy removal of client data. Be sure that this behavior is compatible with the application privacy policy.
    The system maintains client data and prevents removal of this data from persistent storage for the validity period specified.
  5. For the IP Address Whitelist setting, add the IP addresses and subnets from which traffic is known to be safe.
    Important: The system adds any whitelist IP addresses to the centralized IP address exceptions list. The exceptions list is common to both brute force prevention and web scraping detection configurations.
  6. On the Bot Detection tab, for the Rapid Surfing setting, specify the maximum number of web pages that can be changed in the specified number of seconds before the system suspects a bot. The default value is Maximum 5 page changes per 1000 milliseconds.
  7. For Grace Interval, type the number of requests to allow while determining whether a client is human. The default value is 100.
  8. For Unsafe Interval, type the number of requests that cause the Web Scraping Detected violation if no human activity was detected during the grace period. The default value is 100. Reaching this interval causes the system to reactivate the grace period.
  9. For Safe Interval, type the number of requests to allow after human activity is detected, and before reactivating the grace threshold to check again for non-human clients. The default value is 2000. (The sketch at the end of this task illustrates how the grace, unsafe, and safe intervals interact.)
  10. Click Save to save your settings.
  11. To put the security policy changes into effect immediately, click Apply Policy.
The system checks for rapid surfing, and if too many pages are changed too quickly, it logs Web scraping detected violations in the event log and specifies the attack type Bot Detection.
After setting up bot detection, you can also set up session opening and session transactions anomaly detection for the same security policy.
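
The grace, unsafe, and safe intervals in steps 7 through 9 work together as a per-client counter cycle. The following Python sketch is a minimal illustration of that cycle, assuming a single boolean human-activity signal per request (in the product this comes from the JavaScript and rapid-surfing checks); the class, method, and variable names are hypothetical and simplified, not the product's implementation.

    # Hypothetical sketch of the grace/unsafe/safe interval cycle (steps 7-9).
    # Defaults mirror the screen settings; everything else is assumed.
    class BotDetectorSketch:
        def __init__(self, grace=100, unsafe=100, safe=2000):
            self.grace = grace    # requests allowed while deciding human vs. bot
            self.unsafe = unsafe  # requests flagged after the grace period fails
            self.safe = safe      # requests allowed after human activity is seen
            self.state = "grace"
            self.count = 0

        def on_request(self, human_seen: bool) -> bool:
            """Return True if this request raises Web scraping detected."""
            self.count += 1
            if self.state == "grace":
                if human_seen:                          # client looks human
                    self.state, self.count = "safe", 0
                elif self.count >= self.grace:          # grace exhausted, no human signal
                    self.state, self.count = "unsafe", 0
                return False
            if self.state == "unsafe":
                if self.count >= self.unsafe:           # unsafe interval reached:
                    self.state, self.count = "grace", 0 # reactivate the grace period
                return True                             # unsafe requests are violations
            if self.count >= self.safe:                 # safe interval used up:
                self.state, self.count = "grace", 0     # re-check for non-human clients
            return False

Under the default values, a client that never shows human activity passes 100 grace requests silently, has its next 100 requests flagged, and then re-enters the grace period, as described in steps 7 through 9.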

Detecting web scraping based on session opening

You can configure how the system protects your web application against session opening web scraping violations, which result from too many sessions originating from a specific IP address, from inconsistencies detected in persistent storage, or from a number of session resets that exceeds the configured threshold.
  1. On the Main tab, click Security > Application Security > Anomaly Detection > Web Scraping. The Web Scraping screen opens.
  2. In the Current edited policy list near the top of the screen, verify that the edited security policy is the one you want to work on.
  3. For the Session Opening setting, select either Alarm or Alarm and Block to indicate how you want the system to react when it detects a large increase in the number of sessions opened from a specific IP address, or when the number of session resets or inconsistencies exceeds the set threshold. If you choose Alarm and Block, the security policy enforcement mode needs to be set to Blocking before the system blocks web scraping attacks. The screen displays the Session Opening tab and more settings.
  4. If you want to protect client identification data (when using Bot Detection or Session Opening detection), specify the persistence settings.
    1. Select the Persistent Client Identification check box.
    2. For Persistent Data Validity Period, type how long, in minutes, you want the client data to persist. The default value is 120 minutes.
    Note: This setting enforces persistent storage on the client and prevents easy removal of client data. Be sure that this behavior is compatible with the application privacy policy.
    The system maintains client data and prevents removal of this data from persistent storage for the validity period specified.
  5. For the IP Address Whitelist setting, add the IP addresses and subnets from which traffic is known to be safe.
    Important: The system adds any whitelist IP addresses to the centralized IP address exceptions list. The exceptions list is common to both brute force prevention and web scraping detection configurations.
  6. To detect session opening anomalies by IP address, on the Session Opening tab, select the Session Opening Anomaly check box and adjust the settings.
  7. For the Prevention Policy setting, select one or more options to direct how the system should handle a session opening anomaly attack.
    • Client Side Integrity Defense: When enabled, the system determines whether a client is a legitimate browser or an illegal script by sending a JavaScript challenge to each new session request. Legitimate browsers can respond to the challenge; scripts cannot.
    • Rate Limiting: When enabled, the system drops sessions from suspicious IP addresses after determining that the client is an illegal script. If you select this option, the screen displays an option for dropping requests from IP addresses with a bad reputation.
    • Drop IP Addresses with bad reputation: When enabled, the system drops requests originating from IP addresses that are in the system's IP address intelligence database when the attack is detected; no rate limiting will occur. (Attacking IP addresses that do not have a bad reputation undergo rate limiting, as usual.) This option is available only if you have enabled Rate Limiting. You also need to set up IP address intelligence, and at least one of the IP intelligence categories must have its Alarm flag enabled.
  8. For the Detection Criteria setting, specify the criteria under which the system considers traffic to be a session opening anomaly attack.
    • Sessions opened per second increased by: The system considers traffic to be an attack if the number of sessions opened per second increased by this percentage. The default value is 500%.
    • Sessions opened per second reached: The system considers traffic to be an attack if the number of sessions opened per second is equal to or greater than this number. The default value is 50 sessions opened per second.
    • Minimum sessions opened per second threshold for detection: The system considers traffic to be an attack only if this value and one of the other sessions opened values are exceeded. The default value is 25 sessions opened per second.
    Note: The Detection Criteria values all work together. The minimum sessions value and one of the sessions opened values must be met for traffic to be considered an attack. If the minimum sessions value is not reached, traffic is never considered an attack, even if the Sessions opened per second increased by value is met. (The sketch at the end of this task illustrates how these criteria combine.)
  9. For Prevention Duration, type a number that indicates how long the system prevents an anomaly attack by logging or blocking requests. The default is 1800 seconds. If the attack ends before this number of seconds, the system also stops attack prevention.
  10. If you enabled Persistent Client Identification and you want to detect session opening anomalies based on inconsistencies, select the Device Identification Integrity check box, and set the maximum number of device integrity fault events to allow within a specified number of seconds. The system tracks the number of inconsistent device integrity events within the time specified, and if too many events occur within that time, a Web scraping detected violation occurs.
  11. If you enabled Persistent Client Identification and you want to track cookie deletion events, select the Cookie Deletion Detection check box, and set the maximum number of cookie deletions to allow within a specified number of seconds. The system tracks the number of cookie deletion events that occur within the time specified, and if too many cookies are deleted within that time, a Web scraping detected violation occurs.
  12. Click Save to save your settings.
  13. To put the security policy changes into effect immediately, click Apply Policy.
The system checks for too many sessions being opened from one IP address, too many cookie deletions, and persistent storage inconsistencies. The system logs violations in the web scraping event log, along with information about the attack, including whether the attack type is Session Opening Anomaly by IP Address or Session Resets by Persistent Client Identification, and when the attack began and ended. The log also includes the violation type (Device Identification Integrity or Cookie Deletion Detection) and the violation counts.
After setting up session opening detection, you can also set up bot detection and session transactions anomaly detection for the same security policy.
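
As the note in step 8 explains, the minimum threshold acts as a floor, and once it is met, either the percentage increase or the absolute rate triggers detection. The Python sketch below shows one way those three values can combine; the function name, arguments, and the caller-supplied baseline rate are assumptions for illustration, with defaults mirroring the screen settings.

    # Hypothetical sketch of the session opening Detection Criteria (step 8).
    def is_session_opening_attack(opened_per_sec, baseline_per_sec,
                                  increased_by_pct=500.0,  # Sessions opened per second increased by
                                  reached=50.0,            # Sessions opened per second reached
                                  minimum=25.0):           # Minimum threshold for detection
        # Floor: below the minimum, traffic is never considered an attack,
        # even if the percentage increase is exceeded.
        if opened_per_sec < minimum:
            return False
        # Above the floor, either criterion triggers detection.
        grew = (baseline_per_sec > 0 and
                (opened_per_sec - baseline_per_sec) / baseline_per_sec * 100 >= increased_by_pct)
        return grew or opened_per_sec >= reached

    # 60 sessions/sec against a baseline of 8/sec trips both criteria.
    assert is_session_opening_attack(60, 8)
    # 20 sessions/sec never triggers detection: it is below the 25/sec floor.
    assert not is_session_opening_attack(20, 1)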

Detecting web scraping based on session transactions

You can configure how the system protects your web application against harvesting, which is detected by counting the transactions per session and comparing that number to the average observed in the web application. Session transactions anomaly settings define when to treat an unusually active session as a web scraping attack.
  1. On the Main tab, click Security > Application Security > Anomaly Detection > Web Scraping. The Web Scraping screen opens.
  2. In the Current edited policy list near the top of the screen, verify that the edited security policy is the one you want to work on.
  3. For the Session Transactions Anomaly setting, select either Alarm or Alarm and Block to indicate how you want the system to react when it detects a large increase in the number of transactions per session. If you choose Alarm and Block, the security policy enforcement mode needs to be set to Blocking before the system blocks web scraping attacks. The screen displays the Session Transactions Anomaly tab and more settings.
  4. For the IP Address Whitelist setting, add the IP addresses and subnets from which traffic is known to be safe.
    Important: The system adds any whitelist IP addresses to the centralized IP address exceptions list. The exceptions list is common to both brute force prevention and web scraping detection configurations.
  5. For the Detection Criteria setting, specify the criteria under which the system considers traffic to be a session transactions anomaly attack.
    • Session transactions increased by: The system considers traffic to be an attack if the number of transactions per session increased by this percentage. The default value is 500%.
    • Session transactions reached: The system considers traffic to be an attack if the number of transactions per session is equal to or greater than this number. The default value is 400 transactions.
    • Minimum session transactions threshold for detection: The system considers traffic to be an attack only if the number of transactions per session is equal to or greater than this number, and at least one of the other session transactions values was exceeded. The default value is 200 transactions.
    Note: The Detection Criteria values all work together. The minimum session transactions value and one of the other session transactions values must be met for traffic to be considered an attack. If the minimum threshold is not reached, traffic is never considered an attack, even if the Session transactions increased by value is met. (The sketch at the end of this task illustrates how these criteria combine.)
  6. For Prevention Duration, type a number that indicates how long the system prevents an anomaly attack by logging or blocking requests. The default is 1800 seconds. If the attack ends before this number of seconds, the system also stops attack prevention.
  7. Click Save to save your settings.
  8. To put the security policy changes into effect immediately, click Apply Policy.
When the system detects a session that requests too many transactions (as compared to normal), all transactions from the attacking session cause the Web scraping detected violation until the attack ends or the prevention duration expires.
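
The detection criteria in step 5 combine the same way as in session opening detection: the minimum threshold is a floor, and once a session is above it, either the percentage increase or the absolute count triggers detection. Here is a minimal Python sketch under the same assumptions (hypothetical names, a caller-supplied application average):

    # Hypothetical sketch of the session transactions Detection Criteria (step 5).
    def is_transaction_anomaly(session_txns, avg_txns,
                               increased_by_pct=500.0,  # Session transactions increased by
                               reached=400,             # Session transactions reached
                               minimum=200):            # Minimum session transactions threshold
        if session_txns < minimum:   # floor: below this, never an attack
            return False
        grew = (avg_txns > 0 and
                (session_txns - avg_txns) / avg_txns * 100 >= increased_by_pct)
        return grew or session_txns >= reached

    # 360 transactions against an application average of 50 is a 620% increase
    # above the 200-transaction floor, so the session is flagged.
    assert is_transaction_anomaly(360, 50.0)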

Displaying web scraping event logs

You can display event logs to see whether web scraping attacks have occurred, and view information about the attacks.
  1. On the Main tab, click Security > Event Logs > Application > Web Scraping Statistics. The Web Scraping Statistics event log opens.
  2. If the log is long, you can filter the list by security policy and time period to show more specific entries.
  3. Review the list of web scraping attacks to see the web scraping attack type that occurred, the IP address of the client that caused the attack, which security policy detected the attack, and the start and end times of the attack.
  4. Examine the web scraping statistics shown, and click the attack type links to see what caused the attack.
  5. To learn more about the requests that caused the web scraping attack, click the number of violating requests. The Requests screen opens where you can investigate the requests that caused the web scraping attacks.

Web scraping statistics figure

This figure shows a Web Scraping Statistics event log where a persistent storage violation occurred. You can click the attack type to view additional details about the cause of the attack (as shown in the figure).

Figure: Web scraping statistics screen

The figure shows that a web scraping attack occurred on 2-19-2013 from 10:08 to 10:13. It was caused by too many session resets (more than 10 in 93 seconds) and inconsistencies (more than 3 in 60 seconds).
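
Counts like the ones in the figure are windowed: the system tracks how many events (session resets, device integrity faults, or cookie deletions) occur within a configured number of seconds, and a violation occurs when the configured maximum is exceeded. The following sliding-window counter is a minimal Python sketch of that idea; the class and names are hypothetical, not the product's implementation.

    # Hypothetical sliding-window counter for events such as session resets,
    # cookie deletions, or device identification integrity faults.
    from collections import deque

    class WindowedEventCounter:
        def __init__(self, max_events, window_seconds):
            self.max_events = max_events
            self.window = window_seconds
            self.events = deque()            # timestamps of recent events

        def record(self, now):
            """Record one event; return True if the threshold is exceeded."""
            self.events.append(now)
            # Drop events that have fallen out of the time window.
            while self.events and now - self.events[0] > self.window:
                self.events.popleft()
            return len(self.events) > self.max_events

    # Mirroring the figure: more than 10 session resets within 93 seconds.
    resets = WindowedEventCounter(max_events=10, window_seconds=93)
    assert any(resets.record(t) for t in range(12))  # 12 resets in 11 seconds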

Web scraping attack types

Web scraping statistics specify the attack type so that you have more information about why the attack occurred. This table shows the web scraping attack types that can appear in the web scraping event log.

• Bot Detection: Indicates that the system suspects that the web scraping attack was caused by a web robot.
• Session Opening Anomaly by IP: Indicates that the web scraping attack was caused by too many sessions being opened from one IP address. Click the attack type link to display the number of sessions opened per second from the IP address, the number of legitimate sessions, and the attack prevention state.
• Session Resets by Persistent Client Identification: Indicates that the web scraping attack was caused by too many session resets or inconsistencies occurring within a specified time. Click the attack type link to display the number of resets or inconsistencies that occurred within a number of seconds.
• Transactions per session anomaly: Indicates that the web scraping attack was caused by too many transactions occurring during one session. Click the attack type link to display the number of transactions detected on the session.
• Transparent Mode CS injection ratio: Indicates that there are more JavaScript injections than JavaScript replies. Click the attack type link to display the detected injection ratio and the injection ratio threshold.
Note: You cannot configure the Transparent Mode CS injection ratio values. This attack type can occur only when the security policy is in Transparent mode.

Viewing web scraping statistics

Before you can look at the web scraping attack statistics, you need to have configured web scraping protection.
You can display charts that show information about web scraping attacks that have occurred against protected applications.
  1. On the Main tab, click Security > Reporting > Application > Web Scraping Statistics. The Web Scraping Statistics screen opens.
  2. From the Time Period list, select the time period for which you want to view information about web scraping attacks.
  3. If you want to export the report to a file or send it by email, click Export and select the options. To send reports by email, you need to specify an SMTP configuration (System > Configuration > Device > SMTP).
The statistics show the total number of web scraping attacks, violations, and rejected requests that occurred. You can review the details about the attacks and see that mitigation is in place.

Implementation Result

When you have completed the steps in this implementation, you have configured the Application Security Manager™ to protect against web scraping. Depending on your configuration, the system detects web scraping attacks based on bot detection, session opening violations, or session transaction violations.

After traffic is flowing to the system, you can check whether web scraping attacks are being logged or prevented, and investigate them by viewing web scraping event logs and statistics.

If you chose Alarm and Block for the web scraping configuration and the security policy is in the blocking operation mode, the system drops requests that cause the Web scraping detected violation. If you chose Alarm only (or the policy is in transparent mode), web scraping attacks are logged but not blocked.

