Manual Chapter: Administrator Guide for the BIG-IP WebAccelerator Module: 6 - Monitoring with SNMP
6

Monitoring with SNMP


Introducing SNMP

Simple Network Management Protocol (SNMP) is an industry-standard protocol that gives a standard SNMP management system the ability to remotely manage a device on the network. One of the devices that an SNMP management system can manage is a WebAccelerator system.

The SNMP versions that the BIG-IP system supports are: SNMP v1, SNMP v2c, and SNMP v3. The WebAccelerator system implementation of SNMP is based on a well-known SNMP package called Net-SNMP, which was formerly known as UCD-SNMP.

A standard SNMP implementation consists of an SNMP manager, which runs on a management system and makes requests to a device, and an SNMP agent, which runs on the managed device and fulfills those requests. SNMP device management is based on the standard management information base (MIB) known as MIB-II, as well as object IDs and MIB files.

The MIB defines the standard objects that you can manage for a device, presenting those objects in a hierarchical, tree structure. Each object defined in the MIB has a unique object ID (OID), written as a series of integers. The OID indicates the location of the object within the MIB tree.

A set of MIB files resides on both the SNMP manager system and the managed device. MIB files specify values for the data objects defined in the MIB. This set of MIB files consists of standard SNMP MIB files and enterprise MIB files. Enterprise MIB files are those MIB files that pertain to a particular company, such as F5 Networks, Inc.

Typical SNMP tasks that an SNMP manager performs include polling for data about a device, receiving notifications from a device about specific events, and modifying writable object data.

You can use SNMP to monitor the following WebAccelerator system operations.

  • Process activities
  • Memory usage
  • Disk usage
  • Critical errors
Note

This chapter assumes that you already have a network management system (NMS) in place for the site's infrastructure. Due to the wide range of NMS products on the market today, this guide cannot describe how to integrate the MIBs into your particular NMS. For information, see the documentation that came with your specific NMS product.

Disabling and enabling SNMP

In the WebAccelerator system, SNMP monitoring is enabled by default.

To disable SNMP

  1. Log in to the BIG-IP system as root.
  2. Edit the /config/wa/pvsystem.conf file.
  3. Locate the enableMonitor parameter and change it to false.
  4. Restart the WebAccelerator system processes by typing the following command:

    bigstart restart comm_srv pvac

To enable SNMP

  1. Log in to the BIG-IP system as root.
  2. Edit the /config/wa/pvsystem.conf file.
  3. Locate the enableMonitor parameter and change it to true.
  4. Optionally, locate the monitorCommunity parameter and change its value to meet your site's requirements. The default value is Public.
  5. Restart the WebAccelerator system processes by typing the following command:

    bigstart restart comm_srv pvac

SNMP ports

Every process in the WebAccelerator system uses a different SNMP port to monitor the various communications queues, as described in Monitoring the communications system.

The following table identifies processes and ports used for SNMP monitoring.

Table 6.1 SNMP ports

Process                                                      SNMP Port
Communications manager                                       12400
Administration Server                                        12402
Server communications manager (both primary and secondary)   12406
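
The port assignments in Table 6.1 can be kept in a small lookup so that a polling script targets the correct agent. The following is a minimal Python sketch, not part of the product. It only builds a Net-SNMP snmpwalk command line (SNMP v2c, with a host:port agent address) and does not run it. The dictionary keys, the host name, the community string, and the OID argument are placeholders chosen for illustration.

```python
# Ports taken from Table 6.1. The keys are illustrative labels,
# not identifiers defined by the WebAccelerator system.
SNMP_PORTS = {
    "communications_manager": 12400,
    "administration_server": 12402,
    "server_communications_manager": 12406,
}


def build_snmpwalk_command(host, process, community, oid):
    """Return an snmpwalk argument list for the agent of one process."""
    port = SNMP_PORTS[process]
    # Net-SNMP accepts an agent address of the form host:port.
    return ["snmpwalk", "-v", "2c", "-c", community, f"{host}:{port}", oid]


if __name__ == "__main__":
    # "wa1.example.com", "public", and "system" are placeholder values.
    command = build_snmpwalk_command(
        "wa1.example.com", "administration_server", "public", "system")
    print(" ".join(command))
```

Returning an argument list rather than a single string makes it straightforward to pass the command to a process launcher without shell quoting issues.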

Suggested SNMP objects

The following table highlights the MIB objects that we recommend you monitor when running the WebAccelerator system in a production environment.

For more detailed information about these objects, see Monitoring WebAccelerator system objects.

Table 6.2 Suggested SNMP objects to monitor

SNMP Object                Description

WA::numReqRejected         Used to monitor connection throughput.
WA::numPendingReq          For more information, see Monitoring connection throughput.

WA::imcHitCount            Used to monitor the size and usage of the on-disk and in-memory caches.
WA::hdsHitCount            For more information, see Monitoring the cache.
WA::lruLength
WA::lruItemBulk
WA::lruPops

WA::sizeCompileQueue       Used to monitor the size of the compile queue.
                           For more information, see Monitoring the compile queue.

WA::sizeESICompileQueue    Used to monitor the size of the ESI compile queue.
                           For more information, see Monitoring the ESI compile queue.

WA::sizeAssembleQueue      Used to monitor the size of the assembly queue.
                           For more information, see Monitoring the assembly queue.

WA::invHashTable           Used to monitor the number of invalidation objects.
                           For more information, see Monitoring invalidation objects.

WA::msgQueueDepth          Used to monitor the size and health of the communications channels.
WA::postQueueDepth         For more information, see Monitoring the communications system.
WA::numDupRecv
WA::numExpired

Monitoring WebAccelerator system objects

You can use SNMP to monitor the following WebAccelerator system objects:

  • Connection throughput
  • Cache
  • Compile queue
  • ESI compile queue
  • Assembly queue
  • Invalidation objects
  • Communications system
Note

For specific information about Edge-side includes (ESI) support, see the Policy Management Guide for the BIG-IP WebAccelerator Module.

Monitoring connection throughput

Use the following SNMP objects to monitor the WebAccelerator system's connection load:

  • WA::numReqRejected
    Identifies the number of connection requests rejected by the WebAccelerator system. The WebAccelerator system rejects additional connections when it reaches the maximum number of connections it can service.
  • WA::numPendingReq
    Identifies the number of requests that have been accepted by the WebAccelerator system, but are not currently being processed. A high number of pending requests indicates that the WebAccelerator system is not keeping up with the traffic. This can be caused by:
    • Disk access performance
      This can occur if you are using a Network File System (NFS) to access the on-disk cache (hds), and the NFS server is too slow to keep up with the WebAccelerator system's disk reads.
    • Large numbers of assembly events (not caused by ESI includes)
      Assembly events caused by parameter substitution rules, parameter value randomizers, and ESI operations that do not use ESI include statements. Any request for an object that uses ESI includes is given to a separate thread pool for processing.
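
The two throughput objects above lend themselves to a simple per-interval check. The sketch below is illustrative only. It assumes that WA::numReqRejected is a running total that only grows between polls, and the pending_limit default of 100 is an arbitrary example threshold, not a product default. How the values are fetched from the agent is left out.

```python
def check_throughput(prev_rejected, curr_rejected, pending, pending_limit=100):
    """Return a list of warnings for one polling interval.

    pending_limit is an arbitrary illustrative threshold.
    """
    warnings = []
    # Assumption: the rejected count is cumulative, so the difference
    # between two polls is the number rejected during the interval.
    newly_rejected = curr_rejected - prev_rejected
    if newly_rejected > 0:
        warnings.append(f"{newly_rejected} connection requests rejected")
    if pending > pending_limit:
        warnings.append(f"{pending} requests pending")
    return warnings


if __name__ == "__main__":
    print(check_throughput(prev_rejected=5, curr_rejected=8, pending=150))
```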

Monitoring the cache

Monitoring the cache involves watching both the in-memory and on-disk cache to see how effectively they are being used. Use the following SNMP objects to monitor cache:

  • WA::imcHitCount
  • WA::hdsHitCount
  • WA::lruLength
  • WA::lruItemBulk
  • WA::lruPops

Hit counts

WA::imcHitCount identifies the number of cache hits counted against the in-memory cache and WA::hdsHitCount identifies the number of cache hits counted against the on-disk cache. These numbers fluctuate over time as the content expires and the WebAccelerator system sends a request for fresh content. Cache invalidation events can also lower these numbers.

When you are monitoring hit counts, the following conditions are serious and need immediate attention.

  • If both counts are zero, the WebAccelerator system is not serving content from cache. Assuming normal traffic loads for the site, this condition usually indicates that there are broken acceleration policies.
  • If the WA::hdsHitCount is zero and the WA::imcHitCount is non-zero, it represents an internal error condition. If this occurs, contact F5 Networks Technical Support for assistance.
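
The two serious conditions above can be expressed as a small decision function. This is a minimal Python sketch under the assumption that both counts have already been read from the agent as integers. The function name and the returned strings are illustrative.

```python
def classify_hit_counts(imc_hits, hds_hits):
    """Map the two hit counts to the conditions described in the text."""
    if imc_hits == 0 and hds_hits == 0:
        # Nothing is being served from cache.
        return "not serving from cache: check acceleration policies"
    if hds_hits == 0 and imc_hits > 0:
        # Described in the text as an internal error condition.
        return "internal error: contact F5 Networks Technical Support"
    return "ok"


if __name__ == "__main__":
    print(classify_hit_counts(imc_hits=120, hds_hits=0))
```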

In-memory cache size

WA::lruLength, WA::lruItemBulk, and WA::lruPops are related to the size of the in-memory cache. You can limit the size of the in-memory cache using the pvsystem.conf imcMaxFootprint parameter. The WebAccelerator system continuously puts new objects into the in-memory cache until it either has no more unique objects to cache there, or it reaches the size limit set by the imcMaxFootprint parameter. If the WebAccelerator system reaches the maximum limit for the in-memory cache, it removes (or pops) the cached objects that have not been accessed for the longest period of time.

If the WebAccelerator system receives a request for an object that it has popped out of in-memory cache, it has to service that request from the on-disk cache. Obtaining data from disk is much slower than obtaining it from memory. If you are seeing a large number of pops as reported by WA::lruPops, increase the size of the in-memory cache (if local system resources allow it) to improve performance.

If you cannot increase the size of the in-memory cache footprint, you may be able to reduce the number of pops by reducing the number of unique objects that the WebAccelerator system is caching. You can accomplish this through content variation rules, as follows.

  • Examine your site to see if there are any legacy query parameters or query parameter values that do not affect content. If so, identify those parameters in the content variation rules.
  • Examine current content variation rules to see if they have identified any parameters, such as cookies or user agent values, that do affect content. If so, remove them from the acceleration policy.

WA::lruItemBulk represents the current footprint size of the in-memory cache. It should never be larger than the value set by the imcMaxFootprint parameter.

WA::lruLength represents the number of objects that are stored in the in-memory cache. Assuming that you are not exceeding the memory available to the in-memory cache, you should see this value fluctuate as objects expire from the cache in accordance with the TTL settings or invalidation objects.
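
The relationships among the in-memory cache objects can be checked with a few comparisons. The following sketch makes several assumptions that the text does not confirm: that WA::lruItemBulk and imcMaxFootprint are expressed in the same units, that WA::lruPops is a running total, and that 100 pops per polling interval (pops_limit) is a reasonable example threshold.

```python
def check_memory_cache(lru_item_bulk, imc_max_footprint,
                       pops_prev, pops_curr, pops_limit=100):
    """Return warnings about the in-memory cache for one polling interval."""
    warnings = []
    # The footprint should never exceed the configured maximum.
    if lru_item_bulk > imc_max_footprint:
        warnings.append("lruItemBulk exceeds imcMaxFootprint")
    # Assumption: lruPops is cumulative; pops_limit is illustrative.
    if pops_curr - pops_prev > pops_limit:
        warnings.append("high pop rate: enlarge the in-memory cache "
                        "or reduce the number of unique objects")
    return warnings


if __name__ == "__main__":
    print(check_memory_cache(900, 1000, pops_prev=10, pops_curr=400))
```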

Note

For specific information about configuring variation rules, see Chapter 4, Configuring Variation Rules, of the Policy Management Guide for the WebAccelerator Module.

Monitoring the compile queue

Each response that is compiled is first placed in the compile queue. You can monitor the number of objects in the compile queue using the WA::sizeCompileQueue SNMP object.

The compile queue is affected by the size of the pages that it must compile. If a site responds to a request with an erroneously large page, the compile queue grows as it attempts to handle the larger page. Therefore, if you experience unexpected spikes in the compile queue, examine your site's code to see if it is building pages correctly.

Compilation is a very efficient process; therefore, the compile queue is typically small. Most sites normally see a compile queue size of 0 or 1. If the compile queue is consistently large (greater than 5), it indicates that the WebAccelerator system is not keeping up with the workload. If this is occurring, you can attempt to resolve it by changing the number of threads assigned to the compile queue to 2 (by default, only 1 thread is used). You do this using the pvsystem.conf maxParserThreads parameter. See Managing the compile queues, located in Chapter 5, for more information.

The maximum number of responses allowed in the compile queue is defined by the pvsystem.conf maxParserTasks parameter. By default, this parameter is set to 1000. If the compile queue is at this limit, the WebAccelerator system is failing to cache some responses. This can result in a greater than anticipated load on the origin servers. In this situation, reducing the size of the queue also reduces the load on the origin servers.
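
The guidance for the compile queue can be summarized as a classification over several recent samples of WA::sizeCompileQueue. This is a minimal sketch: the threshold of 5 and the default limit of 1000 come from the text, while treating "consistently large" as "every sample exceeds the threshold" is an interpretation made for this example.

```python
def classify_compile_queue(samples, max_parser_tasks=1000, large_threshold=5):
    """Classify recent queue-size samples (oldest first)."""
    if not samples:
        raise ValueError("at least one sample is required")
    if max(samples) >= max_parser_tasks:
        # At the limit, some responses are not being cached.
        return "at limit"
    if all(size > large_threshold for size in samples):
        # The system is not keeping up with the workload.
        return "consistently large"
    return "normal"


if __name__ == "__main__":
    print(classify_compile_queue([6, 8, 7]))
```

The same function could be applied to the ESI compile queue by passing the value of maxESIParserTasks as the limit.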

Monitoring the ESI compile queue

A surrogate-control header indicates that a response uses ESI markup. When the WebAccelerator system sends a request to the origin servers and receives a response that uses the surrogate-control header, it places the request in the ESI compile queue for compilation into a compiled response. (See Monitoring the compile queue for more information about compile queues.)

You can use the WA::sizeESICompileQueue SNMP object to monitor the number of objects in the ESI compile queue. The size of this queue differs from site to site, depending on the size of the templates and fragments that you are compiling. The number of fragments (that is, the number of ESI includes) that the WebAccelerator system is processing also affects the size of this queue.

The maximum number of responses that can be in the ESI compile queue is defined by the pvsystem.conf maxESIParserTasks parameter. By default, this parameter is set to 1000. If the ESI compile queue reaches this limit, the WebAccelerator system is failing to cache some responses. The result can be a greater than anticipated load on the origin servers. In this situation, reducing the size of the queue also reduces the load on the origin servers.

Use the following tips to reduce the size of the ESI compile queue.

  • Examine your site's code. There should be no recursive includes.
  • Examine your site's code to see if you can reduce the size or number of fragments you are processing.
  • Try increasing the number of threads used to process the ESI compile queue to 2 (by default, only 1 thread is used). You do this using the pvsystem.conf maxEsiParserThreads parameter. See Managing the compile queues, located in Chapter 5, for more information.
  • As with the compile queue, a sudden spike in the size of the ESI compile queue can indicate a problem with the code that generates your site. Therefore, if you experience unexpected spikes in the ESI compile queue, examine your site's code to see if it is building pages in an appropriate manner.

Monitoring the assembly queue

You can monitor the size of the assembly queue using the WA::sizeAssembleQueue SNMP object. Assembly is a very efficient process. For sites that are not using ESI, the assembly queue should be quite small. Most sites normally have a queue size of 0 or 1.

If you are using ESI, the size of the assembly queue can grow in direct proportion to the complexity of the ESI instructions. If the assembly queue grows too large, users will perceive a reduction in your site's responsiveness. For this reason, it is a good idea to keep the assembly queue as small as possible. If your site seems to be responding too slowly and the assembly queue is persistently large, reduce the size of the queue.

The maximum number of requests that can be in the assembly queue is defined by the pvsystem.conf maxAssembleTasks parameter. By default, this parameter is set to 1000. If the assembly queue reaches this limit, then clients are receiving HTTP service not responding errors. In this situation, use the following tips to reduce the size of the assembly queue:

  • Examine your site's code. There should be no recursive ESI includes.
  • Examine your site's code to see if you can reduce the size or number of ESI fragments that you are processing.
  • Try increasing the number of threads used to process the assembly queue to 2 (by default, only 1 thread is used). You do this using the pvsystem.conf maxAssembleThreads parameter. See Managing the assembly queue, located in Chapter 5, for more information.

Monitoring invalidation objects

By monitoring invalidation objects, you can ensure that invalidation objects are properly propagating to the WebAccelerator.

Note

Different sites rely on cache invalidation to a greater or lesser extent. If you are using manual invalidation only occasionally to clear parts of the cache for some sites, you probably do not need to monitor those invalidation objects as closely as you do those sites for which you are heavily using invalidation triggers or ESI cache invalidation.

You can monitor the number of invalidations using the WA::invHashTable object. This object provides a table of hashes that represents all the invalidation requests in the system at the moment that each time bucket is captured. The table has one row and 10 columns. All but the last (far-right) column represent a 10-second bucket. The far-right column represents the fraction of a bucket that had elapsed when you requested the table. For example, if you requested the table at 10:50:43, then the last column represents a 3-second bucket.
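
The size of the partial bucket follows directly from the request time. The sketch below assumes, consistent with the 10:50:43 example, that buckets are aligned to 10-second boundaries of the clock. The function name is illustrative.

```python
from datetime import time

BUCKET_SECONDS = 10  # width of a full bucket, as stated in the text


def partial_bucket_seconds(requested_at):
    """Return how many seconds the last (far-right) column covers."""
    # Assumption: buckets start on multiples of 10 seconds.
    return requested_at.second % BUCKET_SECONDS


if __name__ == "__main__":
    print(partial_bucket_seconds(time(10, 50, 43)))  # prints 3
```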

Depending on how quickly invalidation objects propagate through your system, the right-most columns might not show the same hash values system-wide. The first 5 to 7 columns should be identical, but are dependent on local conditions such as network latency and the number of invalidation objects that are being created.
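
One way to compare invalidation state between two hosts is to count how many leading columns of their hash rows agree. The following is a sketch that assumes each row has already been retrieved as a list of 10 hash values, oldest bucket first. The retrieval itself and the sample values are not taken from the product.

```python
def matching_leading_columns(row_a, row_b):
    """Count the columns, from the left, that hold identical hashes."""
    count = 0
    for hash_a, hash_b in zip(row_a, row_b):
        if hash_a != hash_b:
            break
        count += 1
    return count


if __name__ == "__main__":
    # Placeholder hash values for illustration only.
    host_a = ["a1", "b2", "c3", "d4", "e5", "f6", "g7", "h8", "i9", "j0"]
    host_b = ["a1", "b2", "c3", "d4", "e5", "f6", "x7", "y8", "z9", "w0"]
    print(matching_leading_columns(host_a, host_b))  # prints 6
```

A result below 5 for two hosts that should be in step would suggest that invalidation objects are propagating more slowly than the text describes as typical.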

For more information about ESI and invalidation see the Policy Management Guide for the BIG-IP WebAccelerator Module.

Monitoring the communications system

The WebAccelerator system uses the communications managers to communicate between processes. It is critical that these inter-process communications are working correctly. Any failure can mean that the acceleration policies or the invalidation objects did not propagate. A misconfigured communications system can cause hit and change log files to build up in the WebAccelerator system's archive directory.

Monitoring the communications channel

When monitoring the communications channel, watch for:

  • A message queue that is always increasing
    This indicates messages are not leaving the process. You can monitor the message queue depth using the WA::msgQueueDepth SNMP object.
  • A retry queue depth that is more than 10 or 20 messages in size
    This can indicate a failure to receive acknowledgements. The retry queue depth can be monitored using the WA::postQueueDepth SNMP object.
  • The number of duplicate messages received by a process is more than 10 or 12
    This can indicate a failure to send acknowledgements. You can monitor the number of duplicate messages using the WA::numDupRecv SNMP object.
  • Whether there are any expired messages noted
    This indicates high network latency, or that system clocks are out of sync. You can monitor the number of expired messages that have been processed using the WA::numExpired SNMP object.
Note

Some queues, especially the message queue, can grow larger than is indicated if you invalidate a large number of objects.
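
The four warning signs for the communications channel can be combined into one check. This sketch uses the upper figures from the text (20 for the retry queue, 12 for duplicates) as default limits, and interprets "always increasing" as every sample being larger than the one before it. Both choices are assumptions made for illustration, and as the note above says, heavy invalidation activity can legitimately inflate the queues.

```python
def check_channel(msg_depths, retry_depth, dup_received, expired,
                  retry_limit=20, dup_limit=12):
    """Return warnings for one process.

    msg_depths holds successive WA::msgQueueDepth samples, oldest first.
    """
    warnings = []
    pairs = list(zip(msg_depths, msg_depths[1:]))
    if pairs and all(later > earlier for earlier, later in pairs):
        warnings.append("message queue is always increasing")
    if retry_depth > retry_limit:
        warnings.append("retry queue is large: acknowledgements may not be received")
    if dup_received > dup_limit:
        warnings.append("many duplicates: acknowledgements may not be sent")
    if expired > 0:
        warnings.append("expired messages: check network latency and clock sync")
    return warnings


if __name__ == "__main__":
    print(check_channel([4, 9, 15], retry_depth=2, dup_received=0, expired=0))
```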

Troubleshooting the communications system

Most failures in the communications system are caused by one of the following:

  • A process was unable to obtain the port that is defined for it in the pvsystem.conf file.
    This most likely occurs because a process did not shut down properly in the past. Use the netstat(8) command to see what process is already using the port.
  • A process was unable to connect to its upstream (parent) process.
    This usually happens when the upstream process is on a remote host. Possible reasons for the failure are:
    • The upstream process is not running
      In this case, starting it should correct the problem.
    • The upstream process is not using the expected port
      In this case, you must correct the pvsystem.conf file on either the upstream or downstream systems.
    • Misconfigured network
      Make sure the two systems can ping each other. Make sure the host names identified in pvsystem.conf are configured in DNS and that they resolve to the correct IP addresses.
    • Misconfigured firewall
      The firewall must be open for TCP/IP traffic, both incoming and outgoing, for all relevant hosts and ports.
  • SSL certificate issues
    By default, the communications system uses certificate-based encrypted traffic (SSL v3). If the certificates in use are corrupted or expired, or, in certain pathological cases, not yet valid, then communications cannot be established.
    If you suspect a corrupted or invalid certificate, check the validity of the certificate by using the following command on both the upstream and downstream systems:

    openssl verify /usr/local/wa/config/ssl/pvssl.pem

    If the last line of output is not OK, there is a problem with the certificate. Note that Error 18 (self signed certificate) is reported for all default WebAccelerator system installations. This is considered an acceptable error and openssl still reports OK in this situation.

  • Time is not synchronized
    The communications system is sensitive to time synchronization issues. For this reason, the WebAccelerator system uses NTP to keep all system clocks synchronized. However, in some circumstances the clocks might be too far out of sync for NTP to be able to correct the differential. The limit is 1000 seconds. This limit can be exceeded if there is a hardware clock failure on a host.

Always check the time set for the hosts in your WebAccelerator system installation whenever you see communication problems. If they are not synchronized, make sure that NTP is running on each host and check to see if NTP is reporting any error conditions in /var/log/wa/messages on Linux.
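
Two of the checks from the troubleshooting list above, the certificate verification and the clock offset limit, can be scripted. The sketch below is illustrative: it treats the output as passing when its last non-empty line ends with OK, and it treats an offset of exactly 1000 seconds as still correctable, which the text does not specify. The function names are hypothetical.

```python
import subprocess

NTP_CORRECTION_LIMIT_SECONDS = 1000  # limit stated in the text


def verify_output_ok(output):
    """Return True if the last non-empty line of the output ends with OK."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    return bool(lines) and lines[-1].endswith("OK")


def run_openssl_verify(path="/usr/local/wa/config/ssl/pvssl.pem"):
    """Run openssl verify on the certificate and check its standard output."""
    result = subprocess.run(["openssl", "verify", path],
                            capture_output=True, text=True)
    return verify_output_ok(result.stdout)


def clock_offset_correctable(offset_seconds):
    """Return True if the offset is within the stated correction limit."""
    # Assumption: an offset of exactly 1000 seconds is still correctable.
    return abs(offset_seconds) <= NTP_CORRECTION_LIMIT_SECONDS


if __name__ == "__main__":
    print(clock_offset_correctable(1500))  # prints False
```

Run the certificate check on both the upstream and downstream systems, as the text directs.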

Monitoring operating system objects

The objects used to monitor WebAccelerator system processes are located in /usr/local/wa/mib. In addition to the various WebAccelerator system processes, you should also monitor the hosts on which those processes are running. This involves watching CPU, memory, and disk utilization to ensure you are not overloading your systems. F5 Networks recommends that you use the ucd-snmp MIB for this purpose, to monitor the following objects.

Table 6.3 Recommended operating system objects to monitor

Object                    Description

if::ifOutOctets           Identifies the amount of data transmitted out of the network interface.
                          If this value stops increasing, it indicates that there is either a problem with the WebAccelerator system processes or with your network in general.

ucd-snmp::dskAvail        Identifies the available space on disk.
                          Available disk space can become too low because (1) the on-disk cache is filling up its partition, or (2) the log files are filling up their partition.
                          If the on-disk cache is growing too large for its partition because the WebAccelerator system is caching more objects than there is available disk space, you can resolve the problem by increasing the size of the on-disk cache partition. If the on-disk cache is growing too large because it is not being pruned correctly (that is, objects are not being deleted from disk), verify that the hds_prune script is operating correctly.

ucd-snmp::dskPercent      Identifies the amount of space used on disk as a percentage of the total size of the disk.
                          You can use this as another way of identifying the same problems described for ucd-snmp::dskAvail.

ucd-snmp::laLoadInt       Identifies CPU load.

ucd-snmp::memAvailReal    Identifies the amount of unused memory available on the host.
                          It is especially important to make sure that the WebAccelerator system is not swapping, because this slows it down. If you see the available memory approaching 0, consider reducing the size of the in-memory cache.
                          Note that for Linux systems, available real memory is almost always identified as 0. For this reason, it is better to monitor the amount of available swap space (ucd-snmp::memAvailSwap, described in the next row) to see if Linux systems are swapping.

ucd-snmp::memAvailSwap    Identifies the amount of unused swap space available on the host.

ucd-snmp::prErrorFlag     Identifies if a process is reporting an error.
                          Use this object to verify that the various WebAccelerator processes are operating without issue. If any process is reporting a prErrorFlag value other than 0, investigate that process to see if there is a problem that requires your attention.
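
Two of the host-level objects in Table 6.3 are simple to evaluate once polled. The sketch below assumes the prErrorFlag values have been collected into a dictionary keyed by process name and that dskPercent is an integer percentage. The 90 percent disk_limit is an arbitrary example threshold, and the process names in the usage line are used only as sample keys.

```python
def check_host(pr_error_flags, dsk_percent, disk_limit=90):
    """Return warnings based on prErrorFlag values and disk usage."""
    warnings = []
    for process, flag in sorted(pr_error_flags.items()):
        # Any value other than 0 means the process reports an error.
        if flag != 0:
            warnings.append(f"process {process} reports an error")
    # disk_limit is an illustrative threshold, not a product default.
    if dsk_percent > disk_limit:
        warnings.append(f"disk {dsk_percent}% full")
    return warnings


if __name__ == "__main__":
    print(check_host({"pvac": 0, "comm_srv": 1}, dsk_percent=95))
```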



