sol12531: Troubleshooting health monitors

Original Publication Date: 02/01/2011
Updated Date: 04/29/2014

Overview

A monitor is an important BIG-IP feature that verifies connections to pool members or nodes. A health monitor is designed to report the status of a pool, pool member, or node on an ongoing basis, at a set interval. When a health monitor marks a pool, pool member, or node down, the BIG-IP system stops sending traffic to the device.

A failing or misconfigured health monitor may cause traffic management issues, including but not limited to the following:

  • Connections to the virtual server are interrupted or fail.
  • Web pages or applications fail to load or execute.
  • Certain pool members or nodes receive more connections than others.

These symptoms may indicate that a health monitor is marking a pool, pool member, or node down indefinitely, or that a monitor is repeatedly marking a pool member or node down and then back up (often referred to as a bouncing pool member or node). For example, if a misconfigured health monitor constantly marks pool members down and then back up, connections to the virtual server may be interrupted or fail altogether. You must then determine whether the monitor is misconfigured, the device or application is failing, or some other factor, such as a network issue, is causing the monitor to fail. The troubleshooting steps you take depend on the monitor type and the observed symptoms.

When you experience health monitor issues, you can use the troubleshooting steps described in the following sections:

Identifying a failing health monitor

The BIG-IP software includes utilities (such as the Configuration utility, command line utilities, and SNMP) that can alert an administrator or help identify when a health monitor marks a pool, pool member, or node down. These utilities are described in the following sections.

Configuration utility

The following table lists Configuration utility pages where you can check the status of pools, pool members, and nodes:

Configuration utility page   Description                                 Location
Network map                  Summary of pools, pool members, and nodes   Local Traffic > Network Map > Show Map
Pools                        Current status of pools and pool members    Local Traffic > Pools > Statistics
Pool members                 Current status of pools and pool members    Local Traffic > Pools > Statistics
Nodes                        Current status of nodes                     Local Traffic > Nodes > Statistics

Command line utilities

The following table lists command line utilities that allow you to monitor the status of pools, pool members, and nodes:

CLI utility         Description                                                    Example commands
bigtop              Live statistics for pool members and nodes                     bigtop -n
bigpipe             Statistical information about pools, pool members, and nodes   bigpipe pool show
                                                                                   bigpipe node show
tmsh (10.x, 11.x)   Statistical information about pools, pool members, and nodes   tmsh show /ltm pool <pool_name>
                                                                                   tmsh show /ltm node <node_IP>

Note: For more information about the bigtop utility, refer to SOL7318: Overview of the bigtop utility.

Note: For more information about the bigpipe utility, refer to the BIG-IP Command Line Interface Guide for BIG-IP 9.4.x, and the Bigpipe Utility Reference Guide for BIG-IP 10.x.

Note: For more information about the tmsh utility, refer to the Traffic Management Shell (tmsh) Reference Guide.

Logs

The BIG-IP system logs messages related to health monitors to the /var/log/ltm file. Reviewing the log files is one way to determine how frequently the system is marking pool members and nodes down. Log messages related to monitor state changes appear as follows (an example command for summarizing these messages appears after the list):

  • Pools

    When a health monitor marks all members of a pool down or up, messages that appear similar to the following example are logged to the /var/log/ltm file:

    tmm err tmm[4779]: 01010028:3: No members available for pool <Pool_name>
    tmm err tmm[4779]: 01010221:3: Pool <Pool_name> now has available members

  • Pool members

    When a health monitor marks pool members down or up, messages that appear similar to the following example are logged to the /var/log/ltm file:

    notice mcpd[2964]: 01070638:5: Pool member <ServerIP_port> monitor status down
    notice mcpd[2964]: 01070727:5: Pool member <ServerIP_port> monitor status up.

  • Nodes

    When a health monitor marks a node down or up, messages that appear similar to the following example are logged to the /var/log/ltm file:

    notice mcpd[2964]: 01070640:5: Node <ServerIP> monitor status down.
    notice mcpd[2964]: 01070728:5: Node <ServerIP> monitor status up.
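To gauge how frequently a specific node is bouncing, you can count its down messages in the current log file. For example, the following command (using the hypothetical node address 10.10.65.1) counts the down transitions logged for that node:

# grep -c 'Node 10.10.65.1 monitor status down' /var/log/ltm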

SNMP

When the BIG-IP system is configured to send SNMP traps and a health monitor marks a pool member or node down or up, the system sends the following traps:

  • Pool members

    alert BIGIP_MCPD_MCPDERR_POOL_MEMBER_MON_STATUS {
    snmptrap OID=".1.3.6.1.4.1.3375.2.4.0.10"
    }
    alert BIGIP_MCPD_MCPDERR_POOL_MEMBER_MON_STATUS_UP {
    snmptrap OID=".1.3.6.1.4.1.3375.2.4.0.11"
    }

  • Nodes

    alert BIGIP_MCPD_MCPDERR_NODE_ADDRESS_MON_STATUS {
    snmptrap OID=".1.3.6.1.4.1.3375.2.4.0.12"
    }
    alert BIGIP_MCPD_MCPDERR_NODE_ADDRESS_MON_STATUS_UP {
    snmptrap OID=".1.3.6.1.4.1.3375.2.4.0.13"
    }
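These alert definitions are part of the alertd configuration. Assuming the default configuration file location of /etc/alertd/alert.conf, you can confirm that the definitions are present with a command similar to the following:

# grep -A 1 'MON_STATUS' /etc/alertd/alert.conf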

Verifying monitor settings

It is important to verify that monitor settings are properly defined for your environment. For example, F5 recommends that you configure most monitors with a timeout value of three times the interval value, plus one second. This prevents the monitor from marking the resource down before the last check within the timeout window has been sent and allowed time to respond.
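For example, a monitor configured with a 10-second interval should use a timeout of (3 x 10) + 1 = 31 seconds.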

Simple monitors

A simple monitor is used to verify the status of the destination node (or the path to the node through a transparent device). The BIG-IP system provides the following pre-configured simple monitor types: gateway_icmp, icmp, tcp_echo, and tcp_half_open. If you determine that a simple monitor is marking a node down, you can verify the following settings:

Note: There are other monitor settings that can be defined for simple monitors. For more information, refer to the Configuration Guide for BIG-IP Local Traffic Management.

  • Interval/timeout ratio

Configuring an appropriate interval/timeout ratio is important for simple monitors. In most cases, the timeout value should be three times the interval value, plus one second; for example, the default ratio is 5/16 (a 5-second interval with a 16-second timeout). Verify that the ratio is properly defined.

  • Transparent

    A transparent monitor uses a path through the associated node to monitor the aliased destination. Verify that the destination target device is reachable and configured properly for the monitor.

Extended Content Verification (ECV) monitors

ECV monitors use Send and Receive string settings to retrieve content from pool members or nodes. The BIG-IP system provides the following pre-configured ECV monitor types: tcp, http, https, and https_443. If you determine that an ECV monitor is marking a pool member or node down, you can verify the following settings:

Note: There are other monitor settings that can be defined for ECV monitors. For more information, refer to the Configuration Guide for BIG-IP Local Traffic Management.

  • Interval/timeout ratio

As with simple monitors, configuring an appropriate interval/timeout ratio is important for ECV monitors. In most cases, the timeout value should be three times the interval value, plus one second; for example, the default ratio for ECV monitors is 5/16. Verify that the ratio is properly defined.

  • Send string

    The Send string is a text string that the monitor sends to the pool member. The default setting is GET /, which retrieves a default HTML file for a website. If the Send string is not properly constructed, the server may send an unexpected response and be subsequently marked down by the monitor. For example, if the server requires the monitor request to be HTTP/1.1 compliant, you will need to adjust the monitor Send string.
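    For example, an HTTP/1.1-compliant Send string must include a Host header; a sketch (using www.yoursite.com as a placeholder host name) appears similar to the following:

    GET / HTTP/1.1\r\nHost: www.yoursite.com\r\nConnection: Close\r\n\r\n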

    Note: For information about modifying HTTP requests for use with HTTP or HTTPS application health monitors, refer to the following articles:

    SOL2167: Constructing HTTP requests for use with the HTTP or HTTPS application health monitor
    SOL3224: HTTP health checks may fail even though the node is responding correctly
    SOL10655: Change in Behavior: CR/LF characters appended to the HTTP monitor Send string

  • Receive string

    The Receive string is the regular expression representing the text string that the monitor looks for in the returned resource. ECV monitor requests may fail, and the monitor may mark the pool member down, if the Receive string is not configured properly. For example, if the Receive string appears too late in the server response, or the server responds with a redirect, the monitor marks the pool member down.
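    For example, a Receive string similar to the following (a sketch that assumes the server returns an HTTP 200 status when healthy) matches the status line early in a successful response:

    200 OK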

    Note: For information about modifying the monitor to issue a request to a redirection target, refer to SOL3224: HTTP health checks may fail even though the node is responding correctly.

  • User name and password

    ECV monitors have User Name and Password settings, which can be used for resources that require authentication. Verify whether the pool member requires authentication and ensure that the fields contain valid credentials.

Troubleshooting monitor types

Simple monitors

Troubleshooting connectivity issues for a simple monitor is fairly straightforward. If you determine that a monitor is marking a node down (or the node is bouncing), you can use the following steps to troubleshoot the issue:

  1. Determine the IP addresses of the nodes being marked down.

    You can determine the IP addresses of the nodes that the monitor is marking down by using the Configuration utility, command line utilities, or log files. You can quickly search the /var/log/ltm file for node status messages using command syntax that appears similar to the following example:

    # cat /var/log/ltm |grep 'Node' |grep 'status'
    Jan 21 15:04:34 local/3400a notice mcpd[2964]: 01070640:5: Node 10.10.65.1 monitor status down.
    Jan 21 15:04:34 local/3400a notice mcpd[2964]: 01070640:5: Node 172.24.64.4 monitor status down.
    Jan 21 15:04:51 local/3400a notice mcpd[2964]: 01070640:5: Node 10.1.0.200 monitor status down.
    Jan 21 15:04:51 local/3400a notice mcpd[2964]: 01070640:5: Node 10.10.65.122 monitor status down.
    Jan 21 15:04:51 local/3400a notice mcpd[2964]: 01070640:5: Node 10.1.0.100 monitor status unchecked.
    Jan 21 15:04:51 local/3400a notice mcpd[2964]: 01070640:5: Node 11.1.1.1 monitor status down.
    Jan 21 15:04:51 local/3400a notice mcpd[2964]: 01070640:5: Node 172.16.65.3 monitor status down.
    Jan 21 15:04:51 local/3400a notice mcpd[2964]: 01070640:5: Node 172.16.65.229 monitor status down.

    Note: If a large number of nodes are being marked down (or bouncing), you can sort the results by IP addresses.

    For example:

    cat /var/log/ltm |grep 'Node' |grep 'status' | sort -t . -k 3,3n -k 4,4n

  2. Check connectivity to the node.

    If node addresses are being marked down and not back up, or nodes are bouncing, check the connectivity to the nodes from the BIG-IP system using commands such as ping, traceroute (BIG-IP 10.x and 11.x), or tracepath (BIG-IP 9.x). For example, if you have determined that a simple monitor is marking the node address 10.10.65.1 down, you can attempt to ping the resource from the BIG-IP system as follows:

    # ping -c 4 10.10.65.1
    PING 10.10.65.1 (10.10.65.1) 56(84) bytes of data.
    64 bytes from 10.10.65.1: icmp_seq=1 ttl=64 time=11.32 ms
    64 bytes from 10.10.65.1: icmp_seq=2 ttl=64 time=8.989 ms
    64 bytes from 10.10.65.1: icmp_seq=3 ttl=64 time=10.981 ms
    64 bytes from 10.10.65.1: icmp_seq=4 ttl=64 time=9.985 ms

    Note: The previous ping output shows high round trip times, which may indicate a network issue or a slow responding node.

    In addition, make sure that the node is configured to respond to the simple monitor. For example, tcp_echo is a simple monitor type that requires the TCP echo service to be enabled on the nodes being monitored; the BIG-IP system sends a SYN segment containing information to be echoed back by the receiving device.
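    To verify that the echo service is reachable from the BIG-IP system, you can attempt a connection from the command line (a sketch that assumes the service is listening on its standard port, 7):

    # telnet 10.10.65.1 7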

  3. Check the monitor settings.

    Use the Configuration utility or command line utilities to verify that the monitor settings (such as the interval / timeout ratio) are appropriate for the node.

    For example, the following bigpipe command lists the configuration for the icmp_new monitor:

    bigpipe monitor icmp_new list

    The following tmsh command lists the configuration for the icmp_new monitor:

    tmsh list /ltm monitor icmp_new

  4. Create a custom monitor (if needed).

    If you are using a default monitor and have determined that the settings are not appropriate for your environment, consider creating and testing a new monitor with custom settings.
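    For example, the following tmsh command (BIG-IP 11.x syntax; the monitor name icmp_custom is a hypothetical example) sketches the creation of a custom ICMP monitor with a 10-second interval and the matching 31-second timeout:

    tmsh create /ltm monitor icmp icmp_custom defaults-from icmp interval 10 timeout 31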

  5. Use the tcpdump command to capture monitor traffic.

    If you are unable to determine the cause of a failing health monitor, it may be necessary to perform packet captures on the BIG-IP system.

    Note: For more information about running tcpdump, refer to SOL411: Overview of packet tracing with the tcpdump utility.
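    For example, to capture the ICMP monitor traffic between the BIG-IP system and the node from the earlier example, you can run a command similar to the following (the VLAN name internal is a hypothetical example):

    # tcpdump -ni internal host 10.10.65.1 and icmp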

ECV monitors

Troubleshooting issues for ECV monitors involves several steps. If you determine that an ECV monitor is marking a pool member down (or the pool member is bouncing), you can use the following steps to troubleshoot the issue:

  1. Determine the IP addresses of the pool members that the monitor is marking down by using the Configuration utility, command line utilities, or log files.

    For example, you can search a rotated, compressed copy of the /var/log/ltm file for pool member status messages using the zcat command as follows:

    # zcat /var/log/ltm.2.gz |grep -i 'pool member' |grep 'status'
    Jan 21 15:04:34 local/3400a notice mcpd[2964]: 01070638:5: Pool member 10.10.65.1:21 monitor status node down.
    Jan 21 15:04:34 local/3400a notice mcpd[2964]: 01070638:5: Pool member 10.10.65.1:80 monitor status node down.
    Jan 21 15:04:34 local/3400a notice mcpd[2964]: 01070638:5: Pool member 10.10.65.1:80 monitor status node down.
    Jan 21 15:04:34 local/3400a notice mcpd[2964]: 01070638:5: Pool member 10.10.65.1:80 monitor status node down.
    Jan 21 15:04:51 local/3400a notice mcpd[2964]: 01070638:5: Pool member 172.16.65.3:80 monitor status node down.
    Jan 21 15:05:05 local/3400a notice mcpd[2964]: 01070638:5: Pool member 172.16.65.3:80 monitor status unchecked.

  2. Check connectivity to the pool member.

    As previously stated, check the connectivity to the pool members from the BIG-IP system using the ping or traceroute commands.

  3. Check the ECV monitor settings.

    Use the Configuration utility or command line utilities to verify that the monitor settings (such as the interval / timeout ratio) are appropriate for the pool members.

    For example, the following bigpipe command lists the configuration for the http_new monitor:

    bigpipe monitor http_new list

    The following tmsh command lists the configuration for the http_new monitor:

    tmsh list /ltm monitor http_new

  4. Create a custom monitor (if needed).

    If you are using a default monitor and have determined that the settings are not appropriate for your environment, consider creating and testing a new monitor with custom settings.
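    For example, the following tmsh command (BIG-IP 11.x syntax; the monitor name http_custom and the Receive string value are hypothetical examples) sketches a custom HTTP monitor that uses an HTTP/1.1-compliant Send string:

    tmsh create /ltm monitor http http_custom defaults-from http send "GET / HTTP/1.1\r\nHost: www.yoursite.com\r\nConnection: Close\r\n\r\n" recv "200 OK"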

  5. Test the response from the application.

    Use a command line utility on the BIG-IP system to test the response from the web application. For example, the following command uses the curl (and time) command and attempts to transfer data from the web server while timing the response:

    # time curl http://10.10.65.1
    <html>
    <head>
    ---
    </body>
    </html>
    real 0m18.032s
    user 0m0.030s
    sys 0m0.060s

    Note: In the previous example, the server took approximately 18 seconds to return the page. A response that slow exceeds the default 16-second monitor timeout, so a monitor using the default interval/timeout ratio would mark the pool member down.

    Note: If you want to test a specific HTTP request, including HTTP headers, you can use the telnet command to connect to the pool member.

    For example:

    telnet <serverIP> <serverPort>

    Next, at the prompt, enter an appropriate HTTP request line and HTTP headers, pressing Enter once after each line.

    For example:

    GET / HTTP/1.1 <enter>
    Host: www.yoursite.com <enter>
    Connection: close <enter>
    <enter>

  6. Use the tcpdump command to capture monitor traffic.

    Note: For more information about running tcpdump, refer to SOL411: Overview of packet tracing with the tcpdump utility.
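    For example, to capture the HTTP monitor traffic for the pool member 172.16.65.3:80 from the earlier example, you can run a command similar to the following (the VLAN name internal is a hypothetical example):

    # tcpdump -ni internal host 172.16.65.3 and port 80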

Troubleshooting daemons related to health monitoring

The bigd process manages health checking for pool members, nodes, and services on the BIG-IP LTM system. The bigd process collects health checking status and communicates the status information to the mcpd process, which stores the data in shared memory so that the TMM can read it. If you are having monitoring issues, you can check the memory utilization of the bigd process. If the %MEM is unusually high, or continually increases, the process may be leaking memory.

For example, to check the current memory utilization of bigd, use the ps command:

# ps aux |grep bigd

USER  PID   %CPU  %MEM  VSZ    RSS    TTY  STAT  START  TIME  COMMAND
root  3020  0.0   0.6   28208  10488  ?    S     2010   5:08  /usr/bin/bigd

Note: If the bigd process fails, the health check status of pool members, nodes, and services remains in its current state until the bigd process is restarted. For more information, refer to SOL6967: When the BIG-IP LTM bigd daemon fails, the health check status of pool members, nodes, and services remain unchanged until the bigd daemon is restarted.
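For example, you can restart the bigd process by using the bigstart utility:

# bigstart restart bigd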

In addition, it is possible to run the bigd process in debug mode. Debug logging for the bigd process is extremely verbose, as it logs multiple messages for every monitor attempt. For information about running bigd in debug mode, contact F5 Technical Support.
