Nagios, Parent Hosts, and traceroute on the Internet

Nagios has the very useful feature of "parent hosts". If it deems a host A to be down, it first checks A's parent host, B, and reports A as down only if B is up. This check recurses up the chain of parents until a host in state "up" is found, and only the first "down" host is actually reported. This keeps on-call people from being bombarded with alerts during major network outages and makes sure that the alerts that are sent out describe the actual outage reasonably accurately.
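
The parent logic described above can be sketched in a few lines of Python. This is a simplified model, not Nagios code: the `classify` function, the host names, and the assumption that the parent graph contains no cycles are all made up for illustration. A host that fails its check counts as DOWN if it has no parent or at least one parent is UP, and as UNREACHABLE otherwise.

```python
from typing import Dict, List, Set


def classify(parents: Dict[str, List[str]], failing: Set[str]) -> Dict[str, str]:
    """Return UP, DOWN or UNREACHABLE for every host in `parents`.

    `parents` maps a host name to its parent host names (hypothetical data),
    `failing` is the set of hosts whose own check failed.
    The parent graph is assumed to be free of cycles.
    """
    state: Dict[str, str] = {}

    def resolve(host: str) -> str:
        if host in state:
            return state[host]
        host_parents = parents.get(host, [])
        if host not in failing:
            state[host] = "UP"
        elif not host_parents:
            # No parent: the monitoring server talks to this host directly.
            state[host] = "DOWN"
        elif any(resolve(p) == "UP" for p in host_parents):
            # A parent still works, so this host itself is the problem.
            state[host] = "DOWN"
        else:
            # Every parent is broken, so nothing can be said about this host.
            state[host] = "UNREACHABLE"
        return state[host]

    for host in parents:
        resolve(host)
    return state


topology = {"router-b": [], "host-a": ["router-b"], "host-c": ["router-b"]}
result = classify(topology, failing={"router-b", "host-a", "host-c"})
print(result)
```

With all three hosts failing, only `router-b` ends up DOWN; `host-a` and `host-c` are UNREACHABLE, so a single alert would describe the outage.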

As an individual who has some "external" servers in various data centers on the Internet, I would like to not be alerted multiple times that my servers at ISP C, D, and E are down if there is an outage at the ISP F hosting my Nagios installation or at one of the various exchange points temporarily rendering the servers unreachable (without me being able to do anything).

The solution sounds easy but is surprisingly hard.

How one could solve the issue

For each host in the manually maintained Nagios configuration, do a traceroute and generate host stanzas for each of the hops found:

$ sudo traceroute -I icmp -n -A 213.239.240.200
traceroute to 213.239.240.200 (213.239.240.200), 64 hops max, 28 byte packets
 1  <snip>
 2  217.243.221.150 [AS3320]  1 ms  0 ms  0 ms
 3  62.154.10.106 [AS3320]  1 ms  2 ms  2 ms
 4  217.239.40.226 [AS3320]  51 ms 217.239.40.234 [AS3320]  3 ms 217.239.40.226 [AS3320]  60 ms
 5  193.159.226.2 [AS3320]  3 ms  3 ms  3 ms
 6  213.239.240.200 [AS24940]  6 ms  7 ms (TOS=238!)  8 ms
$
This sounds easy and can surely be accomplished in a few hundred lines of Perl. However, obtaining the list of configured hosts directly from Nagios means parsing the Nagios configuration. Unfortunately, the syntax of the Nagios configuration is "historically grown", which makes it quite illogical and awfully hard to parse. Parsing it would easily double the size of the traceroute-based configuration generator.
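
The traceroute half of such a generator can be sketched as follows, in Python rather than Perl. It parses output in the format shown above and emits one host definition per intermediate hop, each pointing at the previous hop as its parent. The `hop-<address>` naming scheme and the `generic-host` template are assumptions for illustration, and where a hop shows several addresses the sketch simply takes the first one. The last hop is the target itself, which already lives in the manual configuration, so no stanza is generated for it.

```python
import re
from typing import List

IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
HOP = re.compile(r"^\s*(\d+)\s+(.*)$")

SAMPLE = """\
traceroute to 213.239.240.200 (213.239.240.200), 64 hops max, 28 byte packets
 1  <snip>
 2  217.243.221.150 [AS3320]  1 ms  0 ms  0 ms
 3  62.154.10.106 [AS3320]  1 ms  2 ms  2 ms
 4  217.239.40.226 [AS3320]  51 ms 217.239.40.234 [AS3320]  3 ms 217.239.40.226 [AS3320]  60 ms
 5  193.159.226.2 [AS3320]  3 ms  3 ms  3 ms
 6  213.239.240.200 [AS24940]  6 ms  7 ms (TOS=238!)  8 ms
"""


def parse_hops(output: str) -> List[List[str]]:
    """Return, per hop line, the distinct IPv4 addresses in order of appearance."""
    hops = []
    for line in output.splitlines():
        match = HOP.match(line)
        if not match:
            continue  # header line or anything else that is not a hop
        addresses = IPV4.findall(match.group(2))
        hops.append(list(dict.fromkeys(addresses)))  # de-duplicate, keep order
    return hops


def host_stanzas(hops: List[List[str]]) -> List[str]:
    """Build Nagios host definitions for all hops except the final target."""
    stanzas = []
    parent = None
    for addresses in hops[:-1]:
        if not addresses:
            continue  # hop without a usable address
        address = addresses[0]  # simplification: ignore alternative paths
        name = f"hop-{address}"  # hypothetical naming scheme
        lines = [
            "define host {",
            "    use        generic-host",  # assumed template name
            f"    host_name  {name}",
            f"    address    {address}",
        ]
        if parent:
            lines.append(f"    parents    {parent}")
        lines.append("}")
        stanzas.append("\n".join(lines))
        parent = name
    return stanzas


hops = parse_hops(SAMPLE)
stanzas = host_stanzas(hops)
print("\n\n".join(stanzas))
```

The name of the last generated hop would then go into the `parents` directive of the manually configured target host.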

Additionally, the (real) traceroute given above shows three other issues that might arise from such a setup.

Dynamic Routing

Network paths are not as static as a traceroute suggests. On the Internet, the ISPs use BGP to connect their networks together and to ensure connectivity. When the network is reconfigured (because of intended technical or policy changes, or because of actual failures), the traceroute to a given host may change. Considerable experience is needed to judge whether a change in the traceroute is a short-term one that happens to be visible at the moment, or a long-term one that is meant to stay. Of course, only long-term changes should be reflected in the Nagios configuration. The possibility of changes makes it necessary to distinguish between "manually" and "automatically" generated Nagios host entries and to have a mechanism to quickly regenerate the "automatic" entries according to the current situation found on the network.
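
One crude way to separate short-term from long-term changes is to regenerate the automatic entries only after the same new path has been seen several times in a row. The sketch below assumes exactly that heuristic; the `PathTracker` class, the threshold of three observations, and the example addresses are invented for illustration and are no substitute for the experience mentioned above.

```python
from collections import deque
from typing import Deque, Optional, Tuple

Path = Tuple[str, ...]


class PathTracker:
    """Accept a traceroute path only after `required` identical observations."""

    def __init__(self, required: int = 3) -> None:
        self.required = required
        self.recent: Deque[Path] = deque(maxlen=required)
        self.accepted: Optional[Path] = None

    def observe(self, path: Path) -> bool:
        """Record one observation; return True if the accepted path changed."""
        self.recent.append(path)
        stable = len(self.recent) == self.required and len(set(self.recent)) == 1
        if stable and path != self.accepted:
            self.accepted = path
            return True  # time to regenerate the automatic host entries
        return False


tracker = PathTracker(required=3)
old_path = ("192.0.2.1", "192.0.2.2")
new_path = ("192.0.2.1", "192.0.2.9")

# Three identical observations establish the path; a single deviating
# observation afterwards does not change it.
events = [tracker.observe(p) for p in (old_path, old_path, old_path, new_path, old_path)]
print(events, tracker.accepted)
```

Run from cron, each traceroute result would be fed to `observe`, and the generator would only be triggered when it returns `True`.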

Alternative Network Paths

If you look at hop 4 of the traceroute given above, you see two IP addresses. This is a common "one packet left, one packet right" setup where two identical lines share the load, and the traceroute sees different interfaces of the next-hop router. Which IP address should one put into the parent host declaration then? Later Nagios versions are supposedly able to handle multiple parents for a host, but the docs of course don't cover this complex configuration, and I have not yet found the time to find out how this mechanism works and whether it can be used to represent a network layout like that in Nagios.
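
If multiple parents do work (a reader comment on this post reports that nagios-3.1.0 accepts a comma-separated list without spaces in `parents`), one possible representation is to treat every address seen at a hop as a separate host and to make all of them parents of the next hop. The sketch below only builds that mapping; whether two addresses really are two independent paths is an assumption, and the `hop-<address>` names are hypothetical.

```python
from typing import Dict, List


def link_hops(hops: List[List[str]]) -> Dict[str, List[str]]:
    """Map each hop host name to the names of all hosts seen at the previous hop."""
    parents: Dict[str, List[str]] = {}
    previous: List[str] = []
    for addresses in hops:
        if not addresses:
            continue  # hop without a usable address
        names = [f"hop-{address}" for address in addresses]
        for name in names:
            parents[name] = list(previous)
        previous = names
    return parents


def parents_directive(parent_names: List[str]) -> str:
    """Format the directive with commas and no spaces between the names."""
    return "parents    " + ",".join(parent_names)


# Hops 3 to 5 of the traceroute above.
hops = [["62.154.10.106"], ["217.239.40.226", "217.239.40.234"], ["193.159.226.2"]]
links = link_hops(hops)
print(parents_directive(links["hop-193.159.226.2"]))
```

For hop 5 this yields a `parents` line naming both addresses of hop 4, while each of those two points back at hop 3.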

Unpingable Routers

For some network operators, ICMP echo request packets (ping) pose a significant load on their network equipment. To keep that load down, they have disabled ICMP echo replies on these machines. Such a router still shows up in a traceroute (which works by sending packets with short TTL values and letting them expire in transit, so that an ICMP TTL exceeded message is sent back), but it cannot be pinged.

Hop number 5 in the traceroute above is such a box: it shows up fine in the traceroute, but pinging it will always return "DOWN". Currently, I simply omit such hosts from the Nagios configuration, but this may result in extra and inaccurate alerts once the unpingable host itself fails. A different possibility would be to check the host's availability by sending an appropriately TTLed packet either to the host itself or to a host that is "behind" that host. The latter introduces even more complexity since one needs to find a host "behind" the unpingable box, but this may be necessary: in the case of hop number 5 above, the IP address of the hop does not even seem to be in AS3320's internal routing, and tracing to the hop itself stops well before the expected place of the host regardless of where one traces from.
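
The "appropriately TTLed packet" idea could be approximated by running traceroute with the first and maximum TTL set to the same value, so that only the hop in question is probed, and then checking whether the expected address answered. This is a rough sketch under several assumptions: the installed traceroute accepts `-n`, `-q`, `-f` and `-m` (the common Linux and BSD implementations do), the function names are made up, and turning the result into a proper Nagios plugin is left out.

```python
import subprocess
from typing import List


def ttl_probe_command(target: str, ttl: int) -> List[str]:
    """Build a traceroute call that probes exactly one hop on the way to `target`."""
    return ["traceroute", "-n", "-q", "3", "-f", str(ttl), "-m", str(ttl), target]


def hop_answered(output: str, expected: str) -> bool:
    """Return True if a hop line of the traceroute output contains `expected`."""
    for line in output.splitlines():
        fields = line.split()
        # Hop lines start with the hop number; the header line does not.
        if fields and fields[0].isdigit() and expected in fields:
            return True
    return False


def check_hop(target: str, ttl: int, expected: str) -> bool:
    """Probe hop `ttl` towards `target` and report whether `expected` replied."""
    result = subprocess.run(
        ttl_probe_command(target, ttl),
        capture_output=True,
        text=True,
        timeout=30,
    )
    return hop_answered(result.stdout, expected)


if __name__ == "__main__":
    # Hop 5 on the way to the target from the traceroute above.
    print(check_hop("213.239.240.200", 5, "193.159.226.2"))
```

Note that this probes through the router towards a host "behind" it, which is the variant that still works when the router's own address is not routable.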

Conclusion

This is a surprisingly hard issue if one wants to do network monitoring while generating accurate alerts. Surely, it's a topic that needs to be addressed in network monitoring, so I am quite interested in how other people tackle this. Please comment! Thanks in advance.

Comments

Np237 on :

The notion of dependencies in complex networks is absolutely horrible to tackle, yet impossible to omit.

The approach we use is the Vigilo[0] correlator. It is an external daemon to which several Nagios instances forward their host and service alerts. It then is able to find which alerts are actually relevant by knowing the topology of the network (including multiple paths between hosts).

The really troublesome part is how to define this topology. It is tedious work for which there is no automated discovery approach that works reliably. It can become even more complicated if you monitor network devices through administration ports which are different from the ports the actual traffic goes through. Establishing the correlation rules for a given network therefore takes quite some time, and it is only useful for complicated networks.

[0] http://www.projet-vigilo.org/

omar on :

Hello, everybody. This comment is only to help people searching for "multiple parents" hosts in Nagios. Searching through the sources, I have found and verified that in nagios-3.1.0 you can specify multiple parents for a host (using the variable "parents"), separating them with a comma (",") without spaces.

Greetings from Spain,

Omar

Marc 'Zugschlus' Haber on :

Has documentation about this appeared in the meantime? Will an alert be sent for a child host if only one of the parents is down, or is it necessary for both parents to be down?

Kevin sullivan on :

We set Google.com as the parent host -- and then tell it to not alert us if Google.com goes down. That way, at least we won't be alerted in a full network outage -- we still will hear if the wrong peer goes down, though.

Marc 'Zugschlus' Haber on :

I have in the meantime resorted to using an appropriately configured check_multi which will go WARNING if more than n hosts on the Internet become unreachable and CRITICAL if that number gets even higher.
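
The threshold logic itself is simple. The following Python sketch is not check_multi configuration, only an illustration of the idea with the usual Nagios plugin return codes (0 for OK, 1 for WARNING, 2 for CRITICAL); the function name and the threshold values are assumptions.

```python
OK, WARNING, CRITICAL = 0, 1, 2


def aggregate_state(unreachable: int, warn: int, crit: int) -> int:
    """Map the number of unreachable hosts to a Nagios plugin return code."""
    if unreachable > crit:
        return CRITICAL
    if unreachable > warn:
        return WARNING
    return OK


# Hypothetical thresholds: WARNING above 2 unreachable hosts, CRITICAL above 5.
states = [aggregate_state(n, warn=2, crit=5) for n in (0, 2, 3, 5, 6)]
print(states)
```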

I will blog a configuration example in due time.
