Nagios has the very useful feature of "parent hosts". If it deems a host A to be down, it first checks A's parent host B and reports A as down only if B is up. This check recurses up the chain until a host in state "up" is found, so only the first "down" host on the path is actually reported. This keeps on-call people from being bombarded with alerts during major network outages and makes sure that the alerts that are sent describe the actual outage reasonably accurately.
As an individual who has some "external" servers in various data centers on the Internet, I would like to avoid being alerted multiple times that my servers at ISPs C, D, and E are down when the real outage is at ISP F, which hosts my Nagios installation, or at one of the various exchange points in between. Such an outage makes the servers temporarily unreachable, and there is nothing I can do about it.
The solution sounds easy but is surprisingly hard.
How one could solve the issue
For each host in the manual Nagios configuration, do a traceroute and generate host stanzas for each of the hosts found:

    $ sudo traceroute -I icmp -n -A 184.108.40.206
    traceroute to 220.127.116.11 (18.104.22.168), 64 hops max, 28 byte packets
     1  <snip>
     2  22.214.171.124 [AS3320]  1 ms  0 ms  0 ms
     3  126.96.36.199 [AS3320]  1 ms  2 ms  2 ms
     4  188.8.131.52 [AS3320]  51 ms 184.108.40.206 [AS3320]  3 ms 220.127.116.11 [AS3320]  60 ms
     5  18.104.22.168 [AS3320]  3 ms  3 ms  3 ms
     6  22.214.171.124 [AS24940]  6 ms  7 ms (TOS=238!)  8 ms
    $

This sounds easy and can surely be accomplished in a few hundred lines of Perl. However, obtaining the list of configured hosts directly from Nagios means parsing the Nagios configuration. Unfortunately, the syntax of the Nagios configuration is "historically grown", which makes it quite illogical and awfully hard to parse. A parser would easily double the size of the traceroute-based configuration generator.
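To make the idea concrete, here is a minimal sketch of the traceroute-to-stanza step, written in Python rather than Perl. It assumes numeric output (`traceroute -n`) in the format shown in this post. The function names, the `auto-hop` naming scheme, and the `generic-host` template are illustrative assumptions, not anything Nagios prescribes. It sidesteps the config-parsing problem entirely and only covers the generation side.

```python
import re

# A hop line starts with the hop number; everything else is the payload.
HOP_RE = re.compile(r"^\s*(\d+)\s+(.*)$")
# Dotted-quad IPv4 address (no range validation; good enough for a sketch).
IP_RE = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")


def parse_hops(traceroute_output):
    """Return [(hop_number, [ip, ...]), ...] from `traceroute -n` output."""
    hops = []
    for line in traceroute_output.splitlines():
        match = HOP_RE.match(line)
        if not match:
            continue  # header line, shell prompt, or blank line
        ips = []
        for ip in IP_RE.findall(match.group(2)):
            if ip not in ips:  # keep order, drop repeats within one hop
                ips.append(ip)
        if ips:  # hops showing only "*" or "<snip>" are skipped
            hops.append((int(match.group(1)), ips))
    return hops


def host_stanzas(hops, prefix="auto-hop", template="generic-host"):
    """Render one `define host` stanza per hop, chained via `parents`.

    Only the first address of each hop is used. `prefix` and `template`
    are assumptions for illustration; adjust them to the local setup.
    """
    stanzas = []
    parent = None  # the first hop gets no parent in this sketch
    for number, ips in hops:
        name = f"{prefix}-{number}"
        lines = [
            "define host {",
            f"    use        {template}",
            f"    host_name  {name}",
            f"    address    {ips[0]}",
        ]
        if parent is not None:
            lines.append(f"    parents    {parent}")
        lines.append("}")
        stanzas.append("\n".join(lines) + "\n")
        parent = name
    return stanzas
```

Feeding the captured traceroute output to `parse_hops` and the result to `host_stanzas` yields text that could be written to a file of its own. The monitored server itself would then name the last generated hop in its `parents` directive.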
Additionally, the (real) traceroute given above shows three other issues that might arise from such a setup.
Network paths are not as static as a traceroute suggests. On the Internet, ISPs use BGP to interconnect their networks and to ensure connectivity. When the network is reconfigured (because of intended technical or policy changes, or because of actual failures), the traceroute to a given host may change. Considerable experience is needed to judge whether a change in the traceroute is a short-term effect that happens to be visible at the moment, or a long-term change that is meant to stay. Of course, only long-term changes should be reflected in the Nagios configuration. The possibility of changes makes it necessary to distinguish between "manually" and "automatically" generated Nagios host entries, and to have a mechanism to quickly regenerate the "automatic" entries according to the current situation on the network.
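One crude way to approximate the "short-term or long-term" judgment is to collect several traceroutes over time and only regenerate when a different path clearly dominates. The sketch below is an assumption about how that could look, not an established method. The function names and the 80% threshold are made up for illustration, and a path is simply a list of hop addresses.

```python
from collections import Counter


def stable_path(samples, min_share=0.8):
    """Return the path seen in at least `min_share` of the samples, else None.

    `samples` is a list of paths; each path is a list of hop addresses.
    """
    if not samples:
        return None
    counts = Counter(tuple(path) for path in samples)
    path, seen = counts.most_common(1)[0]
    if seen / len(samples) >= min_share:
        return list(path)
    return None  # no path dominates; the route is probably in flux


def needs_regeneration(current_path, samples, min_share=0.8):
    """True only if a path different from the configured one has become stable."""
    candidate = stable_path(samples, min_share)
    return candidate is not None and candidate != list(current_path)
```

If the generated stanzas live in a file of their own, pulled in through a `cfg_file` or `cfg_dir` line in `nagios.cfg`, regenerating them never touches the manually maintained entries.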
Alternative Network Paths
If you look at hop 4 of the traceroute given above, you see more than one IP address. This is a common "one packet left, one packet right" setup where two identical lines share the load, and one sees different interfaces of the next-hop router. Which IP address should one put into the parent host declaration then? Later Nagios versions are supposedly able to handle multiple parents for a host, but the docs of course don't cover this complex configuration, and I haven't yet found the time to find out how this mechanism works and whether it can be used to represent a network layout like this in Nagios.
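For what it is worth, the `parents` directive in the Nagios object definitions is documented as a comma-separated list of host names. Whether listing several parents models a load-sharing hop correctly is exactly the open question here, so the following is only a sketch under that assumption. It treats every address seen at a hop as a host of its own, which is a modeling choice, since the addresses may well belong to the same router. The function name, the naming scheme, and the `generic-host` template are hypothetical.

```python
def multi_parent_stanzas(hops, prefix="auto-hop", template="generic-host"):
    """Render one host per address at each hop.

    `hops` is a list of (hop_number, [ip, ...]) tuples. Each generated host
    lists every host of the previous hop in its `parents` directive.
    """
    stanzas = []
    previous = []  # host names generated for the previous hop
    for number, ips in hops:
        names = []
        for index, ip in enumerate(ips):
            name = f"{prefix}-{number}-{index}"  # hypothetical naming scheme
            lines = [
                "define host {",
                f"    use        {template}",
                f"    host_name  {name}",
                f"    address    {ip}",
            ]
            if previous:
                lines.append(f"    parents    {','.join(previous)}")
            lines.append("}")
            stanzas.append("\n".join(lines) + "\n")
            names.append(name)
        previous = names
    return stanzas
```

With this layout, the host after a load-sharing hop names both interface hosts as parents, so, if the multiple-parent logic works as one would hope, it should only be considered cut off when neither of them answers.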
For some network operators, ICMP echo request packets (ping) pose a significant load on their network equipment. To keep that load down, they have disabled ICMP echo replies on these machines. Such a router still shows up in a traceroute (which works by sending packets with short TTL values and letting them expire in transit, so that an ICMP TTL exceeded message is sent back), but it cannot be pinged.
Hop number 5 in the traceroute above is such a box: it shows up fine in the traceroute, but a ping check against it will always return "DOWN". Currently, I simply omit such hosts from the Nagios configuration, but this may result in extra and inaccurate alerts once the unpingable host itself fails. Another possibility would be to check the host's availability by sending a packet with an appropriate TTL either to the host itself or to a host that is "behind" it. The latter introduces even more complexity, since one needs to find a host "behind" the unpingable box. It may be necessary, though: in the case of hop number 5 above, the IP address of the hop does not even seem to be in AS3320's internal routing, and traces to the hop itself stop well before the host's expected position, regardless of where one traces from.
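The TTL-based check could be approximated without writing raw-socket code by asking traceroute for exactly one hop on the way to a host "behind" the router. The sketch below assumes a traceroute that understands `-f` (first TTL), `-m` (maximum TTL), `-q` (probes per hop), `-n`, and `-I`. Many Linux and BSD versions do, but the flags vary, so check the local man page. The function names are made up for illustration.

```python
import re
import subprocess


def probe_command(target, ttl, queries=3):
    """Build a traceroute call that probes only the hop at distance `ttl`."""
    return ["traceroute", "-n", "-I", "-q", str(queries),
            "-f", str(ttl), "-m", str(ttl), target]


def hop_answers(output, ttl, expected_ips):
    """True if the output line for `ttl` shows one of the expected addresses."""
    for line in output.splitlines():
        match = re.match(r"^\s*(\d+)\s+(.*)$", line)
        if match and int(match.group(1)) == ttl:
            found = re.findall(r"\b\d{1,3}(?:\.\d{1,3}){3}\b", match.group(2))
            return any(ip in expected_ips for ip in found)
    return False  # no line for that hop at all


def check_hop(target, ttl, expected_ips, timeout=30):
    """Run the probe; ICMP mode (-I) needs elevated privileges on many systems."""
    result = subprocess.run(probe_command(target, ttl),
                            capture_output=True, text=True, timeout=timeout)
    return hop_answers(result.stdout, ttl, expected_ips)
```

Wrapped in a small script that exits with 0 when the expected router answers and 2 when it does not, this could serve as the host check command for such a box in place of the usual ping check.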
This is a surprisingly hard issue if one wants to do network monitoring while generating accurate alerts. Surely, it's a topic that needs to be addressed in network monitoring, so I am quite interested in how other people tackle this. Please comment! Thanks in advance.