A while back I was struggling with passive service check latency in a distributed Icinga environment. Latency would normally hover around 1 second, which was more than acceptable in an environment with over 1,000 hosts. Every once in a while, though, it would jump to anywhere from 60 to roughly 120 seconds and stay there until the Icinga daemon was restarted.
After a bit of troubleshooting (or rather, lots!), I noticed that the problem correlated with someone running ‘service icinga reload’ on the central box, which receives checks from other Icinga instances via the nsca-ng daemon. After reviewing the Icinga source code (open source whoo!), I found the explanation.
An Icinga service reload is similar to a restart in that config files are re-read during either operation. The difference is that a reload doesn’t stop and start the Icinga daemon and, more importantly, it doesn’t flush and close the named command pipe that nsca-ng writes external commands (e.g. service check results) to.
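If you haven't poked at the command pipe before, here's a rough Python sketch of what submitting a passive result through it looks like. It uses the standard external command line format (`[timestamp] PROCESS_SERVICE_CHECK_RESULT;host;service;return_code;output`). The function names, the host and service names, and the pipe path in the usage note are my own placeholders for illustration, not anything taken from nsca-ng or from our setup.

```python
import time


def format_passive_result(host, service, return_code, output, timestamp=None):
    """Build one PROCESS_SERVICE_CHECK_RESULT external command line.

    return_code follows the usual plugin convention:
    0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.
    """
    if timestamp is None:
        timestamp = int(time.time())
    return (
        f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
        f"{host};{service};{return_code};{output}\n"
    )


def submit(command_file, line):
    """Write one command line to the named pipe (hypothetical helper).

    Opening a FIFO for writing blocks until a reader has it open,
    so this only returns promptly while the daemon is running.
    """
    with open(command_file, "w") as pipe:
        pipe.write(line)
```

Usage would be something like `submit("/var/lib/icinga/rw/icinga.cmd", format_passive_result("web01", "HTTP", 0, "OK - responded in 0.2s"))`. The pipe location is an assumption on my part; it is whatever `command_file` points to in your icinga.cfg.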
Getting into the habit of running Icinga reloads rather than restarts was my fault, really. I was under the impression that a reload was strictly better than a restart, since you don’t lose a few seconds of monitoring coverage while the daemon restarts. After finding the above problem, though, we switched to restarts, and latencies are now always super-low.
I’m still very happy with our Icinga implementation, and looking forward to what Icinga 2.0 will bring. Apparently they have re-architected the product with a proper multithreaded design (compared to the forking model of 1.x), so no more restarts to change configuration. Plus, the Icinga team promises 1 million active checks per minute. Wow. Great work guys. =D