Juju and Nagios, sittin' in a tree... (Part 1)

Monitoring. Could it get any more nerdy than monitoring? Well, I think we can make monitoring cool again...

If you're using Juju, Nagios is about to get a lot easier to bring into your environment. Anyone who has ever tried to automate their Nagios configuration knows that it can be daunting. Nagios is so flexible and has so many options that it's hard to get right when doing it by hand. Automating it requires even more thought. Part of this is because monitoring itself is a bit hard to generalize. There are lots of types of monitors. Nagios really focuses on two of these:

Service monitoring - Make a script that pretends to be a user and see if your synthetic monitor sees what you expect.
Resource monitoring - Look at the counters and metrics afforded a user of a normal system.

The trick is that service monitoring wants to interrogate the real services from outside the machine, while resource monitoring wants to see things only visible with privileged access. This is why we have NRPE, or "Nagios Remote Plugin Executor" (and NSCA, and munin, but ignore those for now). NRPE is a little daemon that runs on a server and will run a Nagios plugin script, returning the result when asked by Nagios. With this you get those privileged things like how much RAM and disk space is used.

Normally when you want to use Nagios, you need to sit down and figure out how to tell it to monitor all of your stuff. This involves creating generic objects, figuring out how to get your list of hosts into Nagios's config files, and how to get the classifications for said hosts into Nagios. Does anybody trying to make sure their pager goes off when things are broken actually want to learn Nagios?

So, here's how to get Nagios in your Juju environment. First, let's assume you have deployed a stack of applications.

juju deploy mysql wikidb                # single MySQL db server
juju deploy haproxy wikibalancer        # and single haproxy load balancer
juju deploy -n 5 mediawiki wiki-app     # 5 app-server nodes to handle mediawiki
juju deploy memcached wiki-cache        # memcached
juju add-relation wikidb:db wiki-app:db # use wikidb service as r/w db for app
juju add-relation wiki-app wikibalancer # load balance wiki-app behind haproxy
juju add-relation wiki-cache wiki-app   # use wiki-cache service for wiki-app

This gives one a nice stack of services that is pretty common in most applications today, with a DB and cache for persistent and ephemeral storage and then many app nodes to scale the heavy lifting.

Now you have your app running, but what about when it breaks? How will you find out? Well, this is where Nagios comes in:

juju deploy nagios                          # custom nagios charm
juju add-relation nagios wikidb             # monitor wikidb via nagios
juju add-relation nagios wiki-app           # ""
juju add-relation nagios wikibalancer       # ""

You now should have nagios monitoring things. You can check it out by exposing it and then browsing to the hostname of the nagios instance at 'http://x.x.x.x/nagios3'. You can find out the password for the 'nagiosadmin' user by catting a file that the charm leaves for this purpose:

juju ssh nagios/0 sudo cat /var/lib/juju/nagios.passwd
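
The exposing step mentioned above is not shown as a command, so here is a minimal sketch of it. It assumes the service is named nagios, as in the deploy step. The function names nagios_url and open_nagios are hypothetical helpers made up for illustration, and passing the service name to juju status as a filter is an assumption about your Juju version (plain juju status also works).

```shell
#!/bin/sh
# nagios_url ADDRESS
# Hypothetical helper: builds the web UI URL from the post for a given address.
nagios_url() {
    printf 'http://%s/nagios3\n' "$1"
}

# open_nagios
# Sketch of the "exposing" step; it needs a live Juju environment to run.
open_nagios() {
    juju expose nagios    # allow outside traffic to reach the nagios service
    juju status nagios    # look for the public address of nagios/0 in the output
}
```

Once you have the address from the status output, something like nagios_url 10.0.0.5 (an invented address) prints the URL to browse to.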

Now, the checks are very sparse at the moment. This is because we have used the generic monitoring interface which can just monitor the basic things (SSH, ping, etc). We can add some resource monitoring by deploying NRPE:

juju deploy nrpe                          # create a subordinate NRPE service
juju add-relation nrpe wikibalancer       # Put NRPE on wikibalancer
juju add-relation nrpe wiki-app           # Put NRPE on wiki-app
juju add-relation nrpe:monitors nagios:monitors # Tells Nagios to monitor all NRPEs

Now we will get memory stats, root filesystem, etc.
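
For a feel of what NRPE actually runs on each machine, here is a minimal sketch of a Nagios-style plugin for root filesystem usage. It is not the check the nrpe charm installs. The function names and the default thresholds of 80 and 90 percent are assumptions picked for illustration. What it does follow is the standard plugin convention: print one status line and report the state through the exit code (0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN).

```shell
#!/bin/sh
# Sketch of a Nagios-style plugin: one status line on stdout, state in the
# exit code (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN).

# classify_usage USED WARN CRIT
# Hypothetical helper: maps a usage percentage onto a Nagios state.
classify_usage() {
    used=$1; warn=$2; crit=$3
    case "$used" in
        # Empty or non-numeric input means we could not measure anything.
        ''|*[!0-9]*) echo "UNKNOWN - could not read usage"; return 3 ;;
    esac
    if [ "$used" -ge "$crit" ]; then
        echo "CRITICAL - root filesystem ${used}% used"; return 2
    elif [ "$used" -ge "$warn" ]; then
        echo "WARNING - root filesystem ${used}% used"; return 1
    fi
    echo "OK - root filesystem ${used}% used"; return 0
}

# root_usage_percent
# Reads the Capacity column of POSIX-format df output for / and strips the %.
root_usage_percent() {
    df -P / | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# check_root_fs [WARN] [CRIT]
# Thresholds default to 80 and 90 (assumed values, not the charm's).
check_root_fs() {
    classify_usage "$(root_usage_percent)" "${1:-80}" "${2:-90}"
}
```

To use it as a real plugin script, you would end the file with check_root_fs 80 90 followed by exit $?, so that NRPE can hand both the status line and the exit code back to Nagios.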

You may have noticed we left off wikidb. That is because you will get an ambiguous relation error when you try this:

juju add-relation nrpe wikidb # Put NRPE on wikidb

ERROR Ambiguous relation 'nrpe mysql'; could refer to:
  'nrpe:general-info mysql:juju-info' (juju-info client / juju-info server)
  'nrpe:local-monitors mysql:local-monitors' (local-monitors client / local-monitors server)

This is because mysql has special support to be able to specify its own local monitors in addition to those in the usual basic group (more on this in part 2). To get around this we use:

juju add-relation nrpe:local-monitors wikidb:local-monitors

This is a perfect example of how Juju's encapsulation around services pays off for re-usability. By wrapping a service like Nagios in a charm, we can start to really develop a set of best practices for using that service and collaborate around making it better for everyone.

Of course, Chef and Puppet users can get this done with existing Nagios modules. Puppet, in particular, has really great Nagios support. However, I want to take a step back and explain why I think Juju has a place alongside those methods and will accelerate systems engineering in new directions.

While there is some level of encapsulation in the methods that Chef and Puppet put forth, they're not fully encapsulated in the way that they interact with other components in a Chef or Puppet system. In most cases, you still have to edit your own service configs to add specific Nagios integration. This works for the custom case, but it does not make it easy for users to collaborate on the way to deploy well known systems. It will also be hard to swap out components for new, better methods as they emerge. Every time you mention Nagios in your code, you are pushing Nagios deeper into your system engineering.

With the method I've outlined above, any charmed service can be monitored for basic stats (including the 80 or so that are in the official charm store). You might ask, though: what about custom Nagios plugins, or more elaborate but somewhat generic service checks? That is all coming. I will show some examples in my next post. Later, I will also show how Nagios + NRPE can be replaced with collectd, or some other system, without changing the charms that have implemented rich monitoring support.

So, while this at least starts to bring the official Nagios charm up to par with configuration management's rich Nagios ability, it also sets the stage for replacing Nagios with other things. The key difference here is that, as you'll see in the next few parts, none of the charms will have to mention "Nagios". They'll just describe what things to monitor, and Nagios, collectd, or whatever other system you have in place will find a way to interpret that and monitor it.