
Nagios Monitoring Tool

Version 1.8 has the notes from setting things up "the old way", i.e. with 1.x running on oates. Herewith follows The New Way: nagios 3 on nagios.cscf.

CSCF Research group test installation

Currently the machine watcher202.cscf.uwaterloo.ca is running a nagios monitoring system. The CNAME nagios.cscf.uwaterloo.ca also points at it.

You can look at it here - it'll ask for UWDir authentication.

Installation notes

watcher202 runs FreeBSD, currently 8.1. nagios was installed from ports, following something like the procedure here. Note that FreeBSD is not currently xhiered, but it does obey some xhier rules regarding who can log in from where.

Config files

Config files are stored in /usr/local/etc/nagios. One should, of course, RTFM before trying to configure much.

Hosts and services

All hosts and services are defined under the hosts directory. This directory contains separate sub-directories for the different CSCF groups (research, infrastructure, etc.) and these, in turn, include separate sub-directories for the different machine groups. A group directory contains the group configuration file, and a separate configuration file for each of the machines in the group. Each of the latter contains a host object and, optionally, several service objects.

Example: The database research group directory structure:

/usr/local/etc/nagios
  |- hosts
    |- research
      |- DB
        |- db.cfg           (group configuration file)
        |- nimbus.cs.cfg    (host+services configuration file)
        |- softbase.cs.cfg  (host+services configuration file)
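
The layout above maps a CSCF group, a machine group and a host onto one file path. As a minimal sketch, assuming a hypothetical helper named host_cfg_path (it is not part of the actual setup), the path can be built like this:

```shell
#!/bin/sh
# Hypothetical helper (not part of the CSCF setup): build the path of a
# host's configuration file from the directory layout described above.
NAGIOS_ETC=/usr/local/etc/nagios

host_cfg_path() {
    # $1 = CSCF group (e.g. research)
    # $2 = machine group (e.g. DB)
    # $3 = host file stem (e.g. nimbus.cs)
    printf '%s/hosts/%s/%s/%s.cfg\n' "$NAGIOS_ETC" "$1" "$2" "$3"
}

# Prints /usr/local/etc/nagios/hosts/research/DB/nimbus.cs.cfg
host_cfg_path research DB nimbus.cs
```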

Note: The previous, simple structure with a single hosts.cfg file is now deprecated.

Every host object inherits from the host template defined in misc.cfg. Similarly, every service object inherits from the service template in the same file.

Other major files for CSCF editing purposes are:

- contacts.cfg: list of people who can be contacted, plus how and when to contact them, as well as contact groups
- misc.cfg: command and template definitions

Accessing nagios

become root on cscf.cs
ssh nagios.cscf
set LOGNAME=youruserid for RCS purposes
set USER=youruserid for RCS purposes
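
How the two variables are set depends on root's shell, which this page does not state. A minimal sketch, assuming a Bourne-style shell (sh, bash) and using youruserid as a placeholder:

```shell
# Run after "ssh nagios.cscf", as root.
# Bourne-style shells (sh, bash):
LOGNAME=youruserid; export LOGNAME
USER=youruserid;    export USER

# In csh/tcsh the equivalent would be:
#   setenv LOGNAME youruserid
#   setenv USER youruserid
```

RCS records the value of LOGNAME (falling back to USER) as the author of each check-in, which is why it should be your own userid rather than root.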

Adding a new service on a previously defined host

Note: The previous, simple structure with a single services.cfg file is now deprecated.

See: NagiosMonitoring#Config_files

Here's an example of adding a new service to monitor, assuming that nagios already does something with the machine. (We'll talk about adding a new machine later.)

ssh to nagios.cscf
sudo to root. If you don't have sudo access, talk to DawnKeenan or DaveGawley.
Set your LOGNAME and USER variables for RCS purposes, then change to the directory where config files are stored (/usr/local/etc/nagios)
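Check out and edit the configuration file that holds the host's services. In the current layout this is the host's own file under the hosts directory (see Config files); in the old layout it was services.cfg.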
As a minimum, for most Unix hosts you will want to do an ssh test:

define service{           
 use generic-service     
 host_name softbase.math     
 service_description SSH
 is_volatile 0
 contact_groups cscf-rsg
 check_command check_ssh
}

Note that you must use the host_name, not the alias.
The scripts for the services named in check_command can be found in /usr/local/libexec/nagios.
Test the config file like this: /usr/local/bin/nagios -v /usr/local/etc/nagios/nagios.cfg - it should report zero errors. If it reports any, fix them all.
Once that's satisfied, check your changes in and restart nagios: /usr/local/etc/rc.d/nagios restart.

You'll want to stick around long enough to make sure your changes don't cause problems.

Adding a new host

Note: The previous, simple structure with a single hosts.cfg file is now deprecated.

See: NagiosMonitoring#Config_files

ssh to nagios.cscf
sudo to root.
Make sure you can ping the host you wish to monitor from this machine. If you can't, the service checks will automatically fail because the first check that Nagios does is a check-host-alive. For a workaround, see "Checking services for hosts you can't ping" below.
Set your LOGNAME and USER variables for RCS purposes, then change to the directory where config files are stored (/usr/local/etc/nagios).
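Check out and edit the file that will hold the host definition. In the current layout this is a per-host file in the appropriate group directory under hosts (see Config files); in the old layout it was hosts.cfg.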
- add an entry similar to the following:

define host{
 use generic-host
 host_name zonker
 alias zonker
 address 129.97.74.66
 contact_groups cscf-rsg
}

- "use generic-host" tells nagios to take the default configuration options from the generic-host template (defined in misc.cfg in the current layout; at the top of hosts.cfg in the old one)

Test the config file like this: /usr/local/bin/nagios -v /usr/local/etc/nagios/nagios.cfg - it should report zero errors. If it reports any, fix them all.
Once that's satisfied, check your changes in and restart nagios: /usr/local/etc/rc.d/nagios restart.

Adding a new host group

Note: The previous, simple structure with a single hosts.cfg file is now deprecated.

See: NagiosMonitoring#Config_files

ssh to nagios.cscf
sudo to root.
Set your LOGNAME and USER variables for RCS purposes, then change to the directory where config files are stored (/usr/local/etc/nagios).
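Check out and edit the file that holds the host group definitions. In the old layout this was hosts.cfg; in the current layout each machine group directory under hosts has its own group configuration file (see Config files).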
- add an entry similar to the following:

# CS core servers
define hostgroup{
 hostgroup_name cscore
 alias CS core servers
 contact_groups cscf-csi
 members fe02.math,hopper.math,barbarus.cs,cpu102.cs,cpu104.cs,cpu106.cs,cpu108.cs,cpu110.cs,cpu112.cs,cpu114.cs
}

Note that every member must already be defined as a host (in the old layout, earlier in hosts.cfg), and you must use its host_name, not the alias.
Test the config file like this: /usr/local/bin/nagios -v /usr/local/etc/nagios/nagios.cfg - it should report zero errors. If it reports any, fix them all.
Once that's satisfied, check your changes in and restart nagios: /usr/local/etc/rc.d/nagios restart.

Adding a contact

ssh to nagios.cscf
sudo to root.
Set your LOGNAME and USER variables for RCS purposes, then change to the directory where config files are stored (/usr/local/etc/nagios).
Check out and edit the file contacts.cfg.
- add an entry similar to the following:

define contact{
 contact_name lfolland
 alias Lawrence Folland
 service_notification_period 24x7
 host_notification_period 24x7
 service_notification_options w,u,c,r
 host_notification_options d,u,r
 service_notification_commands notify-by-email
 host_notification_commands host-notify-by-email
 email lfolland@cs.uwaterloo.ca
}

Test the config file like this: /usr/local/bin/nagios -v /usr/local/etc/nagios/nagios.cfg - it should report zero errors. If it reports any, fix them all.
Once that's satisfied, check your changes in and restart nagios: /usr/local/etc/rc.d/nagios restart.

Adding a contact group

First make sure that all of the contacts that you will list have been added individually (see above)
ssh to nagios.cscf
sudo to root.
Set your LOGNAME and USER variables for RCS purposes, then change to the directory where config files are stored (/usr/local/etc/nagios).
Check out and edit the file contacts.cfg.
- add an entry similar to the following:

define contactgroup{
 contactgroup_name cscf-rsg
 alias CSCF Research Group
 members mpatters,lfolland,magore,trg
}

Test the config file like this: /usr/local/bin/nagios -v /usr/local/etc/nagios/nagios.cfg - it should report zero errors. If it reports any, fix them all. Especially make sure that all individual contacts added have their own entry.
Once that's satisfied, check your changes in and restart nagios: /usr/local/etc/rc.d/nagios restart.
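
The check that every group member has its own contact entry can be done before running the config test. A minimal sketch, assuming a hypothetical helper named missing_contacts and plain alphanumeric userids; the demo runs against a throwaway file, not the real contacts.cfg:

```shell
#!/bin/sh
# Hypothetical helper: print each member of a comma-separated list that
# has no matching "contact_name" line in the given contacts file.
missing_contacts() {
    # $1 = members list (e.g. mpatters,lfolland), $2 = contacts file
    echo "$1" | tr ',' '\n' | while read -r name; do
        [ -n "$name" ] || continue
        grep -Eq "^[[:space:]]*contact_name[[:space:]]+$name[[:space:]]*\$" "$2" \
            || echo "$name"
    done
}

# Demo on a temporary file that defines only lfolland.
demo=$(mktemp)
printf 'define contact{\n contact_name lfolland\n}\n' > "$demo"
missing_contacts mpatters,lfolland "$demo"   # prints: mpatters
rm -f "$demo"
```

On the real machine the second argument would be /usr/local/etc/nagios/contacts.cfg; an empty result means every member has an entry.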

Disabling a host from monitoring

This is useful if you will be shutting down a machine and don't want Nagios sending out lots of email about it being down!

Removing a host from monitoring

in hosts.cfg
- remove the entry for the machine name
- remove any references to it in any groups
in services.cfg
- remove all services being monitored for that machine
Check your Nagios config
Restart Nagios
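
Before removing entries it can help to list every place the host is mentioned. A minimal sketch, assuming a hypothetical helper named find_host_refs; the demo uses a temporary directory with made-up files, and on the real machine the directory would be /usr/local/etc/nagios:

```shell
#!/bin/sh
# Hypothetical helper: list every line under a config directory that
# mentions the given host name (fixed-string match, recursive).
find_host_refs() {
    # $1 = host_name, $2 = config directory
    grep -rnF -- "$1" "${2:-/usr/local/etc/nagios}"
}

# Demo on a throwaway directory.
demo=$(mktemp -d)
printf 'define host{\n host_name zonker\n}\n' > "$demo/hosts.cfg"
printf 'define service{\n host_name softbase.math\n}\n' > "$demo/services.cfg"
find_host_refs zonker "$demo"   # shows the matching line in hosts.cfg only
rm -rf "$demo"
```

Because this is a substring match, a short name such as db would also match longer names that contain it, so the output still needs to be read by hand.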

Checking Apache virtual hosts

The easiest way to do this is to define the virtual host to be the same as the "real" host, but with a different name. For instance:

define host{
 use generic-host
 host_name softbase.math
 alias softbase
 address softbase.math.uwaterloo.ca
 contact_groups cscf-rsg
}

define host{
 use generic-host
 host_name db
 alias db
 address db.uwaterloo.ca
 contact_groups cscf-rsg
}

Then you can monitor the Apache virtual host db.uwaterloo.ca like this:

define service{
 use generic-service
 host_name db
 service_description DBWEB
 is_volatile 0
 contact_groups cscf-rsg
 check_command check_http
}

This likely results in double-pinging the underlying machine, though. There may be a better way to do it.

Checking services for hosts you can't ping

Some people can't or won't pass ICMP echoes through their firewalls. One such example is zonker.

define host{
 use generic-host
 host_name zonker
 alias zonker
 address zonker.cs.uwaterloo.ca
 check_command check_none
 contact_groups cscf-rsg
}

Here, the key is check_command. The command it names is defined in checkcommands.cfg:

define command{
 command_name check_none
 command_line $USER1$/check_dummy 0
}

This will always return "OK", so nagios thinks the machine is always up.
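
Nagios decides a check's result from the plugin's exit status: 0 is OK, 1 is WARNING, 2 is CRITICAL and 3 is UNKNOWN. The sketch below is a simplified stand-in for check_dummy (a hypothetical function named dummy_check, not the real plugin) that shows why passing 0 makes the host look up:

```shell
#!/bin/sh
# Simplified stand-in for check_dummy: echo a state name and return the
# exit status that Nagios would interpret.
dummy_check() {
    case "$1" in
        0) echo "OK";       return 0 ;;
        1) echo "WARNING";  return 1 ;;
        2) echo "CRITICAL"; return 2 ;;
        *) echo "UNKNOWN";  return 3 ;;
    esac
}

dummy_check 0
echo "exit status: $?"   # prints: exit status: 0
```

For a host check, exit status 0 means the host is treated as up, so the service checks for zonker run even though it never answers a ping.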

Check your config

Test the config file like this: /usr/local/bin/nagios -v /usr/local/etc/nagios/nagios.cfg
- it should report zero errors. If it reports any, fix them all.

Restart Nagios

First check your config (see above)
Check your changes in:
- ci -u hosts.cfg
- ci -u services.cfg
- ci -u contacts.cfg
Restart nagios: /usr/local/etc/rc.d/nagios restart
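
The whole edit cycle can be sketched as one shell function. This is an illustration only: edit_cycle and run are hypothetical names, and "co -l" (RCS check-out with a lock) is an assumption about how files are checked out, since this page only shows the check-in command. With DRY_RUN=1 the commands are printed instead of executed:

```shell
#!/bin/sh
# Sketch of the check-out / edit / verify / check-in / restart cycle.
# DRY_RUN=1 prints each command; set it to 0 to actually run them.
DRY_RUN=1
NAGIOS_ETC=/usr/local/etc/nagios

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

edit_cycle() {
    # $1 = config file, relative to $NAGIOS_ETC (e.g. contacts.cfg)
    run cd "$NAGIOS_ETC" || return 1
    run co -l "$1" || return 1                 # assumed check-out step
    run "${EDITOR:-vi}" "$1" || return 1
    # Stop here if the config test fails; do not check in or restart.
    run /usr/local/bin/nagios -v "$NAGIOS_ETC/nagios.cfg" || return 1
    run ci -u "$1" || return 1
    run /usr/local/etc/rc.d/nagios restart
}

edit_cycle contacts.cfg
```

Chaining each step with "|| return 1" means a failed config test leaves the running nagios untouched, which matches the rule above that all errors must be fixed before restarting.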

-- MikePatterson - 21 Feb 2005, 09 May 2005 (with help from LawrenceFolland), 21 April 2006

Topic revision: r24 - 2012-04-20 - LawrenceFolland

Information in this area is meant for use by CSCF staff and is not official documentation, but anybody who is interested is welcome to use it if they find it useful.
