Nagios Systems Monitoring and Reporting

CSCF uses Nagios to monitor and report on hosts. Our setup is described on this page, including extensions to Nagios to integrate with our inventory system.

CSCF Master ST items

ST#89921 and ST#95825.

Project Overview

Nagios runs on nagios.cscf.uwaterloo.ca, a virtual container hosted on asgard.cs.uwaterloo.ca. NagiosQL is currently installed on the system for configuration management. NagVis should be added in the future.

Nagios will notify the DNS Contact for inventory synchronized hosts.

General Information

The new Nagios monitoring system integrates with inventory to retrieve host information, support group classifications and default contact information. Nagios's list of services that can be monitored is also integrated with inventory. These services can be added to hosts through inventory (see below) and are updated into Nagios immediately upon saving. For security and logistical reasons inventory has no write capabilities on the Nagios server; because of this, nothing propagates either to or from inventory except during updates. Basic configuration for hosts can (and should) be done through the inventory system; for more complicated setups, please use NagiosQL to perform the configuration. In general, if a configuration item is not specifically synchronized from inventory (such as the default contact or the hostname), it should be safe to modify it through NagiosQL.

The CSCF Help Desk is configured to receive all notifications regardless of how the machine or service was added or who else may or may not be receiving notifications. The CSCF Help Desk is required to follow up every notification and ensure that the problem is associated with an ST and that the appropriate person is assigned the ST. Please ensure that it is clear in inventory who this person should be for each machine you add.

All configuration is stored in a database and is written out to Nagios' native configuration files on every update. If you modify a config file with NagiosQL information at the top (located in /etc/nagios3/conf.d/) your changes will be lost the next time the inventory record is saved.

The system described herein is designed for monitoring high-availability systems and machines. Any and all problems reported by it will be followed up. Please do not monitor test and workstation machines with this system. (Lab machines that need to be monitored should be added so that they are only reported on during the hours when the lab is open.)

Machine Notes

Accessing Nagios

The home page for the Nagios monitoring system can be found at https://nagios.cscf.uwaterloo.ca/. The system provides several interfaces for interacting with it. These are as follows:

Inventory

Located at: https://cs.uwaterloo.ca/cscf/internal/inventory/web/inventory/

Provides a basic interface which should be sufficient for most common applications. Any record with an IP address and hostname has a "Services" tab, which provides the capability to add and remove predefined services on hosts. See InventoryUserDocs for more details.

This should be used to configure most machines with relatively standard configurations.

If you need to visit the Nagios CGI interface, there is a link within the Services tab; you can also reach it via the Nagios front page (see directly below).

Nagios CGI Interface

Located at: https://nagios.cscf.uwaterloo.ca/nagios3/

This is the traditional Nagios interface. It should be used for viewing the status of the network, monitoring the status of individual machines or services, scheduling downtime, and acknowledging issues/outages. It does not provide the capability to make changes to the overall configuration. (The 'System' → 'Configuration' section is for viewing the configuration only; see below under NagiosQL Web Administration Panel for updating the configuration.)

Following are some commonly needed Nagios tasks, and how to do them:

Displaying all details for a given host

To display all monitoring details for a given host, you can either start from Inventory in the "Services" tab and choose "Nagios Record"; or in Nagios, you can browse the list of hosts by choosing the "Hosts" item from the left-hand menu.

Displaying all hosts with a monitored service

To display all hosts with a monitored service, choose the "Host Groups" item from the left-hand menu.

NagiosQL Web Administration Panel

Located at: https://nagios.cscf.uwaterloo.ca/nagiosql32/

This provides an advanced web based configuration system that allows for extremely flexible configuration. This should be used to set up machines that require complicated or unusual monitoring requirements, to clean up the database and remove hosts that no longer exist (even if they were added through inventory), and to configure the services that are available for selection through inventory.

E-Mail

Nagios will occasionally send out email notifications. These notifications will contain links to relevant sections of the Nagios CGI interface. It is not necessary to reply to these emails; response, if necessary, should be completed through the Nagios CGI interface as well as by fixing any problems outlined in the email received.

MySQL Database and PHPMyAdmin

Located at: https://nagios.cscf.uwaterloo.ca/phpmyadmin/

This interface should not be used. It is provided for the administration of the system itself, not of the hosts being monitored. The underlying database is not particularly straightforward and is not suited for casual modification. If in doubt DO NOT TOUCH. No one will thank you when you fill their inbox up with notifications about non-existent problems, or when Nagios does not warn them of a major outage. If you really need to modify something, be very careful and MAKE A BACKUP FIRST. Also consider if there are other options.

Unless you are modifying/updating/enhancing the script that integrates with inventory there should be absolutely no reason to modify the database directly. Use NagiosQL instead.

Force Update

Located at: https://nagios.cscf.uwaterloo.ca/force-update/

This barely qualifies as an interface, but it may prove quite useful when doing large amounts of administrative work or when feeling impatient. This interface can be accessed using any of the accounts for the CGI (it uses the same authentication system) and simply allows an individual to request an immediate update to Nagios (It actually runs it while the page is loading). It is also useful in the case where something isn't working and Nagios is not getting updated as expected, as this gives the output of the script.

Inventory Services Tools

Located at: https://nagios.cscf.uwaterloo.ca/inv-services-tools/

This is a very minimalistic interface providing the capability to alter the display name of services that are configured through inventory. Since these services are all internally referred to by their display name, it is necessary to update the name in multiple places. This script allows you to accomplish that easily and without breaking things. Changes are immediate in inventory and NagiosQL and propagate to Nagios at the next sync. This interface also displays the contents of the note field of the host group from NagiosQL.

Nagios Restart History

This provides the capability to view the restart logs, which are kept for 288 restarts (it used to be 3 days... which was 288 restarts...).

Located at: https://nagios.cscf.uwaterloo.ca/naghistory/

There are also log files stored on the server for every API call, in /home/nagios-script/logs/

Usage Instructions

Adding a new host via Inventory

Monitoring for hosts can now be set up through the inventory system. Visit the inventory page for the item you wish to add services to. There you will find a section labelled "Services" immediately under the pre-expanded "General" section. Expanding this section reveals an interface for adding and removing services. Use the provided buttons and menu to configure the services needed for that particular host. Service changes are saved into inventory's database immediately, but they are not propagated to Nagios until the next update.

The check boxes under the "Monitoring" column allow monitoring of individual services to be disabled without removing the services from the host. Services disabled this way are, however, removed from Nagios at the next update.

To set the primary contact of the host, it is only necessary to put the user id of the correct person/account into the 'dns_contact' field under the "DNS" section (currently immediately below the "Services" section). Hosts are automatically grouped according to the "Groups" field in the "Support" section, so please ensure that this is set correctly.

Each of the services available in inventory is set up to be fairly generic, as there is currently no room to configure them per machine. The full list of services, as well as documentation on each service, is available at the Inventory Services Tools page.

Removing a host via Inventory

To remove a host through inventory, it is only necessary to remove all of its services from the services list, or to uncheck the monitoring check box for each of them. At the next update the host will be disabled in NagiosQL and removed from the Nagios configuration. The host is only disabled, not deleted. If the machine is reactivated through NagiosQL while it has no services from inventory on it, the sync system will deactivate it again. Note that a disabled host is no longer kept up to date by the inventory sync system (updating all hosts from inventory is a planned feature).

Removing a host from Nagios via inventory does not immediately delete the host's history. The history is stored in the event log and is deleted as the logs are rotated (the same schedule applies to hosts that still exist). If the host is re-added to Nagios, it will be logically re-attached to any old history remaining, as long as it has not been renamed. (This is really only an appearance, as Nagios simply searches the log files for occurrences of the machine's name.) Similarly, services can be removed and re-added without alteration to their event history, as it is stored in the same set of rotated log files.

Using NagiosQL

The suggested method for adding hosts to NagiosQL is to use inventory and add any appropriate services, then use NagiosQL to add any remaining services. If none of the services in inventory apply to the machine in question, adding the service 'Alive (ping)' will cause the host to be added to the system, without any services, at which point services can be configured on the machine (through NagiosQL) and the machine will be kept up to date with inventory through the sync system.

NagiosQL is relatively straightforward and there is a fair amount of explanation for the various options within the software itself. If you are looking for information on a specific option, it is suggested that you view the option documentation within the NagiosQL web interface. NagiosQL is laid out in a somewhat logical manner, although the menu structure can be a little confusing at first. The Supervision section contains items relating to hosts and services. The Alerting section deals with contacts and notifications. The Commands section contains just one entry, Definitions, which is where new monitoring and notification commands are added and old ones modified. The Specialties section is for more advanced configuration of hosts and services (it exposes the ability to have certain hosts or services depend on others, allowing for complex relationships). The Tools section deals with daemon and CGI config, as well as getting information from NagiosQL to Nagios. Finally, the Administration section exists for the purpose of maintaining NagiosQL itself, and has no direct effect on Nagios.

Host management in NagiosQL is done through two configuration panels: Supervision → Hosts and Supervision → Host templates. These correspond directly with their equivalents in Nagios. Hosts can be added or removed as one would expect, services can be attached to them and contacts can be selected. There is also an "Active" check box. When it is checked the host is written to the Nagios config files; when it is unchecked the host is removed from them. This behavior also applies to most other configuration objects. On a related note, when inventory "removes" a host (that is, the host no longer has services listed for it), the sync script sets the host to inactive rather than deleting it.

Services are similarly managed through Supervision → Services and Supervision → Service templates. These once again are directly related to the services and servicetemplates in Nagios, and behave as such.

It is important to note that records (of any type) beginning with 'inv-' are specific to the inventory sync system; this pattern is used as an internal identifier. For contacts this means that they were added by the sync system, but for host-groups this means that they propagate to inventory. The services listed in inventory are actually host-groups in Nagios: they are the host-groups beginning with 'inv-', and the name seen in inventory is the "description" field. Hosts that have these services assigned to them through the inventory system are added to these host-groups. To allow Nagios to actually monitor what these "services" represent, there is a set of Nagios services that have been assigned to the correct host-groups. This is not necessarily a one-to-one relationship. For instance, the System Health service in inventory is the inv-health host-group, which (as of August 23, 2013) contains 3 services. Similarly, a service could be added to more than one host-group if it were appropriate.

On a related note, the sync system matches the inventory services with the Nagios host-groups by a textual comparison. Thus, if you change the description of an 'inv-' host-group, inventory will be updated to offer the new option and the old one will be removed; however, existing hosts will not be migrated and will no longer have that service monitored. This is because the sync script has no way of knowing what the service used to be called, or whether it is actually a new service and the missing one was deleted. See 'Rename Inventory Service' above.

User Management

Nagios CGI Interface

Nagios now uses CAS for user authentication.

In order to administer users for the Nagios CGI Interface it is necessary to have root privileges at a shell on nagios.cscf.uwaterloo.ca.

Adding a User

There are two parts to adding a user in the Nagios CGI: the first is to add the user and the second is to give the user privileges. Both require shell access to nagios-202.cscf.uwaterloo.ca. Once logged in, add the user by editing the file /etc/nagios3/htpasswd.groups to add the username to the list.

The second part is achieved by editing the file /etc/nagios3/cgi.cfg. The following lines contain the relevant configuration values. Line numbers are correct as of April 24th, 2014.

132: authorized_for_system_information=nagiosadmin,cscf-op,cscf-adm
144: authorized_for_configuration_information=nagiosadmin,cscf-op,cscf-adm
157: authorized_for_system_commands=nagiosadmin,cscf-op,cscf-adm
170: authorized_for_all_services=nagiosadmin,cscf-op,cscf-adm
171: authorized_for_all_hosts=nagiosadmin,cscf-op,cscf-adm
184: authorized_for_all_service_commands=nagiosadmin,cscf-op,cscf-adm
185: authorized_for_all_host_commands=nagiosadmin,cscf-op,cscf-adm
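
The edit to cgi.cfg can also be scripted. The following is only a minimal sketch and is not part of the existing tooling: grant_cgi_user is a hypothetical helper that prints a copy of the file with a username appended to the seven directives listed above and leaves every other line untouched. It assumes each directive sits on a single uncommented line of the form name=user1,user2.

```shell
# Hypothetical helper (not part of the existing tooling): print a copy of
# cgi.cfg with USER appended to the seven authorized_for_* directives
# listed above. The file itself is never modified.
grant_cgi_user() (
    user=$1
    cfg=$2
    directives="system_information configuration_information system_commands all_services all_hosts all_service_commands all_host_commands"
    awk -F= -v u="$user" -v list="$directives" '
        BEGIN {
            n = split(list, k, " ")
            for (i = 1; i <= n; i++) want["authorized_for_" k[i]] = 1
        }
        ($1 in want) && NF >= 2 {
            # Look for an exact match so the user is never listed twice.
            found = 0
            m = split($2, names, ",")
            for (i = 1; i <= m; i++) if (names[i] == u) found = 1
            if (!found) {
                sep = ($2 == "") ? "" : ","
                $0 = $0 sep u
            }
        }
        { print }
    ' "$cfg"
)
```

For example, grant_cgi_user jdoe /etc/nagios3/cgi.cfg > /tmp/cgi.cfg.new writes the proposed file to /tmp so that it can be compared with the original using diff before being copied into place. The username jdoe and the output path are placeholders.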

Changing a User's Password

The user's password can be changed through the normal methods for changing the WatIAM password.

Removing a User

Removing a user is achieved by removing the user from the list in /etc/nagios3/htpasswd.groups

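The user should also have all privileges revoked. This can be achieved with the following two commands (GNU sed syntax), replacing <username> with the account name:

sed -i -E '/^authorized_for_/ s/(=|,)<username>(,|$)/\1\2/g' /etc/nagios3/cgi.cfg
sed -i -E '/^authorized_for_/ { s/,,+/,/g; s/=,/=/; s/,$//; }' /etc/nagios3/cgi.cfg

The first command removes the username wherever it appears as a complete entry in one of the authorized_for_ lists, so that other usernames which merely contain it are left alone. The second removes the stray commas that may be left behind: doubled commas if the user was in the middle of a list, a leading comma if the user was first, and a trailing comma if the user was last.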

NagiosQL Web Administration Interface

NOTE: NagiosQL uses CAS for user authentication. The list of authorized users is managed through the NagiosQL web interface. Any CAS authenticated user who is not authorized by NagiosQL will be presented with the NagiosQL login page, where cscf-op and cscf-adm credentials will work.

In NagiosQL all user administration is done through the user administration panel. To access this panel log into NagiosQL as an administrator (cscf-op or cscf-adm should work just fine) and using the menu on the left, navigate to Administration → User admin.

Adding a User

To add a user, click Add and fill out the fields in red. The description should be either the real name of the person (for a personal account) or the role that someone using that account would be filling (for a non-personal account). Users are automatically administrators. (More precisely, there are just no access restrictions set up. If those were to be set up, users would have to be added to specific groups to do specific things, which is a fair amount of work to set up and maintain.) If group administration privileges are desired (these enable the user to add and remove people from groups), put a check mark in the box labeled "Enable group administration". To enable CAS authentication for this user, check the box "Web server authentication"; if you do not check this box, the user will have to log in with NagiosQL credentials. Save the user for changes to be applied.

Regardless of the setting of "Web server authentication", the user needs to be assigned a password. If the user will be authenticated with CAS ("Web server authentication" is checked), there is no need to set a password that can be remembered; it can be entirely random. However, because of the mechanisms used by NagiosQL to authenticate users, anyone who gets hold of a username and its NagiosQL password can log in with it, even if the account is supposed to be authenticated through CAS. Please be sure to use a strong password. To generate a password on a Linux machine, dd bs=4K count=1 if=/dev/urandom | sha256sum can be used. It may also work on Windows (via Cygwin) and on Mac OS X. There should be no practical way to reproduce a password generated in this fashion.
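
The password command above prints the hash followed by a file name marker, and dd also writes transfer statistics to the terminal. The following minimal sketch wraps the same idea in a hypothetical gen_password function that prints only the 64 hexadecimal digits, which is convenient for pasting into the NagiosQL form. It assumes dd, sha256sum and cut are available, as on a typical Linux machine.

```shell
# Hypothetical wrapper around the command in the text: hash 4 KiB of
# kernel randomness and print only the 64 hex digits of the digest.
gen_password() {
    dd bs=4096 count=1 if=/dev/urandom 2>/dev/null | sha256sum | cut -d ' ' -f 1
}

gen_password
```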

Changing a User's Password

NagiosQL Authenticated Users

For users who are authenticated through NagiosQL (not CAS) use the procedure below. The cscf-op and cscf-adm accounts are of this type.

A user can change their own password by visiting Administration → New Password in the web interface, filling out the details and clicking Save. Another user's password can be reset by visiting Administration → User admin, clicking the wrench and screwdriver icon in the rightmost column of the user's row in the table, typing a new password into the appropriate fields and clicking Save. (All other attributes can be changed in the same manner; simply edit the appropriate fields.)

CAS Authenticated Users

These users' passwords can be changed through the standard mechanisms, such as WatIAM.

Removing A User

Visit Administration → User admin and click the garbage can icon on the far right in the row corresponding to the user. This works for both users authenticated by CAS and those authenticated locally.

PHPMyAdmin and MySQL

Adding a User

Navigate to Home → Privileges, click "Add a new user" and fill in a user name and password. Scroll down to Global privileges and click check all, then select "Create user". Click the "Edit privileges" link in the "Action" (last) column for the user that was just created, scroll down to "Change Login Information / Copy User" and change the host to localhost, then click the "Go" button immediately below that section.

Changing a User's Password

This must be performed for each entry that the user has (usually two). First click "Edit privileges" in the far right column, then scroll down to the third section, "Change Password", enter a new password, and click "Go" (the one immediately below that section).

Removing a User

Navigate to Home → Privileges, select the two rows pertaining to the user, scroll down to the section labeled "Remove selected users" and click "Go". The user has now been completely removed from the database server.

What to do if Nagios is broken?

The question remains, what happens when Nagios breaks and Dennis is not here to fix it? Not to worry, below you'll find some documentation which describes how you can try and troubleshoot Nagios problems to find the root cause of a downtime/broken component.

Overview

It's important to understand the basic flow of Nagios data and how the various systems that make up Nagios interact.

nagios-flow.png

I'll summarize the idea that the image is trying to portray. Nagios3 is the base of the whole Nagios system and is the Nagios monitoring system itself. NagiosQL feeds Nagios3 data by creating the configuration files that Nagios3 uses. NagiosQL is a web interface that allows easy modification of the Nagios3 configuration files, and avoids the user having to manually go into the Nagios3 config files and edit them. NagiosQL also provides other features such as disabling the status of hosts, but for the purpose of this chart, NagiosQL provides Nagios3 the configuration files it requires to enable monitoring of devices. The NagiosAPI sends data to NagiosQL and modifies the NagiosQL database based on the API requests it receives. The Nagios API is simply a mechanism to interact with NagiosQL programmatically. Inventory works in both directions with the Nagios API in order to send data to Nagios, but the Nagios API also works in reverse and gets JSON data from the Inventory system to get details on a machine so that Nagios has the correct data filled in.

When inventory makes any kind of change to NagiosQL via the API, a reload file is created which indicates to Nagios3 that it should restart and load the new configuration files. Every minute a cron job checks whether this file exists; if it does, the job restarts Nagios3 and deletes the file. This mechanism avoids having to restart Nagios at a fixed interval, and instead only restarts it when changes are made that require a Nagios restart.
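
The reload mechanism can be pictured with a few lines of shell. This is only an illustration of the behaviour described above, not the cron script that is actually installed: check_reload is a hypothetical function, and the restart command is passed in as arguments so that the sketch can be tried without touching a real service.

```shell
# Illustration only: check_reload is hypothetical and is not the cron
# script installed on the Nagios machines.
# Usage: check_reload RELOAD_FILE RESTART_COMMAND [ARGS...]
check_reload() (
    reload_file=$1
    shift
    # No reload file means nothing changed, so no restart is needed.
    [ -e "$reload_file" ] || exit 0
    # Remove the marker so the next run does not restart again. If the
    # removal fails, the restart still happens and will repeat next time.
    rm -f "$reload_file"
    # Run the restart command given on the command line.
    "$@"
)
```

Run once a minute from cron, a call such as check_reload /var/nagios-api/reload service nagios3 restart would give the behaviour described, assuming the job runs as a user allowed to delete the file and restart the service.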

First steps

You may be asking: what is the first thing that you should do when something breaks with Nagios? Your best bet is to take a look at the Nagios3 service.

Checking the Nagios3 Service

When connected to the nagios system, you should check on the Nagios3 service. Nagios3 restarts all the time, so there is no harm in simply attempting a restart on it via a "sudo service nagios3 restart" (replace "restart" with a "start" if the service is not running).

If the Nagios system has problems starting during this process then it would indicate these problems clearly when you restart the service.

Looking into the Nagios3 log files

You can quickly take a look in the Nagios3 log files to see if there may be anything the restart did not show. The logs are located in /var/log/nagios3/. The two files to look at are nagios.log and livestatus.log.

Config files

At this point, if anything is broken, it is very likely due to a broken configuration file that was provided from NagiosQL. The messages Nagios3 logs to the console on restart, or the log files, should indicate what specifically is broken, and you should be able to look into NagiosQL and quickly figure out what is wrong. Once you fix the problem in NagiosQL you'll need to either make a change to services via inventory (for example, temporarily add the "Alive (ping)" service to your machine and then remove it, which will trigger a reload) or create the reload file directly by running sudo su -c "touch /var/nagios-api/reload" www-data. The "www-data" part is included because the file needs to be owned by a user whose files the web server has permission to delete; otherwise the file cannot be deleted and Nagios will keep restarting. Every minute a cron job on the Nagios system checks for this file. If it exists, the job triggers a Nagios restart, which loads the updated config files.

Every restart triggered by the reload file is logged in the following file: /var/www/private/nagios-restart.log. This could be a good location to check if there are any issues with Nagios not reloading new config files as this could indicate if there are any problems, including whether or not it is actually being restarted.

All of the nagios configuration files can be found in the /home/nagios-script/nagios-scripts/etc/ folder. Additionally you can find the configuration files for Nagios3 in /etc/nagios3/. This is the location which NagiosQL will write the configuration files to for Nagios3 to use.

Further debugging steps

Typically Nagios problems are caused by a configuration problem and will be fixed by a restart or a quick change to a NagiosQL config file. However, sometimes these simple restarts aren't going to fix the problem or point out any obvious issues. If there is no problem at the Nagios level, then it will typically be some kind of problem with the API layer or NagiosQL. These kinds of problems can also be troubleshot, though they may take a little longer to resolve as there is a significant amount of debug data that may need to be considered, especially if you don't know what you're looking for.

Nagios API Logs

The Nagios API stores very extensive logs, which means that if you don't know what you're looking for you could spend hours reading through them. To speed up this process you can use a tool named lnav. Lnav is already installed on the nagios machines, along with a custom format definition for the nagios logs that allows filtering by log level. You can enter commands in lnav much like in vim. Enter ":set-min-log-level" and then hit tab to see the available log levels, such as: critical, debug, error, fatal, info, trace, unknown, warning. Selecting one hides every message below that level, which can make searching for specific issues very efficient.

It is also important to know where the logs actually are. The logs are located at "/home/nagios-script/logs/". There is one log file for every single API request, including a simple GET request. The past 3 days' worth of log files are kept, and if there was an error in an API request then the file is kept for 90 days. You can open all the logs at once by running "lnav /home/nagios-script/logs/", which opens the whole directory's logs. Even if there are a lot of logs, lnav is very efficient and should have no problem opening them all in a few seconds.
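
When only recent requests are of interest, the set of files handed to lnav can be narrowed first. The sketch below uses a hypothetical recent_logs function built on find; it assumes nothing about the log file names beyond the directory given above.

```shell
# Hypothetical helper: list regular files under DIR that were modified
# less than DAYS days ago (DAYS defaults to 1).
# Usage: recent_logs DIR [DAYS]
recent_logs() {
    find "$1" -type f -mtime "-${2:-1}"
}
```

For example, lnav $(recent_logs /home/nagios-script/logs 1) would open only the logs from roughly the last day. The unquoted command substitution assumes that the log file names contain no spaces.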

The logs here are very verbose and contain almost everything you'd want to know about the whole process from when an API call is received to completing the request.

Altering the API log verbosity

The log verbosity settings for the Nagios API can be changed if you want more or fewer lines of debug data written out to the logs. You can find the config setting to change the verbosity in the main API configuration file (/home/nagios-script/nagios-scripts/etc/api.conf). Near the bottom of the file you'll find a configuration option named "log:verbosity:disk". At the time of writing it is set to 7; however, you can change this to another level. The different logging levels are as follows:

Log levels  
FATAL 0
ALERT 1
CRITICAL 2
ERROR 3
WARNING 4
NOTICE 5
INFORMATIONAL 6
DEBUG 7
TRACE 8
CONFIG 9
CALL_TRACE 10
DEPTH 11
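
When scripting around the verbosity setting, it can be handy to translate a level name into its number. The following sketch encodes the table above in a hypothetical log_level_value function; it does not read or write api.conf.

```shell
# Hypothetical lookup that mirrors the table above: print the numeric
# value for a level name, or fail for an unknown name.
log_level_value() {
    case $1 in
        FATAL)         echo 0 ;;
        ALERT)         echo 1 ;;
        CRITICAL)      echo 2 ;;
        ERROR)         echo 3 ;;
        WARNING)       echo 4 ;;
        NOTICE)        echo 5 ;;
        INFORMATIONAL) echo 6 ;;
        DEBUG)         echo 7 ;;
        TRACE)         echo 8 ;;
        CONFIG)        echo 9 ;;
        CALL_TRACE)    echo 10 ;;
        DEPTH)         echo 11 ;;
        *) echo "unknown log level: $1" >&2; return 1 ;;
    esac
}
```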

Looking at the Nagios API code

If you need to look at the Nagios API code, or make changes to it, you can find the code in /home/nagios-script/nagios-scripts/bin/. Additionally, the main index page that an API call initially hits is at /home/nagios-script/nagios-scripts/var-www/index.php. Within the bin folder you'll find some PHP files. There is one PHP file for every type of API request you can make, such as Get, Fetch, Add, etc. This makes it easy to find the file you are looking for. Alongside these files there are also library files that Dennis created for the API to make use of. These can mostly be ignored unless you specifically need to modify the library, in which case I will assume you already know what you're doing.

SQL Queries in Nagios API

Sometimes you may need to modify an SQL query or look at the queries for debugging purposes. You'll find ALL of the queries the application uses in the etc/statements.conf file. Each statement has a unique named identifier, which is how the code refers to the query when it wants to use it.

Additional Notes

Linking To The Nagios CGIs

Linking to the Nagios CGIs is fairly straightforward, although the CGIs make use of frames, which complicates things a little bit. The format to link to a specific CGI page is as follows:

<URL of Nagios Installation>?corewindow=<encoded relative url of desired page>

  • The first URL is the one that is normally used to access Nagios, for example https://nagios-202.cscf.uwaterloo.ca/nagios3/
  • The second URL can be obtained by visiting the desired page, right-clicking on the main part and choosing "Show only this frame". The URL is now in the address bar; the domain part is not needed. For example, the host overview page is: /cgi-bin/nagios3/status.cgi?hostgroup=all&style=hostdetail
  • The next step is to encode this URL. This can be done with any of the URL encoders on the internet; I've had luck with http://meyerweb.com/eric/tools/dencoder/
  • In this example it becomes: %2Fcgi-bin%2Fnagios3%2Fstatus.cgi%3Fhostgroup%3Dall%26style%3Dhostdetail
  • Putting it all together gives: https://nagios-202.cscf.uwaterloo.ca/nagios3/?corewindow=%2Fcgi-bin%2Fnagios3%2Fstatus.cgi%3Fhostgroup%3Dall%26style%3Dhostdetail
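
The encoding step in the list above can also be done at a shell prompt instead of with a web-based encoder. The following is a minimal sketch in plain POSIX shell; urlencode and nagios_link are hypothetical helper names, and the encoder assumes ASCII input.

```shell
# Hypothetical helpers. urlencode percent-encodes every character
# except letters, digits and "-", ".", "_", "~" (ASCII input assumed).
urlencode() (
    s=$1
    out=
    while [ -n "$s" ]; do
        rest=${s#?}          # everything after the first character
        c=${s%"$rest"}       # the first character itself
        case $c in
            [A-Za-z0-9._~-]) out=$out$c ;;
            *) out=$out$(printf '%%%02X' "'$c") ;;
        esac
        s=$rest
    done
    printf '%s\n' "$out"
)

# nagios_link BASE_URL RELATIVE_URL: build the ?corewindow= link.
nagios_link() {
    printf '%s?corewindow=%s\n' "$1" "$(urlencode "$2")"
}

nagios_link 'https://nagios-202.cscf.uwaterloo.ca/nagios3/' \
    '/cgi-bin/nagios3/status.cgi?hostgroup=all&style=hostdetail'
```

The final command rebuilds the example link from the list above.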

Managing Notices on the Homepage

The Nagios Monitoring system's homepage (https://nagios.cscf.uwaterloo.ca/) reads notices from a file and displays them in a notices section at the top of the page. The format for the file is as follows:

![Heading] Body of notice

If the exclamation point is present then the notice will not appear on the page. If it is not present then the part in square brackets becomes the title of the notice and the rest becomes the body. Note that after the first set of square brackets any characters are legal and will simply be sent to the browser. The only exception is the newline character, as this is used as the delimiter between notices.

The file is located on nagios.cscf.uwaterloo.ca at /var/www/private/notices

It should also be noted that the heading used for the title is a heading 3 (<h3></h3>)
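
To preview how a notices file will be interpreted before putting it in place, the format rules above can be approximated with a short awk filter. This is a sketch of the described format only, not the code the homepage actually uses: render_notices is a hypothetical name, and printing the body on its own line after the heading is an assumption about the output.

```shell
# Hypothetical preview of a notices file, following the rules above:
# lines starting with "!" are hidden, "[Heading]" becomes an <h3> title
# and the remainder of the line is passed through as the body.
render_notices() {
    awk '
        /^!/ { next }
        match($0, /^\[[^]]*\]/) {
            title = substr($0, 2, RLENGTH - 2)
            body = substr($0, RLENGTH + 1)
            sub(/^ /, "", body)
            printf "<h3>%s</h3>\n%s\n", title, body
        }
    ' "$1"
}
```

For example, render_notices /var/www/private/notices prints one heading and body for each notice that is not hidden.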

Restarting the Nagios Daemon

If the Nagios daemon is not working for some reason (or has failed to start because of configuration issues which have since been resolved), it can be restarted with sudo service nagios3 restart or started with sudo service nagios3 start.

API Program Technical Overview

The script is kept in a git repository which is checked out into /home/nagios-script/nagios-scripts/ on nagios.cscf. Within this directory are the directories bin and etc, which contain the program and its configuration respectively. Inside another directory, var-www, is a file, api.php, which is the entry point for the API. The etc directory holds a number of configuration files. As much of the code is shared between several scripts and web frontends, their configuration is also stored here. The main script reads api.conf, which references three files: passwords.conf, main.conf, and api-statements.conf. The file passwords.conf contains the database information and other passwords; it should be readable only by owner and group, which should be nagios-script and nagios-data respectively. The file main.conf contains the general configuration and is shared by all of the scripts; it should contain sane defaults. The SQL statements are all stored in api-statements.conf. Note that there are no default statements: if the program goes to use a statement that is not provided by the configuration, it will encounter an error. There should never be a need to edit the statements. If you wish to change an option for only this script, you can add it to the end of the api.conf file, after the included files, as this will cause the value to be overridden in the program when the config is loaded.

Development Notes

The Nagios system is set up on a Blue/Green deployment system. The current production server is nagios-204.cscf.uwaterloo.ca, and the development system (previously production) is nagios-202.cscf.uwaterloo.ca. The general service address of this system is nagios.cscf.uwaterloo.ca. Both machines are currently LXC virtual containers hosted on asgard.cs.uwaterloo.ca. Configuration management is handled by NagiosQL. The system is integrated and synchronized with inventory upon every inventory update of records that have services monitored.

One developer's suggested workflow is detailed in this work item.

Architecture of Collection of Scripts

The system that provides the API service, which synchronizes inventory with the Nagios monitoring system and provides monitoring data to inventory, is written in object-oriented PHP and uses PHP Data Objects (PDO) for database integration. The main high-level logic is contained in the API class and in the Command class and its subclasses (Get, Fetch, Add, UpdateHost, UpdateService, and Delete), which are stored in reasonably named files. These use the Nagios and LiveStatus classes, stored in nagios.php and LiveStatus.php, for read and write access to the database and MK Livestatus. Those two classes contain mostly higher-level operations, while low-level code (such as initialization of PDO objects and statement management) is contained in the Database class (located in database.php). Once the controller classes have the data from the database, they either send it to Inventory (or whatever called the API) or use it to update their own data, in concert with external data provided in the API call and data collected from Inventory via the json action. Data is sent back to the database using the Nagios class.

The script makes use of an external configuration file which determines the exact behaviour of the script as well as defining the SQL statements and their parameters that are used to accomplish the synchronization.

Service Status (Inventory)

  1. A service is manually added to a machine in Inventory
    • Nagios immediately records the service as being OK
  2. Nagios begins monitoring the service
    • At this point when Nagios finds an error with the service it stores the error and sends emails.
  3. On the next display/refresh of the host page in Inventory, the updated status will be shown.

MK Livestatus

MK Livestatus is a module for Nagios that provides access to many of the internal structures within Nagios. It uses a language known as Livestatus Query Language (LQL), which is somewhat modelled after SQL, to provide access to the Nagios state data. The module enables access to the entire configuration data of Nagios, including hosts, contacts, services, users, and hostgroups, among others. It is accessed through a Unix socket located at /var/lib/nagios3/rw/live. A PHP wrapper script is also available at res/live.php in the git repository for this project.

There are also MKLSDriver and MKLSStatement, which together provide a PDO-like interface to MK Livestatus (I would have written a PDO driver but it would have taken too long).
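For quick experiments outside the PHP code, the socket can also be queried directly. The following is a minimal Python sketch, not part of the project's repository: build_query and run_query are hypothetical helper names, and the example table and columns (services, host_name, description, state) are standard Livestatus names chosen for illustration. The socket path is the one given above.

```python
import json
import socket

LIVESTATUS_SOCKET = "/var/lib/nagios3/rw/live"  # path given in this section


def build_query(table, columns, filters=()):
    """Build an LQL query string; a blank line terminates the query."""
    lines = ["GET " + table, "Columns: " + " ".join(columns)]
    lines += ["Filter: " + f for f in filters]
    lines.append("OutputFormat: json")
    return "\n".join(lines) + "\n\n"


def run_query(query, path=LIVESTATUS_SOCKET):
    """Send one query to the Livestatus Unix socket and decode the JSON reply."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(path)
        sock.sendall(query.encode("utf-8"))
        # Closing the write side tells Livestatus the request is complete.
        sock.shutdown(socket.SHUT_WR)
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return json.loads(b"".join(chunks).decode("utf-8"))
```

For example, run_query(build_query("services", ["host_name", "description", "state"], ["state != 0"])) would list the services that are not OK, provided the calling user has permission to read the socket.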

Environment Variables

When Nagios runs a check command it does not set any environment variables, so commands do not have access to many things that are often taken for granted (such as the home directory). Nagios also seems intent on making it very difficult to set them in the command entry. If environment variables are needed, it will likely be necessary to write a wrapper script that prepares the environment. (Source: http://readlist.com/lists/lists.sourceforge.net/nagios-users/2/14605.html)

Affected Commands

Commands which are known to be affected include:

  • check_mysql (and similar)
    • Looks for .my.cnf in home directory
    • Can use /etc/mysql/.cnf instead
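A wrapper of the kind described above could look like the following Python sketch. It is an illustration under assumptions, not a script that exists on the server: build_environment and run_plugin are hypothetical names, and the default PATH value is a guess at a reasonable minimum.

```python
import os
import pwd


def build_environment(home=None, path="/usr/local/bin:/usr/bin:/bin"):
    """Return a copy of the current environment with HOME (and PATH) filled in."""
    if home is None:
        # Look up the home directory of the user the check runs as.
        home = pwd.getpwuid(os.getuid()).pw_dir
    env = dict(os.environ)
    env["HOME"] = home
    env.setdefault("PATH", path)  # only set PATH if Nagios left it unset
    return env


def run_plugin(argv):
    """Replace this process with the plugin so its exit code and output reach Nagios unchanged."""
    os.execve(argv[0], argv, build_environment())
```

A wrapper script would end by calling run_plugin with the plugin's full path and arguments, and the Nagios command entry would invoke the wrapper in place of the plugin. With HOME set, check_mysql can then find .my.cnf in the home directory.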

Nagios

To store the documentation of the services in the NagiosQL database (needed for the new web page), the type of the notes column of the tbl_hostgroups table was changed from VARCHAR(255) to LONGTEXT, which can contain up to 4 GiB of text data.

Nagios API

An API is being implemented for Nagios. Documentation is available at NagiosAPI.

Features

  • Automatic updating of Nagios from NagiosQL
  • Interface with Inventory for host definitions and provided services
  • CAS Authentication

TODO

Nagios

  • Integrate with CAS - Done
  • SSL to allow (require) https - Done
  • get it backed up (talk to Guoxiang - gxshen) - Done
  • Show statuses of monitored services in inventory - Done
  • Only restart Nagios if configuration has actually changed. - Done
  • Setup NagVis
  • Two way sync with inventory
    • Sync of inv. hostgroups as services, support for modifications in NagiosQL
    • Allow hosts linked to inventory to be updated automatically from inventory without destroying modifications in NagiosQL
    • Support for recognizing hosts that have been added to NagiosQL and matching with inventory, allowing full sync.
      • Not likely to happen, add 'Alive (ping)' to the host in inventory, then configure in NagiosQL
    • A basic web front-end that allows for renaming of Inventory "services" and properly updates all records in Inventory. - Done
    • Batch management of services in inventory (all systems with service X listed switch to/add/remove service Y) ??? <--- Would be nice.

Inventory

  • Intelligent hiding of services section when not applicable (i.e. not a computer)
-- DennisBellinger - 2015-08-28
Topic attachments

  • Inv-Modifications-mockup.ods (23.8 K, 2013-08-06, DennisBellinger): An interface mockup of the addition to inventory
  • nagios-flow.png (17.1 K, 2016-04-29, JustinVisser): Nagios flow diagram
Topic revision: r65 - 2016-05-03 - JustinVisser
 