Monitoring Log Messages with Nagios Passive Checks

Nagios is powerful, industry-standard monitoring software that can track uptime, generate performance reports, and alert relevant personnel when something goes wrong. It’s probably the most important software I use on a daily basis.

Nagios’s primary mode of “service checking” is called an active check. Active checks reach out from the Nagios host and test if a service is doing what it’s expected to do. For example, you can use the ‘check_http’ plugin to verify that a web server is running, is returning the right text, or has an up-to-date SSL certificate. There are lots of plugins that come with a default Nagios installation that can be used to check the majority of anyone’s IT infrastructure. For checks that don’t already exist, it’s fairly easy to make your own.
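
As a rough illustration of how small a custom active check can be, here is a minimal sketch in Python. It relies only on the plugin convention that the exit code carries the state (0 OK, 1 warning, 2 critical, 3 unknown) and that the first line of output is the status text. The check itself (filesystem usage), the names `evaluate` and `main`, and the thresholds are assumptions made for illustration, not something shipped with Nagios.

```python
"""Hypothetical Nagios plugin sketch: warn when a filesystem is filling up."""
import shutil

# Plugin return codes understood by Nagios
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate(percent_used, warn=80.0, crit=90.0):
    """Map a usage percentage to a (return code, status text) pair."""
    if percent_used >= crit:
        return CRITICAL, f"DISK CRITICAL - {percent_used:.1f}% used"
    if percent_used >= warn:
        return WARNING, f"DISK WARNING - {percent_used:.1f}% used"
    return OK, f"DISK OK - {percent_used:.1f}% used"


def main(path="/"):
    """Print the status line and return the plugin exit code."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        print(f"DISK UNKNOWN - {exc}")
        return UNKNOWN
    code, text = evaluate(100.0 * usage.used / usage.total)
    print(text)
    return code

# A real plugin would end by exiting with the code, e.g. sys.exit(main())
```

The threshold logic is kept in its own function so it can be exercised without touching a real disk.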

Nagios also has another mode of service checking called a passive check. Passive checks rely on data coming in from an external source, like an SNMP trap or GPIO closure. Passive checks require a service outside of Nagios to detect the relevant event and then execute a script to inform Nagios; this service could be an SNMP trap sink (snmptt) or a GPIO event listener (gpio-watch).

Passive checks can also be used to monitor log messages that pass through rsyslog, the standard syslog application on the majority of Linux-based distributions. Rsyslog has powerful filtering and templating that can be utilized to integrate with Nagios. So, how can we set this up?

The first and most important step is to decide what log messages should generate alerts. For this example, we’ll be monitoring messages from smartd, the daemon in charge of monitoring SMART data on hard drives. Set up an rsyslog filtering rule for relevant messages in your rsyslog.conf file:

$ModLoad omprog
if $programname contains 'smartd' and ( \
    $rawmsg contains_i 'uncorrectable' or \
    $rawmsg contains_i 'unreadable' or \
    $rawmsg contains_i 'completed with error' or \
    $rawmsg contains_i 'error count increased' ) \
  and not ($rawmsg contains_i 'copyright') \
  then action(type="omprog" template="smartlogtranslator" binary="/usr/lib64/nagios/plugins/eventhandlers/smart-log-translator.pl")
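
To see which messages the rule is meant to catch before touching a live rsyslog, the same logic can be mirrored in a few lines of Python. This is only a sketch of the rule's intent: the function name `should_alert` is made up, and it assumes that `contains_i` behaves as a case-insensitive substring test and `contains` as a case-sensitive one.

```python
# Substrings that mark a smartd message as alert-worthy (taken from the rule above)
KEYWORDS = (
    "uncorrectable",
    "unreadable",
    "completed with error",
    "error count increased",
)


def should_alert(programname, rawmsg):
    """Return True when a message would pass the smartd filtering rule."""
    if "smartd" not in programname:
        return False
    lowered = rawmsg.lower()
    # The startup banner contains 'copyright' and is excluded by the rule
    if "copyright" in lowered:
        return False
    return any(keyword in lowered for keyword in KEYWORDS)
```

For example, `should_alert("smartd", "Device: /dev/sda [SAT], 24 Offline uncorrectable sectors")` is true, while the same text logged by another program is ignored.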

Before the filtering rule, you’ll need to define a template, which tells rsyslog how to format the data before handing it off to the binary listed at the end of the filtering rule.

# Templates are built into rsyslog, so no module has to be loaded for them.
$template smartlogtranslator,"%FROMHOST%;;_;;%syslogtag%;;_;;%msg%\n"

When you combine these two blocks of text into your rsyslog configuration, rsyslog will send the string “HOSTNAME;;_;;SYSLOGTAG;;_;;MESSAGE” to the “smart-log-translator.pl” script via stdin, where HOSTNAME, SYSLOGTAG, and MESSAGE are the actual hostname, syslog tag (typically something like “smartd[1234]:”), and syslog message received by rsyslog.
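
To make the field handling concrete, here is a minimal Python sketch that splits such a line and normalizes the hostname and tag. The function name `parse_line` is hypothetical, and the sample input assumes the tag arrives in the usual `name[pid]:` form.

```python
import re

DELIMITER = ";;_;;"


def parse_line(line):
    """Split one templated rsyslog line into (hostname, program, message)."""
    # Split at most twice so the message itself may contain the delimiter
    fields = line.rstrip("\n").split(DELIMITER, 2)
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields separated by {DELIMITER!r}")
    hostname, syslogtag, message = fields
    # Keep only the short, lower-case hostname
    hostname = hostname.split(".", 1)[0].lower()
    # Drop the [PID] part and the trailing colon from the tag
    program = re.sub(r"\[.*\]", "", syslogtag).replace(":", "")
    return hostname, program, message.strip()
```

Feeding it `"Host1.example.com;;_;;smartd[1234]:;;_;; Device: /dev/sda, 24 Offline uncorrectable sectors\n"` yields the short hostname `host1`, the program name `smartd`, and the message without its leading space.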

The script, called an “event handler” by Nagios, should look like this:

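#!/usr/bin/perl

use strict;
use warnings;

my $submit = '/usr/lib64/nagios/plugins/eventhandlers/submit_check_result';

# read one line from rsyslog; treat a closed or empty stdin as an empty line
my $line = <STDIN> // '';
chomp $line;

# split into at most three fields so the message may itself contain ";;_;;"
my ($hostname, $syslogtag, $message) = split /;;_;;/, $line, 3;

if (!defined $message) {
  print "Usage: echo 'HOSTNAME;;_;;SYSLOGTAG;;_;;MESSAGE' | smart-log-translator.pl\n";
  exit 1;
}

# grab only the short hostname, truncate everything else
$hostname =~ s/\..*//;
# make sure the hostname is all lower case (for windows computers)
$hostname = lc $hostname;
# remove the PID from the program name
$syslogtag =~ s/\[.*\]//g;
# remove the colon from the program name
$syslogtag =~ s/://g;
# drop the leading space that %msg% usually carries
$message =~ s/^\s+//;

# pass the arguments as a list so the shell never interprets the log message
system($submit, $hostname, "$syslogtag log", 1, $message);

exit;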

This relatively simple script parses the line it received from rsyslog and passes the pieces to the “submit_check_result” script provided by Nagios. The four arguments tell Nagios, in order, which host and service the result belongs to, the severity of the alert, and the status message. Keep in mind that the hostname and service name must match those defined in your Nagios configuration. A severity of “0” means OK, “1” means warning, “2” means critical, and “3” means unknown; the script above always submits “1”.

The “submit_check_result” script should look like this:

#!/bin/sh

# SUBMIT_CHECK_RESULT
# Written by Ethan Galstad (egalstad@nagios.org)
# Last Modified: 02-18-2002
#
# This script will write a command to the Nagios command
# file to cause Nagios to process a passive service check
# result.  Note: This script is intended to be run on the
# same host that is running Nagios.  If you want to 
# submit passive check results from a remote machine, look
# at using the nsca addon.
#
# Arguments:
#  $1 = host_name (Short name of host that the service is
#       associated with)
#  $2 = svc_description (Description of the service)
#  $3 = return_code (An integer that determines the state
#       of the service check, 0=OK, 1=WARNING, 2=CRITICAL,
#       3=UNKNOWN).
#  $4 = plugin_output (A text string that should be used
#       as the plugin output for the service check)
# 
 
echocmd="/bin/echo"
 
# Make sure CommandFile here matches the 'command_file' value in nagios.cfg
CommandFile="/var/spool/nagios/cmd/nagios.cmd"
 
# get the current date/time in seconds since UNIX epoch
datetime=`date +%s`
 
# create the command line to add to the command file
cmdline="[$datetime] PROCESS_SERVICE_CHECK_RESULT;$1;$2;$3;$4"
 
# append the command to the end of the command file
# (quoted so that spaces and wildcard characters in the message are written literally)
$echocmd "$cmdline" >> "$CommandFile"
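
The line written to the command file has a fixed shape: a bracketed UNIX timestamp, the command name, then semicolon-separated fields. A Python sketch of the same formatting follows. `format_passive_result` and `submit` are hypothetical helpers, the command file path is copied from the script above, and writing to it only works on the Nagios host with suitable permissions.

```python
import time

# Assumed to match 'command_file' in nagios.cfg, as in the shell script above
COMMAND_FILE = "/var/spool/nagios/cmd/nagios.cmd"
STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def format_passive_result(host, service, code, output, now=None):
    """Build one PROCESS_SERVICE_CHECK_RESULT external command line."""
    if code not in STATES:
        raise ValueError(f"return code must be one of {sorted(STATES)}")
    timestamp = int(time.time() if now is None else now)
    # A newline would end the command early, so collapse the output to one line
    output = " ".join(output.splitlines())
    return f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;{host};{service};{code};{output}\n"


def submit(line, command_file=COMMAND_FILE):
    """Write a command line to the Nagios command file (a named pipe)."""
    with open(command_file, "w") as handle:
        handle.write(line)
```

Calling `format_passive_result("host1", "smartd log", 0, "Disk replaced")` produces the kind of line that would return the service to OK once the underlying problem is fixed.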

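Now, all that’s left is to configure Nagios! You wouldn’t normally write a service definition this way, as you’d likely be using defined templates. I’m being explicit here to aid with configuration and troubleshooting. Note that Nagios counts the interval values in units of “interval_length” (60 seconds by default), so the numbers below read as seconds only if “interval_length” is set to 1 in nagios.cfg.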

define service{
  host_name                       host1,host2
  service_description             smartd log
  active_checks_enabled           0
  passive_checks_enabled          1
  obsess_over_service             1
  check_freshness                 0
  notifications_enabled           1
  event_handler_enabled           1
  flap_detection_enabled          0
  process_perf_data               0
  retain_status_information       1
  retain_nonstatus_information    1
  is_volatile                     0
  check_period                    24x7
  notification_period             24x7
  check_interval                  300
  retry_interval                  60
  max_check_attempts              1
  contact_groups                  admins
  notification_options            w,u,c,r,f
  notification_interval           3600
}

After this is all in place, restart Nagios. You should see the “smartd log” service in the Nagios WebUI. You can simulate a log message to test this service using the “logger” command.

logger -t smartd -- "Device: /dev/sda [megaraid_disk_04] [SAT], 24 Offline uncorrectable sectors"

Because freshness checking is disabled (check_freshness 0), Nagios will not reset the state of the service on its own; it stays in the warning state until an administrator intervenes. Because the service is non-volatile (is_volatile 0), you won’t receive a new alert for every subsequent message in the log for this service on this host, but you will continue to get alerted per the “notification_interval” value. When you’ve corrected the issue, simply submit an OK passive check result for the service to Nagios via the WebUI.