analysis.cfg - Configuration file for the xymond_client module
~Xymon/server/etc/analysis.cfg
The analysis.cfg file controls what color is assigned to the status-messages
that are generated from the Xymon client data - typically the cpu, disk,
memory, procs- and msgs-columns. Color is decided on the basis of some
settings defined in this file; settings apply to specific hosts through
a set of rules.
Note: This file is only used on the Xymon server - it is not used by the Xymon
client, so there is no need to distribute it to your client systems.
Blank lines and lines starting with a hash mark (#) are treated as comments and
ignored.
LOAD warnlevel paniclevel
If the system load exceeds "warnlevel" or "paniclevel", the
"cpu" status will go yellow or red, respectively. These are decimal
numbers.
Defaults: warnlevel=5.0, paniclevel=10.0
UP bootlimit toolonglimit [color]
The cpu status goes yellow/red if the system has been up for less than
"bootlimit" time, or longer than "toolonglimit". The time
is in minutes, or you can add h/d/w for hours/days/weeks - e.g. "2h"
for two hours, or "4w" for 4 weeks.
Defaults: bootlimit=1h, toolonglimit=-1 (infinite), color=yellow.
CLOCK max.offset [color]
The cpu status goes yellow/red if the system clock on the client differs more
than "max.offset" seconds from that of the Xymon server. Note that
this is not a particularly accurate test, since it is affected by network
delays between the client and the server, and the load on both systems. You
should therefore not rely on this being accurate to more than +/- 5 seconds,
but it will let you catch a client clock that goes completely wrong. The
default is NOT to check the system clock.
NOTE: Correct operation of this test obviously requires that the system
clock of the Xymon server is correct. You should therefore make sure that the
Xymon server is synchronized to the real clock, e.g. by using NTP.
Example: Go yellow if the load average exceeds 5, and red if it exceeds 10.
Also, go yellow for 10 minutes after a reboot, and after 4 weeks uptime.
Finally, check that the system clock is at most 15 seconds offset from the
clock of the Xymon system and go red if that is exceeded.
-
LOAD 5 10
UP 10m 4w yellow
CLOCK 15 red
DISK filesystem warnlevel paniclevel
DISK filesystem IGNORE
INODE filesystem warnlevel paniclevel
INODE filesystem IGNORE
If the utilization of "filesystem" is reported to exceed
"warnlevel" or "paniclevel", the "disk" status
will go yellow or red, respectively. "warnlevel" and
"paniclevel" are either the percentage used, or the space available
as reported by the local "df" command on the host. For the latter
type of check, the "warnlevel" must be followed by the letter
"U", e.g. "1024U".
The special keyword "IGNORE" causes this filesystem to be ignored
completely, i.e. it will not appear in the "disk" status column and
it will not be tracked in a graph. This is useful for e.g. removable devices,
backup-disks and similar hardware.
"filesystem" is the mount-point where the filesystem is mounted, e.g.
"/usr" or "/home". A filesystem-name that begins with
"%" is interpreted as a Perl-compatible regular expression; e.g.
"%^/oracle.*/" will match any filesystem whose mountpoint begins
with "/oracle".
"INODE" works identically to "DISK", but uses the count of
i-nodes in the filesystem instead of the amount of disk space.
Defaults DISK: warnlevel=90%, paniclevel=95%
Defaults INODE: warnlevel=70%, paniclevel=90%
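Example: A sketch combining the forms described above. The mount points and
thresholds are illustrative only; it assumes that the "U" suffix is accepted
on both thresholds, and that the numbers are in whatever unit the "df" command
on the client reports. Warn at 85% and alert at 95% on /usr; on /home, go
yellow with less than 1024 units available and red with less than 512; leave a
removable device out of the "disk" status; and apply tighter i-node limits to
all of the Oracle filesystems.
# Hypothetical mount points and thresholds
DISK /usr 85 95
# Assumes "U" may follow both values
DISK /home 1024U 512U
DISK /mnt/cdrom IGNORE
INODE %^/oracle.*/ 60 80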
MEMPHYS warnlevel paniclevel
MEMACT warnlevel paniclevel
MEMSWAP warnlevel paniclevel
If the memory utilization exceeds the "warnlevel" or
"paniclevel", the "memory" status will change to yellow or
red, respectively. Note: The words "PHYS", "ACT" and
"SWAP" are also recognized.
Example: Go yellow if more than 20% swap is used, and red if more than 40% swap
is used or the actual memory utilisation exceeds 90%. Don't alert on physical
memory usage.
-
MEMSWAP 20 40
MEMACT 90 90
MEMPHYS 101 101
Defaults:
-
MEMPHYS warnlevel=100 paniclevel=101 (i.e. it will never go red).
MEMSWAP warnlevel=50 paniclevel=80
MEMACT warnlevel=90 paniclevel=97
PROC processname [minimumcount [maximumcount [color]]] [TRACK=id] [TEXT=text]
The "ps" listing sent by the client will be scanned for how many
processes containing "processname" are running, and this is then
matched against the min/max settings defined here. If the running count is
outside the thresholds, the color of the "procs" status changes to
"color".
To check for a process that must NOT be running: Set minimum and maximum to 0.
"processname" can be a simple string, in which case this string must
show up in the "ps" listing as a command. The scanner will find a
ps-listing of e.g. "/usr/sbin/cron" if you only specify
"processname" as "cron". "processname" can also
be a Perl-compatible regular expression, e.g. "%java.*inst[0123]"
can be used to find entries in the ps-listing for "java -Xmx512m
inst2" and "java -Xmx256 inst3". In that case,
"processname" must begin with "%" followed by the regular
expression. Note that Xymon defaults to case-insensitive pattern matching; if
that is not what you want, put "(?-i)" between the "%" and
the regular expression to turn this off. E.g. "%(?-i)HTTPD" will
match the word HTTPD only when it is upper-case.
If "processname" contains whitespace (blanks or TAB), you must enclose
the full string in double quotes - including the "%" if you use
regular expression matching. E.g.
PROC "%xymond_channel --channel=data.*xymond_rrd" 1 1 yellow
or
PROC "java -DCLASSPATH=/opt/java/lib" 2 5
You can have multiple "PROC" entries for the same host; all of the
checks are merged into the "procs" status and the most severe check
defines the color of the status.
The optional TRACK=id setting causes Xymon to track the number of
processes found in an RRD file, and put this into a graph which is shown on
the "procs" status display. The id setting is a simple text
string which will be used as the legend for the graph, and also as part of the
RRD filename. It is recommended that you use only letters and digits for the
ID.
Note that the tracked process counts are sampled only once per client poll
cycle - i.e. the counts represent snapshots of the system
state, not an average value over the client poll cycle. Therefore there may be
peaks or dips in the actual process counts which will not show up in the
graphs, because they happen while the Xymon client is not doing any polling.
The optional TEXT=text setting is used in the summary of the
"procs" status. Normally, the summary will show the
"processname" to identify the process and the related count and
limits. But this may be a regular expression which is not easily recognizable,
so if defined, the text setting string will be used instead. This only
affects the "procs" status display - it has no effect on how the
rule counts or recognizes processes in the "ps" output.
Example: Check that "cron" is running:
PROC cron
Example: Check that at least 5 "httpd" processes are running, but not
more than 20:
PROC httpd 5 20
Defaults:
mincount=1, maxcount=-1 (unlimited), color="red".
Note that no processes are checked by default.
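Example: A sketch using the optional settings; the process names, counts and
TRACK id are illustrative only. Alert if any "telnetd" process is running;
require two to four Java instances matching a regular expression, graphing the
count and showing a readable label; and match "HTTPD" only when it is
upper-case. Quoting the whole "TEXT=..." setting because it contains a blank
mirrors the PORT example elsewhere in this page, and is an assumption for PROC.
# Hypothetical process names and limits
PROC telnetd 0 0 red
PROC %java.*inst[0123] 2 4 yellow TRACK=javainst "TEXT=Java instances"
PROC %(?-i)HTTPD 1 -1 red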
LOG logfilename pattern [COLOR=color] [IGNORE=excludepattern] [OPTIONAL]
The Xymon client extracts interesting lines from one or more logfiles - see the
client-local.cfg(5) man-page for information about how to configure
which logs a client should look at.
The LOG setting determines how these extracts of log entries are
processed, and what warnings or alerts trigger as a result.
"logfilename" is the name of the logfile. Only logentries from this
filename will be matched against this rule. Note that "logfilename"
can be a regular expression (if prefixed with a '%' character).
"pattern" is a string or regular expression. If the logfile data
matches "pattern", it will trigger the "msgs" column to
change color. If no "color" parameter is present, the default is to
go "red" when the pattern is matched. To match against a regular
expression, "pattern" must begin with a '%' sign - e.g.
"%WARNING|NOTICE" will match any lines containing either of these
two words. Note that Xymon defaults to case-insensitive pattern matching; if
that is not what you want, put "(?-i)" between the "%" and
the regular expression to turn this off. E.g. "%(?-i)WARNING" will
match the word WARNING only when it is upper-case.
"excludepattern" is a string or regular expression that can be used to
filter out any unwanted strings that happen to match "pattern".
The OPTIONAL keyword causes the check to be skipped if the logfile does
not exist.
Example: Trigger a red alert when the string "ERROR" appears in the
"/var/log/syslog" file:
LOG /var/log/syslog ERROR
Example: Trigger a yellow warning on all occurrences of the word
"WARNING" or "NOTICE" in the "daemon.log" file,
except those from the "lpr" system:
LOG /var/log/daemon.log %WARNING|NOTICE COLOR=yellow IGNORE=lpr
Defaults:
color="red", no "excludepattern".
Note that no logfiles are checked by default. Any log data reported by a client
will just show up on the "msgs" column with status OK (green).
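Example: A sketch combining the optional settings; the logfile location and
patterns are illustrative only. Go red on an upper-case "FATAL" in any logfile
under a hypothetical /var/log/myapp directory, ignore matching lines that
mention "heartbeat", and skip the check on hosts where the logfile does not
exist:
# Hypothetical logfile location and patterns
LOG %^/var/log/myapp/ %(?-i)FATAL COLOR=red IGNORE=heartbeat OPTIONAL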
FILE filename [color] [things to check] [OPTIONAL] [TRACK]
DIR directoryname [color] [size<MAXSIZE] [size>MINSIZE] [TRACK]
These entries control the status of the "files" column. They allow you
to check on various data for files and directories.
filename and directoryname are names of files or directories, with
a full path. You can use a regular expression to match the names of files and
directories reported by the client, if you prefix the expression with a '%'
character.
color is the color that triggers when one or more of the checks fail.
The OPTIONAL keyword causes this check to be skipped if the file does not
exist. E.g. you can use this to check if files that should be temporary are
not deleted, by checking that they are not older than the max time you would
expect them to stick around, and then using OPTIONAL to ignore the state where
no files exist.
The TRACK keyword causes the size of the file or directory to be tracked
in an RRD file, and presented in a graph on the "files" status
display.
For files, you can check one or more of the following:
- noexist
- triggers a warning if the file exists. By default, a
warning is triggered for files that have a FILE entry, but which do not
exist.
- ifexist
- only checks the file if it exists. If the file is reported
as missing by the client, it is ignored.
- type=TYPE
- where TYPE is one of "file", "dir",
"char", "block", "fifo", or
"socket". Triggers warning if the file is not of the specified
type.
- ownerid=OWNER
- triggers a warning if the owner does not match what is
listed here. OWNER is specified either with the numeric uid, or the user
name.
- groupid=GROUP
- triggers a warning if the group does not match what is
listed here. GROUP is specified either with the numeric gid, or the group
name.
- mode=MODE
- triggers a warning if the file permissions are not as
listed. MODE is written in the standard octal notation, e.g.
"644" for the rw-r--r-- permissions.
- size<MAX.SIZE and size>MIN.SIZE
- triggers a warning if the file size is greater than
MAX.SIZE or less than MIN.SIZE, respectively. For filesizes, you can use
the letters "K", "M", "G" or "T"
to indicate that the filesize is in Kilobytes, Megabytes, Gigabytes or
Terabytes, respectively. If there is no such modifier, Kilobytes is
assumed. E.g. to warn if a file grows larger than 1MB, use
size<1M (or, equivalently, size<1024).
- mtime>MIN.MTIME mtime<MAX.MTIME
- checks how long ago the file was last modified (in
seconds). E.g. to check if a file was updated within the past 10 minutes
(600 seconds): mtime<600. Or to check that a file has NOT been
updated in the past 24 hours: mtime>86400.
- mtime=TIMESTAMP
- checks if a file was last modified at TIMESTAMP. TIMESTAMP
is a unix epoch time (seconds since midnight Jan 1 1970 UTC).
- ctime>MIN.CTIME, ctime<MAX.CTIME,
ctime=TIMESTAMP
- acts as the mtime checks, but for the ctime timestamp (when
the directory entry of the file was last changed, e.g. by chown, chgrp or
chmod).
- md5=MD5SUM, sha1=SHA1SUM, rmd160=RMD160SUM
- and so on for SHA256, SHA512, SHA224, and SHA384;
triggers a warning if the file checksum using the specified message digest
algorithm does not match the one configured here. Note: The
"file" entry in the client-local.cfg(5) file must specify
which algorithm to use, as that is the only one that will be sent.
For directories, you can check one or more of the following:
- size<MAX.SIZE and size>MIN.SIZE
- triggers a warning if the directory size is greater than
MAX.SIZE or less than MIN.SIZE, respectively. Directory sizes are reported
in whatever unit the du command on the client uses - often KB or
diskblocks - so MAX.SIZE and MIN.SIZE must be given in the same unit.
Experience shows that it can be difficult to get these rules right, especially
when defining minimum/maximum values for file sizes, modification times and
the like. The one thing you must remember when setting up these checks is
that the rules describe criteria that must be met - only when they are met
will the status be green.
So "mtime<600" means "the difference between current time and
the mtime of the file must be less than 600 seconds - if not, the file status
will go red".
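Example: A sketch of FILE and DIR rules built from the checks listed above.
All paths, owners and limits are illustrative only. Each check states a
condition that must hold for the status to stay green.
# /etc/passwd must be a regular file owned by root with mode 644
FILE /etc/passwd red type=file ownerid=root groupid=root mode=644
# The logfile must stay below 100 MB and must have been written to
# within the past 10 minutes; graph its size
FILE /var/log/messages yellow size<100M mtime<600 TRACK
# A lockfile may be absent, but if present it must be under an hour old
FILE /var/run/backup.lock yellow mtime<3600 OPTIONAL
# This file must not exist
FILE /etc/nologin red noexist
# Directory size is in the unit reported by "du" on the client
DIR /var/spool/mqueue yellow size<20000 TRACK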
PORT criteria [MIN=mincount] [MAX=maxcount] [COLOR=color] [TRACK=id]
[TEXT=displaytext]
The "netstat" listing sent by the client will be scanned for how many
sockets match the criteria listed. The criteria you can use are:
- LOCAL=addr
- "addr" is a (partial) local address specification
in the format used on the output from netstat.
- EXLOCAL=addr
- Exclude certain local addresses from the rule.
- REMOTE=addr
- "addr" is a (partial) remote address
specification in the format used on the output from netstat.
- EXREMOTE=addr
- Exclude certain remote addresses from the rule.
- STATE=state
- Causes only the sockets in the specified state to be
included. "state" is usually LISTEN or ESTABLISHED but can be
any socket state reported by the client's "netstat" command.
- EXSTATE=state
- Exclude certain states from the rule.
"addr" is typically "10.0.0.1:80" for the IP 10.0.0.1, port
80. Or "*:80" for any local address, port 80. Note that the Xymon
clients normally report only the numeric data for IP-addresses and
port-numbers, so you must specify the port number (e.g. "80")
instead of the service name ("www").
"addr" and "state" can also be a Perl-compatible regular
expression, e.g. "LOCAL=%[.:](80|443)" can be used to find entries
in the netstat local port for both http (port 80) and https (port 443). In
that case, "addr" or "state" must begin with "%" followed by the
regular expression. Anchor the expression with a trailing "$", as in
"LOCAL=%[.:](80|443)$", if it must not also match ports such as 8080.
The socket count found is then matched against the min/max settings defined
here. If the count is outside the thresholds, the color of the
"ports" status changes to "color". To check for a socket
that must NOT exist: Set minimum and maximum to 0.
The optional TRACK=id setting causes Xymon to track the number of sockets
found in an RRD file, and put this into a graph which is shown on the
"ports" status display. The id setting is a simple text
string which will be used as the legend for the graph, and also as part of the
RRD filename. It is recommended that you use only letters and digits for the
ID.
Note that the tracked socket counts are sampled only once per client poll
cycle - i.e. the counts represent snapshots of the system
state, not an average value over the client poll cycle. Therefore there may be
peaks or dips in the actual socket counts which will not show up in the
graphs, because they happen while the Xymon client is not doing any polling.
The TEXT=displaytext option affects how the port appears on the
"ports" status page. By default, the port is listed with the
local/remote/state rules as identification, but this may be somewhat difficult
to understand. You can then use e.g. "TEXT=Secure Shell" to make
these ports appear with the name "Secure Shell" instead.
Defaults: mincount=1, maxcount=-1 (unlimited), color="red". Note: No
ports are checked by default.
Example: Check that the SSH daemon is listening on port 22. Track the number of
active SSH connections, and warn if there are more than 5.
PORT LOCAL=%[.:]22$ STATE=LISTEN "TEXT=SSH listener"
PORT LOCAL=%[.:]22$ STATE=ESTABLISHED MAX=5 TRACK=ssh TEXT=SSH
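Example: A sketch of a rule for a socket that must NOT exist, setting both the
minimum and the maximum to 0 as described above. The port and label are
illustrative only. Go red if anything is listening on the telnet port (23):
PORT LOCAL=%[.:]23$ STATE=LISTEN MIN=0 MAX=0 COLOR=red "TEXT=Telnet listener"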
SVC servicename status=(started|stopped)
[startup=automatic|disabled|manual]
DS column filename:dataset rules COLOR=colorname TEXT=explanation
"column" is the statuscolumn that will be modified.
"filename" is the name of the RRD file holding the data you use for
comparison. "dataset" is the name of the dataset in the RRD file -
the "rrdtool info" command is useful when determining these.
"rules" determine when to apply the override. You can use
">", ">=", "<" or "<=" to
compare the current measurement value against one or more thresholds.
"explanation" is a text that will be shown to explain the override -
you can use some placeholders in the text: "&N" is replaced with
the name of the dataset, "&V" is replaced with the current
value, "&L" is replaced by the low threshold, "&U"
is replaced with the upper threshold.
NOTE: This rule uses the raw data value from a client to examine the rules. So
this type of test is only really suitable for datasets that are of the
"GAUGE" type. It cannot be used meaningfully for datasets that use
"COUNTER" or "DERIVE" - e.g. the datasets that are used to
capture network packet traffic - because the data stored in the RRD for
COUNTER-based datasets undergo a transformation (calculation) when going into
the RRD. Xymon does not have direct access to the calculated data.
Example: Flag the "conn" status as yellow if the response time exceeds 100 msec.
DS conn tcp.conn.rrd:sec >0.1 COLOR=yellow TEXT="Response time &V exceeds &U seconds"
MQ_QUEUE queuename [age-warning=N] [age-critical=N] [depth-warning=N]
[depth-critical=N]
MQ_CHANNEL channelname [warning=PATTERN] [alert=PATTERN]
This is a set of checks for checking the health of IBM MQ message-queues. It
requires the "mq.sh" or similar collector module to run on a node
with access to the MQ "Queue Manager" so it can report the status of
queues and channels.
The MQ_QUEUE setting checks the health of a single queue: You can warn (yellow)
or alert (red) based on the depth of the queue, and/or the age of the oldest
entry in the queue. These values are taken directly from the output generated
by the "runmqsc" utility.
The MQ_CHANNEL setting checks the health of a single MQ channel: You can warn or
alert based on the reported status of the channel. The PATTERN is a normal
pattern, i.e. either a list of status keywords, or a regular expression
pattern.
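Example: A sketch only. The queue and channel names are hypothetical, and it
assumes that the depth values are message counts as reported by "runmqsc" and
that RETRYING and STOPPED are among the channel status keywords your collector
reports. Warn when a queue holds more than 100 messages and alert at 500; warn
on a retrying channel and alert on a stopped one:
# Hypothetical queue and channel names
MQ_QUEUE ORDERS.IN depth-warning=100 depth-critical=500
MQ_CHANNEL TO.PARTNER warning=RETRYING alert=STOPPED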
If you would like to use different defaults for the settings described above,
then you can define the new defaults after a DEFAULT line. E.g. this would
explicitly define all of the default settings:
-
DEFAULT
UP 1h
LOAD 5.0 10.0
DISK * 90 95
MEMPHYS 100 101
MEMSWAP 50 80
MEMACT 90 97
All of the settings can be applied to a group of hosts, by preceding them with
rules. A rule defines one or more filters using these keywords (note that
this is identical to the rule definitions used in the alerts.cfg(5) file).
PAGE=targetstring Rule matching an alert by the name of the page in
Xymon. "targetstring" is the path of the page as defined in the
hosts.cfg file. E.g. if you have this setup:
-
page servers All Servers
subpage web Webservers
10.0.0.1 www1.foo.com
subpage db Database servers
10.0.0.2 db1.foo.com
Then the "All Servers" page is found with PAGE=servers, the
"Webservers" page is PAGE=servers/web and the "Database
servers" page is PAGE=servers/db. Note that you can also use
regular expressions to specify the page name, e.g. PAGE=%.*/db would
find the "Database servers" page regardless of where this page was
placed in the hierarchy.
The top-level page has the fixed name /, e.g. PAGE=/ would match
all hosts on the Xymon frontpage. If you need it in a regular expression, use
PAGE=%^/ to avoid matching the forward-slash present in subpage-names.
EXPAGE=targetstring Rule excluding a host if the pagename matches.
HOST=targetstring Rule matching a host by the hostname.
"targetstring" is either a comma-separated list of hostnames (from
the hosts.cfg file), "*" to indicate "all hosts", or a
Perl-compatible regular expression. E.g.
"HOST=dns.foo.com,www.foo.com" identifies two specific hosts;
"HOST=%www.*.foo.com EXHOST=www-test.foo.com" matches all hosts with
a name beginning with "www", except the "www-test" host.
EXHOST=targetstring Rule excluding a host by matching the hostname.
CLASS=classname Rule match by the client class-name. You specify the
class-name for a host when starting the client through the
"--class=NAME" option to the runclient.sh script. If no class is
specified, the host by default goes into a class named by the operating
system.
EXCLASS=classname Exclude all hosts belonging to "classname"
from this rule.
DISPLAYGROUP=groupstring Rule matching an alert by the text of the
display-group (text following the group, group-only, group-except heading) in
the hosts.cfg file. "groupstring" is the text for the group,
stripped of any HTML tags. E.g. if you have this setup:
-
group Web
10.0.0.1 www1.foo.com
10.0.0.2 www2.foo.com
group Production databases
10.0.1.1 db1.foo.com
Then the hosts in the Web-group can be matched with DISPLAYGROUP=Web, and
the database servers can be matched with DISPLAYGROUP="Production
databases". Note that you can also use regular expressions, e.g.
DISPLAYGROUP=%database. If there is no group-setting for the host, use
"DISPLAYGROUP=NONE".
EXDISPLAYGROUP=groupstring Rule excluding a group by matching the
display-group string.
TIME=timespecification Rule matching by the time-of-day. This is
specified as the DOWNTIME time specification in the hosts.cfg file. E.g.
"TIME=W:0800:2200" applied to a rule will make this rule active only
on week-days between 8AM and 10PM.
EXTIME=timespecification Rule excluding by the time-of-day. This is also
specified as the DOWNTIME time specification in the hosts.cfg file. E.g.
"EXTIME=W:0400:0600" applied to a rule will make this rule inactive on
week-days between 4AM and 6AM. This applies on top of any TIME= specification,
so both must match.
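Example: A sketch combining both keywords on one setting; the host name and
thresholds are illustrative only. The load limits apply on week-days between
8AM and 10PM, except during a batch window from noon to 1PM:
LOAD 8.0 12.0 HOST=db.foo.com TIME=W:0800:2200 EXTIME=W:1200:1300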
For some tests - e.g. "procs" or "msgs" - the right group of
people to alert in case of a failure may be different, depending on which of
the client rules actually detected a problem. E.g. if you have PROC rules for
a host checking both "httpd" and "sshd" processes, then
the Web admins should handle httpd-failures, whereas "sshd" failures
are handled by the Unix admins.
To handle this, all rules can have a "GROUP=groupname" setting. When a
rule with this setting triggers a yellow or red status, the groupname is
passed on to the Xymon alerts module, so you can use it in the alert rule
definitions in alerts.cfg(5) to direct alerts to the correct group of
people.
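Example: A sketch of the httpd/sshd scenario. The group names "webadmins" and
"unixadmins" are hypothetical, and must correspond to whatever the rules in
your alerts.cfg(5) file look for:
PROC httpd 5 20 red GROUP=webadmins
PROC sshd 1 -1 red GROUP=unixadmins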
Rules must be placed after the settings, e.g.
-
LOAD 8.0 12.0 HOST=db.foo.com TIME=*:0800:1600
If you have multiple settings that you want to apply the same rules to, you can
write the rules alone on one line, followed by the settings. E.g.
-
HOST=%db.*.foo.com TIME=W:0800:1600
LOAD 8.0 12.0
DISK /db 98 100
PROC mysqld 1
will apply the three settings to all of the "db" hosts on week-days
between 8AM and 4PM. This can be combined with per-setting rules, in which
case the per-setting rule overrides the general rule; e.g.
-
HOST=%.*.foo.com
LOAD 7.0 12.0 HOST=bax.foo.com
LOAD 3.0 8.0
will result in the load-limits being 7.0/12.0 for the "bax.foo.com"
host, and 3.0/8.0 for all other foo.com hosts.
The entire file is evaluated from the top to bottom, and the first match found
is used. So you should put the specific settings first, and the generic ones
last.
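Example: A sketch of that ordering, with illustrative hosts and limits. It
assumes that when several filters appear on one rule line, all of them must
match. The hosts on the "servers/db" page, except one test host, get their own
limits for /db; everything else falls through to the generic entry at the end:
# Specific rule first
PAGE=servers/db EXHOST=db-test.foo.com
DISK /db 98 100
# Generic rule last
HOST=*
DISK * 90 95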
For the LOG, FILE and DIR checks, it is necessary also to configure the actual
file- and directory-names in the client-local.cfg(5) file. If the
filenames are not listed there, the clients will not collect any data about
these files/directories, and the settings in the analysis.cfg file will be
silently ignored.
The ability to compute file checksums with MD5, SHA1 or RMD160 should not be
used for general-purpose file integrity checking, since the overhead of
calculating these on a large number of files can be significant. If you need
this, look at tools designed for this purpose - e.g. Tripwire or AIDE.
At the time of writing (April 2006), the SHA-1 and RMD160 algorithms are
considered cryptographically safe. The MD5 algorithm has been shown to have
some weaknesses, and is not considered strong enough when a high level of
security is required.
xymond_client(8),
client-local.cfg(5),
xymond(8),
xymon(7)