watchdog.conf - configuration file for the watchdog daemon
This file carries all configuration options for the Linux watchdog daemon. Each
option has to be written on a line for itself. Comments start with '#'. Blanks
are ignored except after the '=' sign. An empty text after the '=' sign
disables the feature as long as that makes sense.
- interval = <interval>
- Set the highest possible interval between two writes to the
watchdog device. The device is triggered after each check regardless of
the time it took. After finishing all checks watchdog goes to sleep for a
full cycle of <interval> seconds. Default value is 1 second. The
kernel drivers typically expects a write command every minute otherwise
the system will be rebooted. Therefore an interval of more than a minute
can only be used with the force command-line option [--force | -f].
- logtick = <logtick>
- If you enable verbose logging, a message is written into
the syslog or a logfile. While this is nice, it is not necessary to get a
message every interval which really fills up disk and needs CPU. logtick
allows adjustment of the number of intervals skipped before a log message
is written. If you use logtick = 60 and interval = 10, only every 10
minutes (600 seconds) a message is written. This may make the exact time
of a crash harder to find but greatly reduces disk usage and administrator
nerves if you're looking for a particular syslog entry in between of
watchdog messages.
- max-load-1 = <load1>
- Set the maximal allowed load average for a 1 minute span.
Once this load average is reached the system is rebooted. Default value is
0. That means the load average check is disabled. Be careful not to set
this parameter too low. To set a value less then the predefined minimal
value of 2, you have to use the -f command line option.
- max-load-5 = <load5>
- Set the maximal allowed load average for a 5 minute span.
Once this load average is reached the system is rebooted. Default value is
3/4*max-load-1. Be careful not to this parameter too low. To set a value
less then the predefined minimal value of 2, you have to use the -f
command line option.
- max-load-15 = <load15>
- Set the maximal allowed load average for a 15 minute span.
Once this load average is reached the system is rebooted. Default value is
1/2*max-load-1. Be careful not to this parameter too low. To set a value
less then the predefined minimal value of 2, you have to use the -f
command line option.
- min-memory = <minpage>
- Set the minimal amount of memory that has to stay free.
Note that this is in memory pages (4kB on x86). Default value is 0 pages
which means this test is disabled. The page size is taken from the system
include files. The usable memory is computed from MemFree + Buffers +
Cached since buffer and cache use typically expand to use most free memory
but the kernel will reclaim this as needed. NOTE: If this measure gets
below a few tens of MB then the system will page swap aggressively have
poorer file system performance due to the lack of caching. This is a
'passive' test and works by reading /proc/meminfo
- allocatable-memory = <minpage>
- Set the minimum amount of allocatable memory available on
the system. Note that this is in pages. Default value is 0 pages which
means the test is disabled. As with min-memory, the page size is taken
from the system include files. This is an 'active' test and it works by
attempting to memory-map a block of the configured size.
- max-swap = <maxpage>
- Set the maximum amount of swap use. Note that this is in
memory pages (4kB on x86). Default value is 0 pages which means this test
is disabled. Often this should be a large portion of available swap, but
remember that paging 1GB of swap can take several/tens of seconds. This is
a 'passive' test and works by reading /proc/meminfo
- watchdog-device = <device>
- Set the watchdog device name, typically /dev/watchdog.
Default is to disable keep alive support. This should be tested by running
the daemon from the command line before configuring it to start
automatically on booting.
- watchdog-refresh-use-settimeout = <auto|yes|no>
- Refresh watchdog timer by setting its timeout instead of
using a normal watchdog refresh operation. Might help if your watchdog
trips by itself when the first timeout interval elapses. Default is 'auto'
for IT87 fix-up but this can be disabled with 'no' or forced for other
modules with 'yes'.
- watchdog-refresh-ignore-errors = <yes|no>
- Ignore errors reported by writing to the watchdog device.
Typically this is used for systems that have broken implementations of the
IPMI driver to avoid a reboot loop.
- watchdog-timeout = <timeout>
- Set the watchdog device timeout during startup. If not set,
a default is used that should be set to the kernel timer margin at compile
time.
- temperature-sensor = <temp-virtual-file>
- Set the temperature sensor name. This is normally a
'virtual file' under /sys and it contains the temperature in
milli-Celsius. Usually these are generated by the sensors package,
but take care as device enumeration may not be fixed. Default is to
disable temperature checking. Multiple sensors can be used by having
repeated temperature-sensor entries. Due to the enumeration problem any
missing temp sensor is simply ignored and not treated as a reboot
trigger.
- max-temperature = <temp>
- Set the maximal allowed temperature in Celsius. Once this
temperature is reached the system is stopped. Default value is 90 C.
Watchdog will issue warnings once the temperature increases 90%, 95% and
98% of this temperature.
- temp-power-off = <yes|no>
- Set the watchdog action on overheating. Yes option
(default) is to power the machine off, no option is to halt machine and
allow Ctrl-Alt-Del reboot.
- file = <filename>
- Set file name for file mode. This option can be given as
often as you like to check several files.
- change = <mtime>
- Set the change interval time for file mode. This options
always belongs to the active filename, that is when finding a 'change ='
line watchdog assumes it belongs to the most recently read 'file =' line.
They don't necessarily have to follow each other directly. But you cannot
specify a 'change =' before a 'file ='. The default is to only stat the
file and don't look for changes. Using this feature to monitor changes in
/var/log/messages might require some special syslog daemon configuration,
e.g. rsyslog needs "$ActionWriteAllMarkMessages on" to be set to
make sure the marks are written no matter what.
- pidfile = <pidfilename>
- Set pidfile name for daemon test mode. This option can be
given as often as you like to check several daemons, assuming they write
their post-forking PID to the specified files.
- ping = <ip-addr>
- Set IPv4 address for ping mode. This option can be used
more than once to check different connections.
- ping-count = <ping-per-interval>
- Set the number of ping attempts in each 'interval' of time.
Default is 3 and it completes on the first successful ping.
- interface = <if-name>
- Set interface name for network mode. This option can be
used more than once to check different interfaces. Note it is only
possible to check physical interfaces, and not aliased IP interfaces.
- test-binary = <testbin>
- Execute the given binary to do some user defined
tests.
- test-timeout = <timeout in seconds>
- User defined tests may only run for <timeout>
seconds. Set to 0 for unlimited.
- repair-binary = <repbin>
- Execute the given binary in case of a problem instead of
shutting down the system.
- repair-timeout = <timeout in seconds>
- repair command may only run for <timeout> seconds.
Set to 0 for 'unlimited', but note that the hardware timer is not
refreshed in this case so the system will hard-reset at some point.
- retry-timeout = <timeout in seconds>
- Allow most error conditions to persist for <timeout>
seconds. Set to 0 for immediate action (like softboot behaviour).
- repair-maximum = <count>
- This allows no more then <count> repair attempts
against a given fault that report success (i.e. return 0), but fail to
clear the fault, before a reboot is initiated anyway. If set to zero then
a repairable fault can always be blocked by a repair program reporting
success (previous daemon behaviour).
- softboot-option = <yes|no>
- This acts like the -b / --softboot command line and simply
sets the retry timeout to zero.
- admin = <mail-address>
- Email address to send admin mail to. That is, who shall be
notified that the machine is being halted or rebooted. Default is 'root'.
If you want to disable notification via email just set admin to en empty
string.
- realtime = <yes|no>
- If set to yes watchdog will lock itself into memory so it
is never swapped out.
- priority = <schedule priority>
- Set the schedule priority for realtime mode passed to
sched_setscheduler().
- test-directory = <test directory>
- Set the directory to run user test/repair scripts. Default
is '/etc/watchdog.d' See the Test Directory section in watchdog(8) for
more information.
- log-dir = <log directory>
- Set the log directory to capture the standard output and
standard error from repair-binary and test-binary execution. Default is
'/var/log/watchdog'.
- sigterm-delay = <time in seconds>
- Set the time on shut down between first sending SIGTERM to
all processes, and then sending SIGKILL. Default is 5 seconds which is
generally enough, but systems with large databases or virtual machines
might need longer.
- verbose = <level>
- This overrides the command line --verbose option. Generally
the verbose mode is only enabled for debugging as it creates a lot of
syslog chatter, so use this option with consideration. Zero is
"normal" operation (quiet), while 1 is typically used for
debugging. Values of 2 or more usually generate far too many
messages.
- heartbeat-file = <filename>
- For debugging this allows a rolling set of status values to
be kept on disk
- heartbeat-stamps = <interval>
- For debugging this sets the number of entries in the
<heartbeat-file>
- log-killed-pids = <yes|no>
- This acts like enabling 'verbose' logging, but only for a
system reboot, where it enables the logging of the PID values for all
processes that are being killed. The results are written to the
killall5.log file in the log directory (if at all possible) in this case.
Intended for debugging cases where you would like to know what was running
at the point the machine triggered the watchdog, but don't want syslog
filling up with the usual chatter of activity.
- /etc/watchdog.conf
- The watchdog configuration file
- /etc/watchdog.d
- A directory containing test-or-repair commands. See the
Test Directory section in watchdog(8) for more information.
watchdog(8)