NAME
watchdog - a software watchdog daemon
SYNOPSIS
watchdog [-F|--foreground] [-f|--force] [-c filename|--config-file filename] [-v|--verbose] [-s|--sync] [-b|--softboot] [-q|--no-action]
DESCRIPTION
The Linux kernel can reset the system if serious problems are detected. This can be implemented via special watchdog hardware, or via a slightly less reliable software-only watchdog inside the kernel. Either way, there needs to be a daemon that tells the kernel the system is working fine. If the daemon stops doing that, the system is reset.
watchdog is such a daemon. It opens /dev/watchdog and keeps writing to it often enough to keep the kernel from resetting, at least once per minute. Each write postpones the reset by another minute. After a minute of inactivity the watchdog hardware will cause the reset. In the case of the software watchdog, the ability to reboot depends on the state of the machine and its interrupts.
The watchdog daemon can be stopped without causing a reboot if the device /dev/watchdog is closed correctly, unless your kernel is compiled with the CONFIG_WATCHDOG_NOWAYOUT option enabled.
TESTS
The watchdog daemon does several tests to check the system status:
- Is the process table full?
- Is there enough free memory?
- Is there enough allocatable memory?
- Are some files accessible?
- Have some files changed within a given interval?
- Is the average work load too high?
- Has a file table overflow occurred?
- Is a process still running? The process is specified by a pid file.
- Do some IP addresses answer to ping?
- Do network interfaces receive traffic?
- Is the temperature too high? (Temperature data not always available.)
- Execute a user defined command to do arbitrary tests.
- Execute one or more test/repair commands found in /etc/watchdog.d. These commands are called with the argument test or repair.
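
As an illustration of one of these tests, the pid file check can be sketched as follows. This is a minimal sketch, not the daemon's own code: the function name process_alive is hypothetical, and probing with signal 0 is a common POSIX technique that is assumed here rather than taken from this page.

```python
import os


def process_alive(pidfile: str) -> bool:
    """Return True if the PID stored in `pidfile` refers to an existing process."""
    try:
        with open(pidfile) as f:
            pid = int(f.read().strip())
    except (OSError, ValueError):
        # A missing, unreadable or malformed pid file counts as a failed check.
        return False
    if pid <= 0:
        # os.kill with pid <= 0 addresses process groups, not a single process.
        return False
    try:
        os.kill(pid, 0)  # signal 0 only checks existence; nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True
```

Called with /var/run/watchdog.pid, for instance, this would report whether a watchdog daemon that wrote that file is still running.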
OPTIONS
Available command line options are the following:
- -v, --verbose
- Set verbose mode. Only implemented if compiled with the SYSLOG feature. This mode logs several pieces of information to LOG_DAEMON with priority LOG_DEBUG. This is useful if you want to see exactly what happened before the watchdog rebooted the system. Currently it logs the temperature (if available), the load average, the change date of the files it checks and how often it went to sleep. Give this option twice to enable more verbose debug messages for testing.
- -s, --sync
- Try to synchronize the filesystem every time the process is awake. Note that the system is rebooted if for any reason the synchronizing lasts longer than a minute.
- -b, --softboot
- Soft-boot the system if an error occurs during the main loop, e.g. if a given file is not accessible via the stat(2) call. Note that this does not apply to the opening of /dev/watchdog and /proc/loadavg, which are opened before the main loop starts. This is now implemented by disabling the error retry timer.
- -F, --foreground
- Run in foreground mode, useful for running under systemd (for example).
- -f, --force
- Force the usage of the interval given or the maximal load average given in the config file. Without this option these values are sanity checked.
- -c config-file, --config-file config-file
- Use config-file as the configuration file instead of the default /etc/watchdog.conf.
- -q, --no-action
- Do not reboot or halt the machine. This is for testing purposes. All checks are executed and the results are logged as usual, but no action is taken. Also your hardware card or the kernel software watchdog driver is not enabled. NOTE: This still allows 'repair' actions to run, but the daemon itself will not attempt a reboot.
- -X num, --loop-exit num
- Run for 'num' loops then exit as if SIGTERM was received. Intended for test/debug (e.g. using valgrind for checking memory access). If the daemon exits on a loop counter and you have the CONFIG_WATCHDOG_NOWAYOUT option compiled for the kernel or device-driver then an unplanned reboot will follow - be warned!
FUNCTION
After watchdog starts, it puts itself into the background and then tries all checks specified in its configuration file in turn. Between each two tests it writes to the kernel device to prevent a reset. After finishing all tests watchdog goes to sleep for some time. The kernel driver expects a write to the watchdog device every minute; otherwise the system is reset. watchdog sleeps for a configured interval that defaults to 1 second to make sure it triggers the device early enough.
Under high system load watchdog might be swapped out of memory and may fail to be swapped back in in time. Under these circumstances the Linux kernel will reset the machine. To avoid unnecessary reboots, set the variable realtime to yes in the configuration file watchdog.conf. This adds real-time support to watchdog: it locks itself into memory, and there should be no problem even under the highest of loads.
On a system running out of memory the kernel tries to free enough memory by killing processes. The watchdog daemon itself is exempt from this so-called out-of-memory killer.
You can also specify a maximal allowed load average. Once this load average is reached the system is rebooted. You may specify maximal load averages for 1 minute, 5 minutes or 15 minutes. By default this test is disabled. Be careful not to set this parameter too low. To set a value less than the predefined minimum of 2, you have to use the -f option.
You can also specify a minimal amount of virtual memory you want to have available as free. As soon as more virtual memory is used, watchdog takes action. Note, however, that watchdog does not distinguish between different types of memory usage; it just checks for free virtual memory.
If you have a machine with temperature sensor(s) you can specify the maximal allowed temperature. Once this temperature is reached on any sensor the system is powered off. The default value is 90 C.
Typically the temperature information is provided by the sensors package as files in the virtual filesystem /sys/device and can be found using, for example, the command
find /sys -name 'temp*input' -print
The limit and the sensor files found this way are then set in the configuration file, for example:
max-temperature = 75
temperature-sensor = /sys/class/hwmon/hwmon0/device/temp1_input
temperature-sensor = /sys/class/hwmon/hwmon0/device/temp2_input
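
The check, keepalive and sleep cycle described in this section can be sketched in a few lines. This is a rough illustration under stated assumptions, not the daemon's implementation: the names read_celsius, too_hot and main_loop are hypothetical, the device is passed in as an already opened file object, and the sensor files are assumed to report millidegrees Celsius, as hwmon temp*_input files do.

```python
import time
from typing import BinaryIO, Callable, Iterable, Sequence


def read_celsius(sensor_path: str) -> float:
    """Read a temp*_input style file (millidegrees Celsius) as degrees Celsius."""
    with open(sensor_path) as f:
        return int(f.read().strip()) / 1000.0


def too_hot(sensor_paths: Iterable[str], max_temperature: float = 90.0) -> bool:
    """True once any sensor has reached the configured limit."""
    return any(read_celsius(p) >= max_temperature for p in sensor_paths)


def main_loop(device: BinaryIO,
              checks: Sequence[Callable[[], bool]],
              interval: float = 1.0,
              loops: int = 1,
              sleep: Callable[[float], None] = time.sleep) -> bool:
    """Run all checks `loops` times; return False at the first failed check."""
    for _ in range(loops):
        for check in checks:
            if not check():
                return False      # a real daemon would repair or reboot here
            device.write(b"\0")   # a write to the device postpones the reset
            device.flush()
        sleep(interval)           # sleep well below the one-minute timeout
    return True
```

Passing the device as a parameter keeps the sketch safe to try with an in-memory buffer such as io.BytesIO. Opening the real /dev/watchdog arms the timer, so that should only be done on a machine that is allowed to reset.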
SOFT REBOOT
A soft reboot (i.e. controlled shutdown and reboot) is initiated for every error that is found. Since there might be no more processes available, watchdog does it all by itself. That means:
1. Kill all processes with SIGTERM.
2. After a short pause kill all remaining processes with SIGKILL.
3. Record a shutdown entry in wtmp.
4. Save the random seed from /dev/urandom. If the device does not exist or there is no filename for saving, this step is skipped.
5. Turn off accounting.
6. Turn off quota and swap.
7. Unmount all partitions.
8. Finally reboot.
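
The first two steps follow a common terminate-then-kill pattern, sketched below. This is an illustration only: the function name terminate_then_kill is hypothetical, and the sketch acts on an explicit list of PIDs, whereas the daemon signals all processes.

```python
import os
import signal
import time
from typing import Iterable


def terminate_then_kill(pids: Iterable[int], pause: float = 1.0) -> None:
    """Send SIGTERM to each PID, wait `pause` seconds, then send SIGKILL."""
    pids = list(pids)
    for pid in pids:
        try:
            os.kill(pid, signal.SIGTERM)   # polite request to exit
        except ProcessLookupError:
            pass                           # already gone
    time.sleep(pause)                      # give processes time to clean up
    for pid in pids:
        try:
            os.kill(pid, signal.SIGKILL)   # cannot be caught or ignored
        except ProcessLookupError:
            pass                           # exited after SIGTERM
```

The pause between the two signals is what gives well-behaved processes a chance to flush their data before the unconditional SIGKILL.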
CHECK BINARY
If the return code of the check binary is not zero watchdog will assume an error and reboot the system. Be careful with this if you are using the real-time properties of watchdog since watchdog will wait for the return of this binary before proceeding. An exit code smaller than 245 is interpreted as a system error code (see errno.h for details). Values of 245 or larger are special to watchdog:
- 255
- (based on -1 as unsigned 8-bit number) Reboot the system. This is not exactly an error message but a command to watchdog. If this code is returned, watchdog reboots without first trying to run a repair script.
- 254
- Reset the system. This is not exactly an error message but a command to watchdog. If this code is returned, watchdog attempts to hard-reset the machine without any orderly stopping of processes, unmounting of file systems, etc.
- 253
- Maximum load average exceeded.
- 252
- The temperature inside is too high.
- 251
- /proc/loadavg contains no (or not enough) data.
- 250
- The given file was not changed in the given interval.
- 249
- /proc/meminfo contains invalid data.
- 248
- Child process was killed by a signal.
- 247
- Child process did not return in time.
- 246
- Free for personal watchdog-specific use (was -10 as an unsigned 8-bit number).
- 245
- Reserved for an unknown result, for example a slow background test that is still running, so the result is neither a success nor an error.
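
The table above can be turned into a small lookup, for example when logging what a check binary returned. This is a sketch: the names SPECIAL_CODES and describe_exit_code are hypothetical, and the descriptions are shortened from the list above.

```python
import os

# Codes of 245 and above, as listed in this section.
SPECIAL_CODES = {
    255: "reboot the system",
    254: "reset the system",
    253: "maximum load average exceeded",
    252: "temperature too high",
    251: "/proc/loadavg contains no (or not enough) data",
    250: "file was not changed in the given interval",
    249: "/proc/meminfo contains invalid data",
    248: "child process was killed by a signal",
    247: "child process did not return in time",
    246: "free for personal watchdog-specific use",
    245: "unknown result",
}


def describe_exit_code(code: int) -> str:
    """Describe a check binary's exit code using the ranges given above."""
    if code == 0:
        return "success"
    if code in SPECIAL_CODES:
        return SPECIAL_CODES[code]
    if 0 < code < 245:
        # Below 245 the code is read as a system error number (errno.h).
        return "system error: " + os.strerror(code)
    raise ValueError(f"exit code out of range: {code}")
```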
REPAIR BINARY
The repair binary is started with one parameter: the error number that caused watchdog to initiate the reboot process. After trying to repair the system the binary should exit with 0 if the system was successfully repaired and thus there is no need to reboot anymore. A return value not equal to 0 tells watchdog to reboot. The return code of the repair binary should be the error number of the error causing watchdog to reboot. Be careful with this if you are using the real-time properties since watchdog will wait for the return of this binary before proceeding.
TEST DIRECTORY
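Executables placed in the test directory are discovered by watchdog on startup and are executed automatically. Their run time is bounded by the test-timeout directive in watchdog.conf. Each executable is called with the argument test to perform a check, or with the argument repair followed by the error code to attempt a repair, for example:
/etc/watchdog.d/my-test test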
/etc/watchdog.d/my-test repair 42
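
A test directory executable that handles both invocations might look like the sketch below. Everything specific in it is an assumption made for illustration: the marker path /var/run/my-service.ok, the function names, and the idea that recreating the marker stands in for a real repair action.

```python
import errno
import os

# Hypothetical file that some service is expected to keep in place.
MARKER = "/var/run/my-service.ok"


def run_test(marker: str = MARKER) -> int:
    """Status for `my-test test`: 0 when healthy, otherwise an errno value."""
    return 0 if os.path.exists(marker) else errno.ENOENT


def run_repair(error_code: int, marker: str = MARKER) -> int:
    """Status for `my-test repair <code>`: 0 if fixed, else the original code."""
    if error_code != errno.ENOENT:
        return error_code          # not a failure this sketch can repair
    try:
        with open(marker, "w"):    # stand-in for a real repair action
            pass
    except OSError:
        return error_code
    return 0


def main(argv: list, marker: str = MARKER) -> int:
    """Dispatch on the first argument, as passed by watchdog."""
    if len(argv) == 2 and argv[1] == "test":
        return run_test(marker)
    if len(argv) == 3 and argv[1] == "repair":
        return run_repair(int(argv[2]), marker)
    return errno.EINVAL            # unexpected invocation

# To use as a script: import sys, then end with sys.exit(main(sys.argv)).
```

Returning the original error code from a failed repair matches the convention described under REPAIR BINARY, where a non-zero return tells watchdog to reboot.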
BUGS
None known so far.
AUTHORS
The original code is an example written by Alan Cox, the author of the kernel driver. All additions were written by Michael Meskes. Johnie Ingram had the idea of testing the load average. He also took over the Debian-specific work. Dave Cinege brought up some hardware watchdog issues and helped with testing.
FILES
- /dev/watchdog
- The watchdog device.
- /var/run/watchdog.pid
- The pid file of the running watchdog.
SEE ALSO
watchdog.conf(5)
February 2019 | 4th Berkeley Distribution