Monit restart not working on CentOS – no pid-file after restart

Monit is a monitoring tool that can monitor processes, files and more. When a check fails, monit can send an alert or trigger an action, and often that action is to restart the monitored daemon. The version currently in the CentOS repository (version 5.5) does not properly restart many daemons. In the following I will explain the reason as well as a possible solution.

The problem

When monit detects a service as down, or when monit is called with the “restart” option, it executes the stop command from the init script (as configured in the “stop program”) and watches the PID until the process has terminated. As soon as the process is gone, monit executes the start command from the init script (as configured in the “start program”). Monit does not wait for the stop script to finish. With some daemons this causes problems.

Most init scripts send a signal to the daemon to terminate it and then wait for a fixed time before they check that it has ended. During this wait, monit notices that the daemon is gone and already starts it again by executing the “start program” command, which writes a new PID-file. Once the fixed time has passed, the init script executed with “stop” checks the old PID again, finds that the process has ended and deletes the PID-file. As it is not aware of the “start” that has already been executed, it deletes the PID-file that the “start” command has just written.

The result of this race condition is a running daemon without a PID-file. Monit detects the missing PID-file and triggers another restart, which fails because the init script needs the PID-file to stop the daemon.
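The sequence can be reproduced without monit or a real daemon. The following is only a simulation under simplifying assumptions: fake_start and fake_stop are made-up stand-ins for an init script, a sleep process plays the daemon, and the fixed wait is shortened to one second.

```shell
#!/bin/sh
# Simulation only: fake_start/fake_stop are hypothetical stand-ins for an
# init script, and a sleep process plays the role of the daemon.
PIDFILE=$(mktemp)

fake_start() {
    sleep 30 >/dev/null 2>&1 &      # the "daemon"
    echo "$!" > "$PIDFILE"          # write its PID-file
}

fake_stop() {
    kill "$(cat "$PIDFILE")"        # signal the daemon named in the PID-file
    sleep 1                         # fixed wait before the final cleanup
    rm -f "$PIDFILE"                # clean up, unaware of any newer start
}

fake_start
old_pid=$(cat "$PIDFILE")

fake_stop &                         # monit runs the stop command ...
stop_job=$!
wait "$old_pid" 2>/dev/null         # ... sees the old process disappear ...
fake_start                          # ... and starts the daemon right away
new_pid=$(cat "$PIDFILE")

wait "$stop_job"                    # the stop script finishes afterwards

if kill -0 "$new_pid" 2>/dev/null && [ ! -e "$PIDFILE" ]; then
    result="running without PID-file"
else
    result="no race observed"
fi
echo "$result"
kill "$new_pid" 2>/dev/null         # clean up the simulated daemon
```

The second fake_start writes its PID-file while fake_stop is still sleeping, so the cleanup at the end of fake_stop removes the new file and the script should report a running process without a PID-file.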

As a side effect, it is not possible to end the daemon by executing the init script with “stop”. Most of the time it is necessary to terminate the daemon manually by obtaining the PID from the process list.
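Finding the PID by hand could look roughly like the sketch below. The helper name pids_by_name is made up for this post, and sshd is only the example daemon. The parent PID is printed as well, because a daemon like sshd also forks children under the same name.

```shell
#!/bin/sh
# Print "PID PPID" for every process whose command name is exactly $1.
# Uses only POSIX ps options; the function name is a made-up helper.
pids_by_name() {
    ps -e -o pid= -o ppid= -o comm= |
        awk -v name="$1" '$3 == name { print $1, $2 }'
}

# Usage (as root):
#   pids_by_name sshd
#   kill <PID>        # the daemon itself, not a child serving a session
```

After the daemon has been terminated this way, the init script can start it again and a fresh PID-file is written.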

The workaround

The version 5.8.1 (the latest at the time of writing) supports a configuration item named “restart program” which can be used to execute the init script with the “restart” option. This way the init script takes care of the correct stop and start timing and the daemon is properly restarted.

Sadly this version is not yet in the repositories. The version 5.5 provided by the repository at the time of writing does not support the “restart program” configuration option. Because of that, I decided, against my usual practice, to install monit 5.8.1 from the binary package instead of the repository.

$ # for 64 bit systems
$ wget http://mmonit.com/monit/dist/binary/5.8.1/monit-5.8.1-linux-x64.tar.gz
$ # for 32 bit systems
$ wget http://mmonit.com/monit/dist/binary/5.8.1/monit-5.8.1-linux-x86.tar.gz
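If you are not sure which of the two packages applies, the machine type reported by uname -m can decide. This is a small sketch, the function name monit_pkg_arch is made up, and it only knows the two builds linked above.

```shell
#!/bin/sh
# Map a machine type (as printed by `uname -m`) to the suffix used in the
# package file names above. Only the two builds named in this post are covered.
monit_pkg_arch() {
    case "${1:-$(uname -m)}" in
        x86_64)
            echo "x64" ;;
        i386|i486|i586|i686)
            echo "x86" ;;
        *)
            echo "no package listed here for this machine type" >&2
            return 1 ;;
    esac
}

# Usage:
#   arch=$(monit_pkg_arch) &&
#       wget "http://mmonit.com/monit/dist/binary/5.8.1/monit-5.8.1-linux-${arch}.tar.gz"
```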

After downloading the package, it needs to be unpacked and moved into place.

$ tar -xzf monit-5.8.1-linux-*
$ mv -f monit-5.8.1/bin/monit /usr/bin/monit
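As the mv above overwrites the binary installed from the repository, you may want to keep a copy of the old one. A possible variant is sketched below. The function name and the .bak suffix are arbitrary choices of mine, not something monit requires.

```shell
#!/bin/sh
# Copy a new binary into place, keeping the previous file as <dest>.bak.
# The .bak suffix is an arbitrary choice for this sketch.
install_with_backup() {
    src=$1
    dest=$2
    if [ -e "$dest" ]; then
        cp -p "$dest" "$dest.bak" || return 1
    fi
    install -m 0755 "$src" "$dest"
}

# Usage (as root):
#   install_with_backup monit-5.8.1/bin/monit /usr/bin/monit
#   monit -V     # prints the version of the binary now in place
```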

If there is a monit configuration from the repository version on the system, it has to be renamed, as the CentOS package uses a slightly different configuration file name (/etc/monit.conf) than the binary release (/etc/monitrc).

$ mv /etc/monit.conf /etc/monitrc

If there is no monit configuration file on the system, the following configuration can be used as a starting point.

$ mv monit-5.8.1/conf/monitrc /etc/monitrc
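Monit refuses to start if its control file is accessible by group or others, so it is worth checking the permissions after moving the file. The check below is a sketch. The function name is made up and it simply inspects the mode string printed by ls.

```shell
#!/bin/sh
# Succeeds if group and others have no access to the file, which monit
# demands of its control file. Reads the mode string from ls -ld:
# characters 5-10 are the group and other permission bits.
control_file_is_private() {
    [ "$(ls -ld "$1" | cut -c5-10)" = "------" ]
}

# Usage (as root):
#   control_file_is_private /etc/monitrc || chmod 600 /etc/monitrc
#   monit -t     # syntax check of the control file
```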

The configuration of monit is the same as with the version 5.5 provided by the CentOS repository, with the addition of the “restart program” configuration item, which resolves the issue described above. A simple example configuration for the ssh daemon would look like this.

check process sshd with pidfile /var/run/sshd.pid
    start program = "/etc/init.d/sshd start"
    stop program = "/etc/init.d/sshd stop"
    restart program = "/etc/init.d/sshd restart"
    if failed port 22 protocol ssh for 2 cycles then restart
    if 5 restarts within 5 cycles then timeout

This way, when “monit restart” is executed, the restart command is run instead of the stop and start commands, and the control of the timing is given to the init script.
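To confirm that a restart really leaves a valid PID-file behind, the file can be compared against the process list. The helper below is a sketch with a made-up name. The 30 second pause in the usage lines is an assumption, as monit carries out the restart in the background and the time needed depends on your setup.

```shell
#!/bin/sh
# Succeeds if the PID-file exists, is not empty and names a live process.
# kill -0 sends no signal; it only tests whether the PID can be signalled,
# so run this as root for daemons owned by other users.
pidfile_is_live() {
    [ -s "$1" ] || return 1
    pid=$(cat "$1") || return 1
    kill -0 "$pid" 2>/dev/null
}

# Usage (as root):
#   monit restart sshd
#   sleep 30     # assumption: enough time for monit to carry out the restart
#   pidfile_is_live /var/run/sshd.pid && echo "sshd has a valid PID-file"
```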

Personally I think this way of resolving the issue is not a real solution but more a workaround. Sadly the developer of monit seems to see this a bit differently.


Read more of my posts on my blog at http://blog.tinned-software.net/.
