Devuan Documentation

Installing Runit as a Supervisor on Devuan Jessie

Version 20170721_1320

Scope of this Document

This document walks you through installing the runit init system as a process supervisor on Devuan Jessie, with sysvinit's PID1 remaining as PID1. This is a logical first step in completely replacing sysvinit with runit, and it also cures 90% of the perceived "ills" of sysvinit. I would not be the slightest bit ashamed to permanently run sysvinit's PID1 and early boot, combined with runit as a process supervisor.

What this doc doesn't cover is actually initting with runit (PID1 and early boot (stage1)), running daemons as users other than root, and using s6 instead of runit. Those subjects will be covered in later documentation.

Note to This Documentation's Maintainer

By Steve Litt

You can skip this section if you're not this documentation's maintainer or partial author.

This document has been constructed from the ground up for clarity, understandability and consistency. This is a styled xhtml document so that each element type has its own style, whose appearance can be formatted as desired by changing its definition in this doc's head.

Please do not change this document's format to Markdown or Asciidoc or any kind of simple markup language: Doing so will lose metadata, decreasing clarity. Please don't convert it to regular HTML: Its current Xhtml format makes it much easier to check for well-formedness and browser portability, using xmlchecker.py.

And for gosh sakes, never use a "WYSIWYG" web authoring tool on this document. Doing so would change the structural format of this document from simple, rigorous and portable to complicated, sloppy, non-portable and ultimately non-maintainable.

By spending 20 minutes looking at the styles in this document's <head/>, and seeing how they're used in this document, you'll understand just how simple it is to maintain this document the right way.

The best tool for authoring this document is the Bluefish Editor with Zencoding, with quality control via xmlchecker.py. The best graphic editor is Inkscape.

Do yourself and your fellow Devuan people a favor and never have two people editing this document at once. Just because Git allows you to do this doesn't make it optimal: unless they are very careful, multiple authors will step all over each other's work, in spite of Git's attempts to iron out the differences.

Basic Installation

The purpose of the basic installation is to get all of runit's executables in the right place on the computer. Note that different people and distros argue about what is the "right place."

CAUTION:

Almost all the steps in this document should be done while logged in as user root. Be root unless told specifically not to be.

Make a Network Tester

Unlike sysvinit, which uses execution order to start needed processes before their dependees, or systemd and upstart, which build dependency trees, runit and other process supervisors like it put tests in their run scripts so a dependee doesn't start unless its needed processes are already running. The runit way has both advantages and disadvantages.

The disadvantage is that the runit way can run various run scripts kind of like a field of mouse traps, each throwing marbles to set off other mousetraps. It sounds like a mess, but in fact it usually isn't. Another disadvantage is that you have to code tests into your run scripts. Well yeah, that adds a few lines of code, but I don't think I've ever seen a runit run script go over 20 lines of code, so what the heck?

The advantage is that with runit, you decide what is meant by a needed process being "ready". You don't need to trust a dbus message saying it's ready, hoping the daemon's author chose the right time to contact dbus. You don't need to wait for a daemon to "put itself in the background" and return, hoping the daemon's author chose the right time to background it. You create the test.

For instance, if you want to test tcp/ip connectivity on your LAN, you could ping the address of an always-up machine on the LAN. If you want to make sure your network device is configured properly, you could use the output of ip link, ip addr, and ip route to do so. If you want to make sure you have both tcp/ip connectivity and DNS over the Internet, you could use the nc command to see if google.com responds on port 80.

What you're going to do is create a shellscript called netisup.sh, which tests an IP address and port combination, returning 0 if that port responds on that IP, and returning 1 if it doesn't (or if the IP address isn't running). This gives you a wide variety of tests you can do, and unlike the ping command, it also works when a virtual machine guest queries the Internet. The following is the code comprising netisup.sh:

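#!/bin/sh
# Exit 0 if port $2 answers on host $1, nonzero otherwise.
# A script must use exit here; return is only valid inside a function.
nc -w2 -z "$1" "$2" 2> /dev/null
exit $?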

Be sure to put it on the executable path. I recommend /usr/local/bin unless you have a place for executables that's on the path and resides in your data, so that it survives complete reinstalls.

The following are some examples:
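
Here is a minimal sketch of how the tester might be called. To keep it self-contained, the netisup function is a stand-in that makes the same nc call as netisup.sh, report is a hypothetical helper, and 192.168.0.1 is a hypothetical always-up machine on the LAN. Substitute addresses that exist on your network.

```shell
#!/bin/sh
# Stand-in for netisup.sh so this sketch runs on its own.
# Returns 0 if port $2 answers on host $1, nonzero otherwise.
netisup() {
    nc -w2 -z "$1" "$2" 2>/dev/null
}

# Print a one-line verdict for a host and port.
report() {
    if netisup "$1" "$2"; then
        echo "$1 port $2: up"
    else
        echo "$1 port $2: down"
    fi
}

report 192.168.0.1 22   # hypothetical LAN machine, SSH port
report 8.8.8.8 53       # Internet reachability by IP address, no DNS needed
```

Testing a host name such as google.com on port 80 instead of a bare IP address additionally proves that DNS works.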

Bare Essentials of How Runit Works

You don't have to read this section, but if you don't read this section and then have to troubleshoot, you're dead meat till you read this section.

Bare Bones Narrative

The supervision part of runit is a process tree. At the very top of the tree is the runsvdir program, which iterates through the link directory (/etc/svlnk in this case), looking for symlinks to directories.

For each symlinked directory found, it runs the runsv program on that directory link. So now you have runsvdir as the direct parent of zero or more runsv programs.

Each runsv program that runs executes the run script in its directory, which does a few preliminaries and then replaces itself with the daemon to be run, via a shellscript exec statement.

If a runsv program finds a subdirectory called log within its directory, then it runs the run script inside that log directory, creating a second daemon that takes care of all logging.
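
The layout described above can be sketched as follows. This builds a throwaway copy under a scratch directory, so nothing in /etc is touched. The service name mydaemond and its /usr/sbin/mydaemond binary are hypothetical, and the run scripts are the barest possible examples, not the ones used later in this document.

```shell
#!/bin/sh
# Build the layout under a scratch directory instead of the real /etc.
root=$(mktemp -d)
svdir=$root/etc/sv/mydaemond
lnkdir=$root/etc/svlnk

mkdir -p "$svdir/log/main" "$lnkdir"

# The service's run script: preliminaries, then exec the daemon.
cat > "$svdir/run" <<'EOF'
#!/bin/sh
exec 2>&1
exec /usr/sbin/mydaemond
EOF

# The log's run script: svlogd writes timestamped lines into ./main.
cat > "$svdir/log/run" <<'EOF'
#!/bin/sh
exec svlogd -tt ./main
EOF

chmod +x "$svdir/run" "$svdir/log/run"

# The symlink is what runsvdir finds when it scans the link directory.
ln -s "$svdir" "$lnkdir/mydaemond"

# Show the resulting tree.
(cd "$root" && find etc | sort)
```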

The following is a process hierarchy representing who runs what:

16822 30290  \_ runsvdir /etc/svlnk
30290 30291      \_ runsv sshd
30291 26671      |   \_ svlogd -ttv /etc/sv/sshd/log/main
30291 26672      |   \_ /usr/sbin/sshd -D
30290 29431      \_ runsv ntpd
29431 29432          \_ svlogd -ttv /etc/sv/ntpd/log/main
29431 29433          \_ /usr/sbin/ntpd -d
29433 29436              \_ /usr/sbin/ntpd -d

NOTE:

In the preceding, one ntpd forks the other one. This is a function of ntpd, not of runit.

The Devil's Details

Things aren't quite as simple as a quick read of the preceding section would seem. There are some details.

When runsv already exists

The runsvdir program keeps scanning the link directory for directory links, running runsv on each, but after the first post-boot spin, each almost always has an existing runsv. So its runsv is queried. If runsv shows it already has the daemon running, runsvdir does nothing. If the daemon isn't running, runsv is queried to see if it's not running because the admin ran sv down on the directory, and if so, runsvdir does nothing. Otherwise, runsvdir tells the existing runsv to rerun the daemon.

Log file start and stop

When runsvdir starts, or when a new directory link is made in the link directory, runsvdir starts runsv, which first runs the directory's log directory if one exists, and then runs the daemon itself. This way, the log is running in time to catch the first output of the daemon.

When runsvdir stops, or when a directory link is deleted in the link directory, that directory's daemon and log are stopped within a few seconds. This is the wrong way to stop a daemon. The right way is as follows:

sv down ntpd ntpd/log

The preceding shuts down the daemon and its log, but leaves its directory's runsv still running. To kill the runsv, so that this daemon will not be run on reboot, perform the following additional command:

rm /etc/svlnk/ntpd

To restore this daemon so it and its log start now and will start on future reboots, perform the following command:

ln -s /etc/sv/ntpd /etc/svlnk/ntpd
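
The rm and ln commands above can be wrapped as a pair of small functions. This is only a sketch and not part of runit: svc_disable and svc_enable are hypothetical names, and SRCDIR and LNKDIR are assumed variables that default to the directories this document uses.

```shell
#!/bin/sh
SRCDIR=${SRCDIR:-/etc/sv}
LNKDIR=${LNKDIR:-/etc/svlnk}

# Remove the symlink so runsvdir stops supervising the service.
svc_disable() {
    rm -f "$LNKDIR/$1"
}

# Link the service directory back in. Refuse names that have no
# service directory, and do nothing if the link already exists.
svc_enable() {
    if [ ! -d "$SRCDIR/$1" ]; then
        echo "No service directory $SRCDIR/$1" >&2
        return 1
    fi
    [ -L "$LNKDIR/$1" ] || ln -s "$SRCDIR/$1" "$LNKDIR/$1"
}
```

With these defined, svc_disable ntpd (run after sv down ntpd ntpd/log) takes ntpd out of supervision, and svc_enable ntpd puts it back.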

Temporarily upping and downing a daemon

You use sv up and sv down to start and stop daemons and their logs. For instance, the following command stops the ntpd daemon but leaves its log file running:

sv down ntpd

Often this is what you want, because a running log consumes almost no resources and carries almost no other disadvantages. If you want to shut down the daemon and its log, use the following command to shut down the daemon before the log, so the log catches everything:

sv down ntpd ntpd/log

When bringing it back up, start the log first so the log catches the very beginning of daemon startup:

sv up ntpd/log ntpd

Always remember, when using the sv command to up and down daemons and their logs, you must specifically address both the daemon and the log. But when services are started by a bootup, or by the runsvdir program starting, or by a new directory symlink linked into the link directory, the log and the daemon are brought up as a package deal, log first.
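
If you find the ordering hard to remember, it can be captured in two tiny wrappers. This is a sketch under assumptions: svc_stop and svc_start are hypothetical names, and the SV variable is a hypothetical hook that lets you substitute echo for sv to see what would be run.

```shell
#!/bin/sh
# SV is the control command. Set SV=echo for a dry run.
SV=${SV:-sv}

# Stop the daemon first, then its log, so the log catches the shutdown.
svc_stop() {
    $SV down "$1" "$1/log"
}

# Start the log first, then the daemon, so the log catches the startup.
svc_start() {
    $SV up "$1/log" "$1"
}
```

For example, svc_stop ntpd runs sv down ntpd ntpd/log, and svc_start ntpd runs sv up ntpd/log ntpd.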

WARNING: Persistence, State and Intermittence

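Runit keeps a heck of a lot of persistent state info in the following three locations, assuming the daemon and its directory are both called mydaemond:

  1. /etc/sv/mydaemond/supervise
  2. /etc/sv/mydaemond/log/supervise
  3. /etc/sv/mydaemond/log/main/lock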

This persistent state information can cause wildly intermittent symptoms, head-scratching behavior, and occasionally long, drawn out troubleshooting. Whenever things start getting weird, you need to get rid of all sources of persistence by deleting the lock file and both the supervise trees, after turning off the daemon and its log.
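
Before deleting anything, it can help to see which of these pieces of state actually exist. The following is a minimal sketch: show_state is a hypothetical helper, SRCDIR is an assumed variable defaulting to /etc/sv, and the three paths are the same ones the reset script in the next subsection removes.

```shell
#!/bin/sh
SRCDIR=${SRCDIR:-/etc/sv}

# Print each persistent-state location that currently exists for service $1.
show_state() {
    for path in \
        "$SRCDIR/$1/supervise" \
        "$SRCDIR/$1/log/supervise" \
        "$SRCDIR/$1/log/main/lock"
    do
        [ -e "$path" ] && echo "$path"
    done
    return 0
}
```

Running show_state mydaemond prints one line per leftover, and prints nothing when the slate is clean.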

A State Smashing Shellscript

Depending on how much persistent state impinges on troubleshooting you need to do, things might go faster if you have a shellscript (call it reset_mydaemon.sh), to get rid of all the state and restart the daemon. The following seems to be a pretty good script that handles errors and gets timings right every time:

#!/bin/sh
daemonname=$1

# Test syntax
if test "$daemonname" = ""; then
   echo "Syntax is reset_mydaemon.sh daemon_name" >&2
   exit 1
fi

# Directory names
srcdir=/etc/sv
lnkdir=/etc/svlnk
symlink=$lnkdir/$daemonname

# Test for wrong/no such daemonname
if test "$symlink" = "$lnkdir/" -o ! -r "$srcdir/$daemonname"; then
   echo Bad daemon name $symlink
   exit 1
fi

# Down service, log, and remove symlink
echo
echo Downing service and any log
sv down $symlink $symlink/log

echo Removing $symlink to take down runsv
rm $symlink
while /bin/true; do
  if ps axo pid,cmd | grep "runsv $daemonname$"; then
	echo -n "Waiting for runsv to terminate...  "
	sleep 1;
  else
	sleep 1;
	echo
  	break;
  fi
done

# Remove everything keeping persistent state
echo Removing all persistent state
cd $srcdir/$daemonname
rm -rf $srcdir/$daemonname/log/supervise
rm -rf $srcdir/$daemonname/supervise
rm $srcdir/$daemonname/log/main/lock

# Start up the service
echo
echo Replacing $symlink to run runsv
ln -s $srcdir/$daemonname $symlink
echo
while /bin/true; do
  if ! ps axo pid,cmd | grep "runsv $daemonname$"; then
	echo "Waiting for runsv to come online...  "
	sleep 1;
  else
	sleep 1;
  	break;
  fi
done

# Show results
echo "Here's what's running: PPID, PID and CMD"
ps axfo ppid,pid,cmd | grep -v grep | \
  grep -e runsvdir -e $daemonname

A surprise persisting state issue can add hours to your troubleshooting. This State Smasher Shellscript isn't perfect or risk free, but personally, on anything but an important production machine, I'd use it early and often.

Move sshd from sysvinit to runit

As a proof of concept, we'll move the SSH daemon, sshd, from sysvinit to runit. By the end of this section, the SSH daemon is supervised by runit. As time goes on, you can move other important daemons to runit. The beauty of running them from runit is:

Disable sysvinit's Running of sshd

You don't want the sshd daemon running twice (once by sysvinit and once by runit), so you must disable its starting in sysvinit.

WARNING!

Right now back up file /etc/init.d/ssh. The sshd command from this script will be consulted when you create your runit run script.

If you can ONLY access this machine with ssh

Be careful. If you kill all instances of sshd, you won't be able to get back into this machine. So (almost) disable sshd by placing the following two lines immediately below the shebang (#!/bin/sh) of /etc/init.d/ssh:

/usr/sbin/sshd -p 54345
exit 0

If you can access this machine directly, without ssh

This is much easier. Disable sshd by placing the following line immediately below the shebang (#!/bin/sh) of /etc/init.d/ssh:

exit 0

Yes, this was a kludge

Obviously there are more idiomatic Devuan ways to disable sshd. Just be sure that whatever disablement you use prevents sysvinit from starting an sshd on port 22 at boot time, and make sure the sysvinit-started sshd is not running before installing it in runit.
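
One such approach is sketched below, assuming the stock sysvinit tools update-rc.d and service are present on your install. The run helper and the DRY_RUN variable are hypothetical conveniences: by default the script only prints what it would do, so you can review the commands before running them for real with DRY_RUN=0.

```shell
#!/bin/sh
# With DRY_RUN=1 (the default here) commands are printed, not executed.
DRY_RUN=${DRY_RUN:-1}

run() {
    echo "+ $*"
    [ "$DRY_RUN" = 1 ] || "$@"
}

disable_sysv_ssh() {
    run update-rc.d ssh disable   # keep sysvinit from starting ssh at boot
    run service ssh stop          # stop the sshd that sysvinit started
}

disable_sysv_ssh
```

If ssh is your only way into the machine, hold off on the stop step until you have another way in, for the reasons given above.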

Make and Operate sshd runit Service Directory

Troubleshoot

You can skip this subsection if the final step of the preceding subsection indicated everything was functioning. Otherwise, troubleshoot.

First, here are a few generic tips when troubleshooting any process supervisor, including runit:

Explanation of sshd run script

The sshd run script looks like the following:

#! /bin/sh
exec 2>&1
echo Checking for network up before running sshd
if netisup.sh 8.8.8.8 53 ; then
 mkdir -p /var/run/sshd
 chmod 0755 /var/run/sshd
 echo Executing sshd
 exec /usr/sbin/sshd -D
 rmdir /var/run/sshd
fi
echo sshd daemon failed to run
sleep 1

There are four parts:

  1. The shebang (#!/bin/sh)
  2. The redirection (exec 2>&1)
  3. The if statement.
  4. The sleep statement.

Discuss the three easy ones first. The shebang begins every shellscript, including this one. The redirect redirects everything that is sent to stderr (file descriptor 2) to stdout (file descriptor 1). This is important because runit sends everything from stdout to the log. So the redirect makes sure all output to stderr gets logged.

The sleep at the end spends one second so that, if sshd does not run correctly, runit doesn't instantly try again. This may be unnecessary.
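
To see what the redirection buys you, here is a small self-contained sketch. Command substitution stands in for runit's logger, since both receive only what arrives on stdout. The two child scripts are identical except for the exec 2>&1 line.

```shell
#!/bin/sh
# Each child script writes one line to stdout and one to stderr.
# Command substitution captures stdout only.
without=$(sh -c 'echo "to stdout"; echo "to stderr" >&2' 2>/dev/null)
with=$(sh -c 'exec 2>&1; echo "to stdout"; echo "to stderr" >&2' 2>/dev/null)

echo "Without the redirect, captured:"
echo "$without"
echo "With the redirect, captured:"
echo "$with"
```

Only the second capture contains the stderr line, which is why the run script redirects before doing anything else.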

Now let's discuss the if statement, which consists of three things:

  1. The actual if
  2. The execution, at the current PID, of sshd if true.
  3. The scaffolding that takes place if true

The actual if is testing if the network is up. You want the network up before sshd. This is a process dependency.

The execution of /usr/sbin/sshd -D stops doing the current process, and starts doing /usr/sbin/sshd -D within the current process, if the /usr/sbin/sshd -D call succeeds. If the call succeeds, the remainder of the run script is not executed, so the line containing rmdir never gets done.

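The scaffolding creates directory /var/run/sshd, which is required by sshd in order to run. If the exec of sshd fails, the intent is that /var/run/sshd is removed; be aware, though, that many shells abort a non-interactive script outright when exec fails, in which case the cleanup is skipped too. If the call to sshd succeeds, the directory is left intact, because the rmdir line never gets executed.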
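
The way exec replaces the shell can be demonstrated without touching sshd. In this sketch, echo and a second sh stand in for the daemon. The first command shows that the line after a successful exec is never reached, and the second shows that the PID stays the same across the exec, which is what lets runsv keep supervising the daemon directly.

```shell
#!/bin/sh
# The child shell execs echo, so the second echo is never run.
out=$(sh -c 'exec echo "stand-in daemon running"; echo "cleanup line reached"')
echo "$out"

# Print the shell's PID, exec a new shell, and print its PID.
# Both lines show the same number because exec does not fork.
pids=$(sh -c 'echo $$; exec sh -c "echo \$\$"')
echo "$pids"
```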

If you came to this subsection by clicking a link, use your browser's back button to return to where you came from.

Incorporate Logging

Troubleshoot

You can skip this subsection if sshd and sshd logging appear to work. Otherwise, troubleshoot.

First, here are a few generic tips when troubleshooting any process supervisor, including runit:

What You've Accomplished

What you've done is install runit and move one daemon (sshd) from sysvinit to runit's process supervisor, thereby proving the concept. In fact, a computer that early-boots sysvinit and relies on runit to supervise its daemons is a powerful computer on its own, without changing PID1 and the early boot.

Better yet, if your eventual goal is to init completely from runit, by transferring your daemons from sysvinit to runit you've done about half the job.

Todo

This document is just a beginning. It didn't really set up an FHS (Filesystem Hierarchy Standard) compliant setup, with the /command and /service symlinks. For some distros, organizations and admins, this is unacceptable. It can be worked around, but would make installation a little more complicated, so I decided not to do it.

Obviously, nothing in this document did anything to replace sysvinit's PID1 and early boot with those from runit. That will require quite a bit of documentation.

Last but not least, this document is for runit. The s6 supervisor, and the s6/s6-rc combination init system, need to be documented in the same way as runit. I did runit first because I use it every day and am familiar with it.