Devuan Documentation

Installing Runit as a Supervisor on Devuan Jessie

Version 20170721_1320

Scope of this Document

This document walks you through installing the runit init system as a process supervisor on Devuan Jessie, with sysvinit continuing to provide PID1. This is a logical first step in completely replacing sysvinit with runit, and it also cures 90% of the perceived "ills" of sysvinit. I would not be the slightest bit ashamed to permanently run sysvinit's PID1 and early boot, combined with runit as a process supervisor.

What this doc doesn't cover is actually initting with runit (PID1 and early boot (stage1)), running daemons as users other than root, and using s6 instead of runit. Those subjects will be covered in later documentation.

Note to This Documentation's Maintainer

By Steve Litt

You can skip this section if you're not this documentation's maintainer or partial author.

This document has been constructed from the ground up for clarity, understandability and consistency. This is a styled xhtml document so that each element type has its own style, whose appearance can be formatted as desired by changing its definition in this doc's head.

Please do not change this document's format to Markdown or Asciidoc or any kind of simple markup language: Doing so will lose metadata, decreasing clarity. Please don't convert it to regular HTML: Its current Xhtml format makes it much easier to check for well-formedness and browser portability, using xmlchecker.py.

And for gosh sakes, never use a "WYSIWYG" web authoring tool on this document. Doing so would change the structural format of this document from simple, rigorous and portable to complicated, sloppy, non-portable and ultimately non-maintainable.

By spending 20 minutes looking at the styles in this document's <head/>, and seeing how they're used in this document, you'll understand just how simple it is to maintain this document the right way.

The best tool for authoring this document is the Bluefish Editor with Zencoding, with quality control via xmlchecker.py. The best graphic editor is Inkscape.

Do yourself and your fellow Devuan people a favor and never have two people editing this document at once. Just because Git allows you to do this doesn't make it optimal: unless they are very careful, multiple authors will step all over each other's work, in spite of Git's attempts to iron out the differences.

Basic Installation

The purpose of the basic installation is to get all of runit's executables in the right place on the computer. Note that different people and distros argue about what is the "right place."

CAUTION:

Almost all the steps in this document should be done while logged in as user root. Be root unless told specifically not to be.

su -
apt-get install make if needed
apt-get install gcc if needed
apt-get install openssh-server if the sshd executable isn't on your computer.
cd /usr/local
mkdir package
chmod 1755 package
cd package
Copy the runit tarball into the current directory. As of 7/1/2017 the current tarball was runit-2.1.2.tar.gz.
tar xpvzf runit-2.1.2.tar.gz, or whatever the tarball is named. Be sure to use the p option so sticky bits in the archive are respected.
cd admin/runit-2.1.2
package/compile
package/check
Fix any problems that show up on compile or check.
package/install
Note that the preceding command has created a /command directory, which is contrary to the FHS. We'll now move the /command directory so it's under the /usr/local/ directory, and replace it on the root directory with a symlink.
mv /command /usr/local/
ln -s /usr/local/command /command
About that FHS-violating /command symlink: I did that for ease on your first installation. It lets you use the exact installation scripts that come with runit. The symlink can be gotten rid of by creating and running a fairly simple shellscript, but that's an exercise for later.
The basic installation is done. Now it's time to configure runit as a process supervisor only.
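Before going further, it can help to confirm that the main runit programs really landed in the command directory. The following is a minimal sketch, not part of runit: check_runit_cmds is a hypothetical helper name, and the list of programs is only the subset this document uses.

```shell
#!/bin/sh
# Sketch: report which runit programs are missing from a command directory.
# check_runit_cmds is a hypothetical helper, not something runit ships.
# The directory argument is an assumption; this document uses /usr/local/command.
check_runit_cmds() {
    cmddir=$1
    missing=0
    for prog in runsvdir runsv sv svlogd chpst; do
        if [ -x "$cmddir/$prog" ]; then
            echo "ok:      $cmddir/$prog"
        else
            echo "MISSING: $cmddir/$prog"
            missing=1
        fi
    done
    # Return nonzero when at least one program was not found.
    return $missing
}

# Example call (commented out so the sketch has no side effects):
# check_runit_cmds /usr/local/command
```

If anything shows as MISSING, go back over the compile, check and install steps before continuing.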

Make a Network Tester

Unlike sysvinit, which uses execution order to start needed processes before their dependees, or systemd and upstart, which build dependency trees, runit and other process supervisors like it put tests in their run scripts so a dependee doesn't start unless its needed processes are already running. The runit way has both advantages and disadvantages.

The disadvantage is that the runit way can run various run scripts kind of like a field of mouse traps, each throwing marbles to set off other mousetraps. It sounds like a mess, but in fact it usually isn't. Another disadvantage is that you have to code tests into your run scripts. Well yeah, that adds a few lines of code, but I don't think I've ever seen a runit run script go over 20 lines of code, so what the heck?

The advantage is that with runit, you decide what is meant by a needed process being "ready". You don't need to trust a dbus message saying it's ready, hoping the daemon's author chose the right time to contact dbus. You don't need to wait for a daemon to "put itself in the background" and return, hoping the daemon's author chose the right time to background it. You create the test.

For instance, if you want to test tcp/ip connectivity on your LAN, you could ping the address of an always-up machine on the LAN. If you want to make sure your network device is configured properly, you could use the output of ip link, ip addr, and ip route to do so. If you want to make sure you have both tcp/ip connectivity and DNS over the Internet, you could use the nc command to see if google.com responds on port 80.

What you're going to do is create a shellscript called netisup.sh, which tests an IP address and port combination, returning 0 if that port responds on that IP, and returning 1 if it doesn't (or if the host at that IP address isn't up). This gives you a wide variety of tests you can do, and unlike the ping command, it also works for queries made from virtual machine guests to the Internet. The following is the code comprising netisup.sh:

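#!/bin/sh
# Usage: netisup.sh host port
# Exits 0 if the TCP port answers within 2 seconds, nonzero otherwise.
nc -w2 -z "$1" "$2" 2> /dev/null
exit $?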

Be sure to put it on the executable path. I recommend /usr/local/bin unless you have a place for executables that's on the path and resides in your data, so that it survives complete reinstalls.

The following are some examples:

netisup.sh google.com 80 && apt-get upgrade Upgrade if google.com resolves and answers on port 80.
netisup.sh 8.8.8.8 53 && runsamba.sh Run Samba if Google's public DNS 8.8.8.8 responds on resolver port 53.
netisup.sh 10.0.2.2 22 && scp newstuff.txt myuid@10.0.2.2: Copy the new stuff if the VM host has a functioning ssh server.
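Sometimes a single test is too strict, because the network may need a few seconds to come up. The following is a minimal sketch of a retry wrapper around any readiness test. The name wait_for and its argument order are assumptions made for illustration; nothing like it ships with runit.

```shell
#!/bin/sh
# Sketch: retry a readiness test a limited number of times.
# wait_for is a hypothetical helper name, not part of runit.
# Usage: wait_for TRIES DELAY_SECONDS COMMAND [ARGS...]
wait_for() {
    tries=$1
    delay=$2
    shift 2
    while [ "$tries" -gt 0 ]; do
        if "$@"; then
            return 0            # the test passed
        fi
        tries=$((tries - 1))
        # Sleep only if another attempt is coming.
        [ "$tries" -gt 0 ] && sleep "$delay"
    done
    return 1                    # the test never passed
}

# Example (commented out): give the VM host up to 5 tries, 2 seconds apart.
# wait_for 5 2 netisup.sh 10.0.2.2 22 && echo "ssh on the VM host is reachable"
```

In a run script you usually don't need this, because the supervisor reruns a failed run script anyway. It is more useful in one-off admin scripts.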

Bare Essentials of How Runit Works

You don't have to read this section, but if you don't read this section and then have to troubleshoot, you're dead meat till you read this section.

Bare Bones Narrative

The supervision part of runit is a process tree. At the very top of the tree is the runsvdir program, which iterates through the link directory (/etc/svlnk in this case), looking for symlinks to directories.

For each symlinked directory found, it runs the runsv program on that directory link. So now you have runsvdir as the direct parent of zero or more runsv programs.

Each runsv program that runs executes the run script in its directory, which does a few preliminaries and then replaces itself with the daemon to be run, via a shellscript exec statement.

If a runsv program finds a subdirectory called log within its directory, then it runs the run script inside that log directory, creating a second daemon that takes care of all logging.

The following is a process hierarchy representing who runs what:

16822 30290   _ runsvdir /etc/svlnk
30290 30291      \_ runsv sshd
30291 26671      |   \_ svlogd -ttv /etc/sv/sshd/log/main
30291 26672      |   \_ /usr/sbin/sshd -D
30290 29431      \_ runsv ntpd
29431 29432          \_ svlogd -ttv /etc/sv/ntpd/log/main
29431 29433          \_ /usr/sbin/ntpd -d
29433 29436              \_ /usr/sbin/ntpd -d

NOTE:

In the preceding, one ntpd forks the other one. This is a function of ntpd, not of runit.
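To see the same structure from the filesystem side, the following minimal sketch walks a link directory and reports what each entry provides. list_services is a hypothetical helper written for this document. It only reads the directories and starts nothing.

```shell
#!/bin/sh
# Sketch: show what runsvdir would find in a link directory.
# list_services is a hypothetical helper; /etc/svlnk is this document's link directory.
list_services() {
    lnkdir=$1
    for entry in "$lnkdir"/*; do
        # Skip anything that is not a directory or a symlink to one.
        [ -d "$entry" ] || continue
        name=${entry##*/}
        if [ ! -x "$entry/run" ]; then
            echo "$name: no executable run script"
        elif [ -x "$entry/log/run" ]; then
            echo "$name: daemon and logger"
        else
            echo "$name: daemon only"
        fi
    done
}

# Example call (commented out):
# list_services /etc/svlnk
```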

The Devil's Details

Things aren't quite as simple as a quick read of the preceding section would seem. There are some details.

When runsv already exists

The runsvdir program keeps scanning the link directory for directory links, running runsv on each, but after the first post-boot spin, each almost always has an existing runsv. So its runsv is queried. If runsv shows it already has the daemon running, runsvdir does nothing. If the daemon isn't running, runsv is queried to see if it's not running because the admin ran sv down on the directory, and if so, runsvdir does nothing. Otherwise, runsvdir tells the existing runsv to rerun the daemon.

Log file start and stop

When runsvdir starts, or when a new directory link is made in the link directory, runsvdir starts runsv, which first runs the directory's log directory if one exists, and then runs the daemon itself. This way, the log is running in time to catch the first output of the daemon.

When runsvdir stops, or when a directory link is deleted in the link directory, that directory's daemon and log are stopped within a few seconds. This is the wrong way to stop a daemon. The right way is as follows:

sv down ntpd ntpd/log

The preceding shuts down the daemon and its log, but leaves its directory's runsv still running. To kill the runsv, so that this daemon will not be run on reboot, perform the following additional command:

rm /etc/svlnk/ntpd

To restore this daemon so it and its log start now and will start on future reboots, perform the following command:

ln -s /etc/sv/ntpd /etc/svlnk/ntpd
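The remove and restore commands above can be wrapped in two small helpers. This is a minimal sketch: svc_enable and svc_disable are hypothetical names, and the SRCDIR and LNKDIR variables are assumptions that default to the directories used in this document.

```shell
#!/bin/sh
# Sketch: enable or disable a service by adding or removing its symlink.
# svc_enable and svc_disable are hypothetical helper names.
# SRCDIR and LNKDIR default to the directories used in this document.
SRCDIR=${SRCDIR:-/etc/sv}
LNKDIR=${LNKDIR:-/etc/svlnk}

svc_enable() {
    # Refuse to link a service directory that does not exist.
    [ -d "$SRCDIR/$1" ] || { echo "no such service: $1" >&2; return 1; }
    # Create the link only if it is not already there.
    [ -L "$LNKDIR/$1" ] || ln -s "$SRCDIR/$1" "$LNKDIR/$1"
}

svc_disable() {
    # Removing the link makes runsvdir stop that service's runsv.
    [ -L "$LNKDIR/$1" ] && rm "$LNKDIR/$1"
    return 0
}
```

As described above, run sv down on the daemon and its log before calling svc_disable, so the daemon is stopped the right way.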

Temporarily upping and downing a daemon

You use sv up and sv down to start and stop daemons and their logs. For instance, the following command stops the ntpd daemon but leaves its log running:

sv down ntpd

Often this is what you want, because a running log consumes almost no resources and carries almost no other disadvantages. If you want to shut down the daemon and its log, use the following command to shut down the daemon before the log, so the log catches everything:

sv down ntpd ntpd/log

When bringing it back up, start the log first so the log catches the very beginning of daemon startup:

sv up ntpd/log ntpd

Always remember, when using the sv command to up and down daemons and their logs, you must specifically address both the daemon and the log. But when services are started by a bootup, or by the runsvdir program starting, or by a new directory symlink linked into the link directory, the log and the daemon are brought up as a package deal, log first.
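If you find yourself forgetting the order, the following minimal sketch captures it in two helpers. svc_stop and svc_start are hypothetical names, and the SV variable is an assumption that lets you substitute echo for a dry run.

```shell
#!/bin/sh
# Sketch: wrap the ordering rules above in two helpers.
# svc_stop and svc_start are hypothetical names. SV is the command to run;
# it defaults to sv, and can be set to echo for a dry run.
SV=${SV:-sv}

svc_stop() {
    # Daemon first, then its log, so the log catches the shutdown messages.
    $SV down "$1" "$1/log"
}

svc_start() {
    # Log first, then the daemon, so the log catches the startup messages.
    $SV up "$1/log" "$1"
}

# Dry run example (commented out):
# SV=echo; svc_stop ntpd; svc_start ntpd
```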

WARNING: Persistence, State and Intermittence

Runit keeps a heck of a lot of persistent state info in the following three locations, assuming the daemon and its directory are both called mydaemond:

/etc/sv/mydaemond/supervise
/etc/sv/mydaemond/log/supervise
/etc/sv/mydaemond/log/main/lock

This persistent state information can cause wildly intermittent symptoms, head-scratching behavior, and occasionally long, drawn out troubleshooting. Whenever things start getting weird, you need to get rid of all sources of persistence by deleting the lock file and both the supervise trees, after turning off the daemon and its log.
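Before deleting anything, it can be useful to see which of those three locations actually exist. The following minimal sketch only reports and removes nothing. show_state is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: list which of the three state locations exist for a service.
# show_state is a hypothetical helper; it only reports and deletes nothing.
show_state() {
    svdir=$1        # for example /etc/sv/mydaemond
    for path in "$svdir/supervise" "$svdir/log/supervise" "$svdir/log/main/lock"; do
        if [ -e "$path" ]; then
            echo "present: $path"
        else
            echo "absent:  $path"
        fi
    done
}

# Example call (commented out):
# show_state /etc/sv/mydaemond
```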

A State Smashing Shellscript

Depending on how much persistent state impinges on troubleshooting you need to do, things might go faster if you have a shellscript (call it reset_mydaemon.sh), to get rid of all the state and restart the daemon. The following seems to be a pretty good script that handles errors and gets timings right every time:

#!/bin/sh
daemonname=$1

# Test syntax
if test "$daemonname" = ""; then
   echo "Syntax is reset_mydaemon.sh daemon_name" >&2
   exit 1
fi

# Directory names
srcdir=/etc/sv
lnkdir=/etc/svlnk
symlink=$lnkdir/$daemonname

# Test for wrong/no such daemonname
if test "$symlink" = "$lnkdir" -o ! -r "$srcdir/$daemonname"; then
   echo Bad daemon name $symlink
   exit 1
fi

# Down service, log, and remove symlink
echo
echo Downing service and any log
sv down $symlink $symlink/log

echo Removing $symlink to take down runsv
rm $symlink
while /bin/true; do
  if ps axo pid,cmd | grep "runsv $daemonname$"; then
	echo -n "Waiting for runsv to terminate...  "
	sleep 1;
  else
	sleep 1;
	echo
  	break;
  fi
done

# Remove everything keeping persistent state
echo Removing all persistent state
cd $srcdir/$daemonname
rm -rf $srcdir/$daemonname/log/supervise
rm -rf $srcdir/$daemonname/supervise
rm $srcdir/$daemonname/log/main/lock

# Start up the service
echo
echo Replacing $symlink to run runsv
ln -s $srcdir/$daemonname $symlink
echo
while /bin/true; do
  if ! ps axo pid,cmd | grep "runsv $daemonname$"; then
	echo "Waiting for runsv to come online...  "
	sleep 1;
  else
	sleep 1;
  	break;
  fi
done

# Show results
echo "Here's what's running: PPID, PID and CMD"
ps axfo ppid,pid,cmd | grep -v grep | \
  grep -e runsvdir -e $daemonname

A surprise persisting state issue can add hours to your troubleshooting. This State Smasher Shellscript isn't perfect or risk free, but personally, on anything but an important production machine, I'd use it early and often.

Move sshd from sysvinit to runit

As a proof of concept, we'll move the SSH daemon, sshd, from sysvinit to runit. By the end of this section, the SSH daemon is supervised by runit. As time goes on, you can move other important daemons to runit. The beauty of running them from runit is:

Init config never changes with new versions of the daemon, unless the daemon's command line args change.
Daemons are restarted after failure. Through a service's ./finish script, any number of actions can be taken on failure, including warning the admin.
Any home-grown daemons you create have no need for either self-backgrounding or PID files.
PID files are a thing of the past. No more stale PID files leading to bizarre happenings.

Disable sysvinit's Running of sshd

You don't want the sshd daemon running twice (once by sysvinit and once by runit), so you must disable its starting in sysvinit.

WARNING!

Right now back up file /etc/init.d/ssh. The sshd command from this script will be consulted when you create your runit run script.

If you can ONLY access this machine with ssh

Be careful. If you kill all instances of sshd, you won't be able to get back into this machine. So (almost) disable sshd by placing the following two lines immediately below the shebang (#!/bin/sh) of /etc/init.d/ssh:

/usr/sbin/sshd -p 54345
exit 0

/etc/init.d/ssh start to start sshd on port 54345
ps ax | grep 54345 to prove you succeeded
From your local machine, ssh -p 54345 username@target_machine_ip to ssh back in on port 54345.
ps ax | grep sshd | grep -v runsv | grep -v 54345 to find a list of any port 22 sshd sessions you need to kill.
Kill the sshd sessions that don't depend on port 54345.
You have now disabled sysvinit supplied sshd enough to proceed.

If you can access this machine directly, without ssh

This is much easier. Disable sshd by placing the following line immediately below the shebang (#!/bin/sh) of /etc/init.d/ssh:

exit 0

killall sshd
ps ax | grep sshd to prove you succeeded in killing all sshd instances.
You have now disabled sysvinit supplied sshd enough to proceed.

Yes, this was a kludge

Obviously there are more idiomatic Devuan ways to disable sshd. Just be sure that whatever disablement you use prevents sysvinit from starting a sshd on port 22 at boot time, and make sure the sysvinit-started sshd is not running before installing it in runit.
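Whatever method you choose, you can check whether sysvinit still has a start link for the ssh script in a given runlevel directory. The following is a minimal sketch: sysv_start_links is a hypothetical helper, and it assumes the usual sysvinit layout in which directories such as /etc/rc2.d hold links named S, two digits, then the script name.

```shell
#!/bin/sh
# Sketch: list the start links sysvinit would use for a script in one runlevel directory.
# sysv_start_links is a hypothetical helper. It assumes rc directories such as
# /etc/rc2.d hold S##name links for scripts started in that runlevel.
sysv_start_links() {
    rcdir=$1
    name=$2
    for link in "$rcdir"/S[0-9][0-9]"$name"; do
        # When nothing matches, the pattern is left as-is; skip it.
        [ -e "$link" ] || [ -L "$link" ] || continue
        echo "$link"
    done
}

# Example call (commented out): no output means no start link in that runlevel.
# sysv_start_links /etc/rc2.d ssh
```

One conventional way to remove such start links on Debian-family systems is update-rc.d ssh disable. Check the result with the helper above or with a plain ls of the rc directories.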

Make and Operate sshd runit Service Directory

ps ax | grep sshd | grep -v grep If the preceding produces output, go back and Disable sysvinit's Running of sshd.
mkdir /etc/sv
mkdir /etc/svlnk
ln -s /etc/svlnk /service This is a convenience that lets you type sv status sshd instead of sv status /etc/svlnk/sshd. It is a substantial Filesystem Hierarchy Standard (FHS) violation. The runit installation and/or source code can be changed to provide the convenience without the violation, but this document doesn't go that far.
runsvdir /etc/svlnk If the command appears to "hang" with no output, so far so good.
Ctrl+C out of runsvdir
mkdir /etc/sv/sshd

Create the following /etc/sv/sshd/run:

#! /bin/sh
exec 2>&1
echo Checking for network up before running sshd
if netisup.sh 8.8.8.8 53 ; then
 mkdir -p /var/run/sshd
 chmod 0755 /var/run/sshd
 echo Executing sshd
 exec /usr/sbin/sshd -D
 rmdir /var/run/sshd
fi
echo sshd daemon failed to run
sleep 1

The run script is explained in the section Explanation of sshd run script, later in this document.

chmod a+x /etc/sv/sshd/run
ln -s /etc/sv/sshd /etc/svlnk/sshd
runsvdir /etc/svlnk

On another terminal, run ps axo ppid,pid,stat,time,cmd | grep sshd | grep -v grep. The output should look something like this:

myuid@jessie:~$ ps axo ppid,pid,stat,time,cmd | grep sshd | grep -v grep
 4691  4692 S+   00:00:00 runsv sshd
 4692  4693 S+   00:00:00 /usr/sbin/sshd -D
myuid@jessie:~$

Note that the PID of runsv is the parent PID of sshd. This shows that runsv is supervising sshd.

Wait five seconds and rerun the preceding command. You should get the same output, with the same PID numbers. If so, you've successfully installed sshd as a service on runit, and can skip the next subsection, which is on troubleshooting.

Troubleshoot

You can skip this subsection if the final step of the preceding subsection indicated everything was functioning. Otherwise, troubleshoot.

First, here are a few generic tips when troubleshooting any process supervisor, including runit:

Make sure /etc/sv/sshd exists.
Make sure /etc/svlnk/sshd exists, and is a symlink to /etc/sv/sshd.
Make sure /etc/sv/sshd/run exists, and is executable by all. If, for security, you want to make it executable only by the user running the daemon, you can do that later after everything else is working perfectly.
As a generic timesaver, run reset_mydaemon.sh sshd. That might be a quick and easy fix, although it loses the root cause, which at this point you probably don't care about. You can learn more about this shellscript in the section A State Smashing Shellscript.
The ps command is your friend. Use it early and often. Pay particular attention to PPID, PID, and CMD. When viewing it as a tree (-f), pipe the output into the less command. Otherwise, grep is usually an excellent way to view the results. The following are some excellent ps commands:
- ps axfo ppid,pid,cmd | less
- ps axfo ppid,pid,time,cmd | \
  grep -e sshd -e runsvdir | \
  grep -v grep
- ps axo ppid,pid,time,cmd | \
  grep -e sshd -e runsvdir | \
  grep -v grep
- ps axfo ppid,pid,time,cmd | grep runsv | grep -v grep
Make sure runsvdir is running, with the symlinks directory as its argument.
Make sure the daemon to be run, in this case sshd, has a runsv process.
The sv status command is your friend. It tells you whether a daemon is up or down, and how much time it's been in that state. If it keeps being up with 0 or 1 seconds in the state, that means it's failing and repeating.
sv status sshd
If things start seeming iffy or squirrelly or indeterminate, I'd use the State Smasher Shellscript early and often. Don't wait too long: unexpected persistent state can cost hours in troubleshooting.
Try running the /etc/sv/sshd/run script, as root, from the command prompt. If it fails, find why. If it succeeds, find what's different between running it from the command prompt and from runsv. Environment vars? User? Permissions?
Run command /usr/sbin/sshd -D and make sure it starts, does not finish, and does not issue any error messages.
In the preceding command, change -D to -d to see debug information.
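The "up for 0 or 1 seconds" symptom from the tips above can be checked mechanically. The following is a minimal sketch that reads one status line on stdin. It assumes status lines of the form run: NAME: (pid N) Ts, so compare that against what your sv actually prints before relying on it. is_flapping is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: flag a service that is "up" but has been up for under 2 seconds.
# ASSUMPTION: sv status lines look like "run: NAME: (pid N) Ts"; verify on your system.
# is_flapping is a hypothetical helper that reads one status line on stdin.
is_flapping() {
    line=$(cat)
    case $line in
        run:*) ;;               # only a running service can be flapping
        *) return 1 ;;
    esac
    # Extract the uptime in seconds that follows the "(pid N)" part.
    secs=$(printf '%s\n' "$line" |
        sed -n 's/^run: [^(]*(pid [0-9]*) \([0-9][0-9]*\)s.*/\1/p')
    [ -n "$secs" ] && [ "$secs" -lt 2 ]
}

# Example (commented out):
# sv status sshd | is_flapping && echo "sshd may be failing and restarting"
```

A single low reading can simply mean the daemon just started, so repeat the check a few times, a couple of seconds apart, before drawing conclusions.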

Explanation of sshd run script

The sshd run script looks like the following:

#! /bin/sh
exec 2>&1
echo Checking for network up before running sshd
if netisup.sh 8.8.8.8 53 ; then
 mkdir -p /var/run/sshd
 chmod 0755 /var/run/sshd
 echo Executing sshd
 exec /usr/sbin/sshd -D
 rmdir /var/run/sshd
fi
echo sshd daemon failed to run
sleep 1

There are four parts:

The shebang (#! /bin/sh)
The redirection (exec 2>&1)
The if statement.
The sleep statement.

Let's discuss the three easy ones first. The shebang begins every shellscript, including this one. The redirection sends everything written to stderr (file descriptor 2) to stdout (file descriptor 1). This is important because runit sends everything from stdout to the log. So the redirection makes sure all output to stderr gets logged.
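The effect of the redirection can be seen in isolation with the following minimal sketch. redirect_demo is a hypothetical name, and the inner one-line script merely stands in for a run script.

```shell
#!/bin/sh
# Sketch: show that "exec 2>&1" sends later stderr output to stdout.
# redirect_demo is a hypothetical name; the inner script stands in for a run script.
redirect_demo() {
    # Stderr of the whole inner shell is discarded, so every line that
    # still appears must have arrived on stdout.
    sh -c 'exec 2>&1; echo "to stdout"; echo "to stderr" >&2' 2>/dev/null
}

redirect_demo
```

Both lines are printed, even though the second was written to stderr, because by then file descriptor 2 pointed at the same place as file descriptor 1.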

The sleep at the end spends one second so that, if sshd does not run correctly, runit doesn't instantly try again. This may be unnecessary.

Now let's discuss the if statement, which consists of three things:

The actual if
The execution, at the current PID, of sshd if true.
The scaffolding that takes place if true

The actual if is testing if the network is up. You want the network up before sshd. This is a process dependency.

The execution of /usr/sbin/sshd -D stops doing the current process, and starts doing /usr/sbin/sshd -D within the current process, if the /usr/sbin/sshd -D call succeeds. If the call succeeds, the remainder of the run script is not executed, so the line containing rmdir never gets done.

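The scaffolding creates directory /var/run/sshd, which is required by sshd in order to run. The rmdir line is intended as cleanup in case the exec of sshd fails. If the call to sshd succeeds, the directory is left intact, because the rmdir line never gets executed. Be aware that a non-interactive shell normally exits when exec itself fails, so in practice the rmdir line and the lines after the if statement may not run in that case either.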
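The behavior of exec is easy to try out away from sshd. The following minimal sketch uses two hypothetical helper names, exec_demo and exec_pid_demo, and harmless commands in place of the daemon.

```shell
#!/bin/sh
# Sketch: two small demonstrations of what exec does in a shell script.
# exec_demo and exec_pid_demo are hypothetical names used only here.

exec_demo() {
    # The third echo is never reached, because exec replaced the shell.
    sh -c 'echo "before exec"; exec echo "replaced by echo"; echo "never printed"'
}

exec_pid_demo() {
    # Prints the shell's PID, then execs a second shell that prints its own PID.
    # Both lines show the same number, because exec keeps the current process.
    sh -c 'echo $$; exec sh -c "echo \$\$"'
}

# Example calls (commented out):
# exec_demo
# exec_pid_demo
```

The second demonstration is the reason runsv can supervise sshd directly: the process runsv started as the run script is the same process that ends up being sshd.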

If you came to this subsection by clicking a link, use your browser's back button to return to where you came from.

Incorporate Logging

With runsvdir running on one terminal, do all the following on another...
sv down /etc/svlnk/* Turn off all runit-supervised daemons.
mkdir -p /etc/sv/sshd/log/main This also creates the main directory, which must exist before svlogd can write logs into it.

Create the following /etc/sv/sshd/log/run:

#!/bin/sh
exec 2>&1
exec svlogd -ttv /etc/sv/sshd/log/main

chmod a+x /etc/sv/sshd/log/run
sv up /etc/svlnk/sshd/log /etc/svlnk/sshd
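Look under /etc/svlnk/sshd/log to see what's written there. svlogd keeps the active log in a file named current, inside the directory named in the log run script.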
ssh into the server at port 22, and see if it lets you in.
Look again under /etc/svlnk/sshd/log to see what has been added to the log.
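For repeated looks at the log, the following minimal sketch prints its newest lines. show_log is a hypothetical helper. It assumes the log directory is log/main under the service directory, matching the process tree shown earlier in this document, and relies on svlogd keeping the active log in a file named current.

```shell
#!/bin/sh
# Sketch: print the newest lines of a service's log.
# show_log is a hypothetical helper. ASSUMPTION: svlogd's log directory is
# SERVICE_DIR/log/main, where svlogd keeps the active log in a file named current.
show_log() {
    svdir=$1            # for example /etc/sv/sshd
    count=${2:-20}      # number of lines to show, 20 by default
    logfile=$svdir/log/main/current
    [ -r "$logfile" ] || { echo "no readable log at $logfile" >&2; return 1; }
    tail -n "$count" "$logfile"
}

# Example call (commented out):
# show_log /etc/sv/sshd 5
```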

Troubleshoot

You can skip this subsection if sshd and sshd logging appear to work. Otherwise, troubleshoot.

First, here are a few generic tips when troubleshooting any process supervisor, including runit:

Make sure /etc/sv/sshd/log exists.
Make sure /etc/svlnk/sshd exists, and is a symlink to /etc/sv/sshd.
Make sure /etc/sv/sshd/log/run exists, and is executable by all. If, for security, you want to make it executable only by the user running the daemon, you can do that later after everything else is working perfectly.
As a generic timesaver, run reset_mydaemon.sh sshd. That might be a quick and easy fix, although it loses the root cause, which at this point you probably don't care about. You can learn more about this shellscript in the section A State Smashing Shellscript.
The ps command is your friend. Use it early and often. Pay particular attention to PPID, PID, and CMD. When viewing it as a tree (-f), pipe the output into the less command. Otherwise, grep is usually an excellent way to view the results. The following are some excellent ps commands:
- ps axfo ppid,pid,cmd | less
- ps axfo ppid,pid,time,cmd | \
  grep -e sshd -e runsvdir | \
  grep -v grep
Make sure the daemon to be run, in this case sshd, has a runsv process.
The sv status command is your friend. It tells you whether a daemon is up or down, and how much time it's been in that state. If it keeps being up with 0 or 1 seconds in the state, that means it's failing and repeating.
sv status sshd sshd/log
If things start seeming iffy or squirrelly or indeterminate, I'd use the State Smasher Shellscript early and often. Don't wait too long: unexpected persistent state can cost hours in troubleshooting.
Try running the /etc/sv/sshd/log/run script, as root, from the command prompt. If it fails, find why. If it succeeds, find what's different between running it from the command prompt and from runsv. Environment vars? User? Permissions?
ps axo ppid,pid,time,cmd | \
grep -e sshd -e runsvdir | \
grep -v grep
ps axfo ppid,pid,time,cmd | grep runsv | grep -v grep

Start Runit From Sysvinit

Throughout this document, we've started runit by typing the following on a terminal logged in as root:

runsvdir /etc/svlnk

There's a reason for that. We needed to be able to turn on and off runsvdir, even if we had to use Ctrl+C to do it. For debugging, we also had to view the output of runsvdir in real time.

But now everything works, so it's time for runsvdir to run upon reboot. This would have been very simple, except that paths during boot might not be complete. So the first step is to put the directory containing all runit executables into the path, using the following /usr/local/bin/runsvdir.sh shellscript:

#!/bin/sh
lnkdir=$1
cmddir=/usr/local/command
echo $PATH | grep -q "$cmddir" || export PATH=$PATH:$cmddir
exec /usr/local/bin/runsvdir $lnkdir

Basically, if $cmddir is not on the $PATH, it's appended. And $cmddir is set to /usr/local/command because that's where we put that directory during installation. This shellscript guarantees that the runit executables will be on the $PATH during boot.
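The grep test above matches substrings, so a path such as /usr/local/command2 would also count as a match. The following minimal sketch shows a stricter variant that only treats whole PATH entries as matches. path_append is a hypothetical helper, not part of the script above. It prints the new value rather than changing PATH itself.

```shell
#!/bin/sh
# Sketch: append a directory to a PATH-style list only when it is not already an entry.
# path_append is a hypothetical helper; it prints the resulting value.
path_append() {
    cur=$1      # current PATH-style value
    dir=$2      # directory to add
    case ":$cur:" in
        *":$dir:"*) echo "$cur" ;;          # already present as a whole entry
        *)          echo "$cur:$dir" ;;     # not present, so append it
    esac
}

# Example use (commented out):
# PATH=$(path_append "$PATH" /usr/local/command); export PATH
```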

Now perform the following steps:

Edit /etc/inittab
Open a blank line below all the lines containing respawn:/sbin/getty
In the open line, insert
SV:123456:respawn:/usr/local/bin/runsvdir.sh /etc/svlnk
Save and exit your editor.
Reboot, then run the following command to confirm that runsvdir and sshd are running:

ps axfo ppid,pid,cmd | grep -v grep | \
grep -e sshd -e runsvdir
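If runsvdir does not show up after the reboot, one thing to rule out is a mistyped inittab line. The following minimal sketch checks a file for a respawn entry with a given id. has_respawn_entry is a hypothetical helper, and it relies on inittab lines having the form id:runlevels:action:process.

```shell
#!/bin/sh
# Sketch: check that an inittab file has a respawn entry with a given id.
# has_respawn_entry is a hypothetical helper; inittab lines have the form
# id:runlevels:action:process.
has_respawn_entry() {
    file=$1
    id=$2
    # Match "id:", any runlevels field, then the respawn action.
    grep -q "^$id:[^:]*:respawn:" "$file"
}

# Example call (commented out):
# has_respawn_entry /etc/inittab SV && echo "SV entry found"
```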

What You've Accomplished

What you've done is install runit and move one daemon (sshd) from sysvinit to runit's process supervisor, thereby proving the concept. In fact, a computer that early-boots sysvinit and relies on runit to supervise its daemons is a powerful computer on its own, without changing PID1 and the early boot.

Better yet, if your eventual goal is to init completely from runit, by transferring your daemons from sysvinit to runit you've done about half the job.

Todo

This document is just a beginning. It didn't set up an FHS (Filesystem Hierarchy Standard) compliant system: the /command and /service symlinks both violate the FHS. For some distros, organizations and admins, this is unacceptable. It can be worked around, but would make installation a little more complicated, so I decided not to do it.

Obviously, nothing in this document did anything to replace sysvinit's PID1 and early boot with those from runit. That will require quite a bit of documentation.

Last but not least, this document is for runit. The s6 supervisor, and the s6/s6-rc combination init system, need to be documented in a similar way. I did runit first because I use it every day and am familiar with it.