Devuan Documentation
Installing Runit as a Supervisor on Devuan Jessie
Version 20170721_1320
Copyright (C) 2017 by Steve Litt. License to be determined, intended to be a free content license. No warranty, use at your own risk.
This document walks you through installing the runit init system as a process supervisor on Devuan Jessie, using sysvinit's PID1 as its PID1. This is a logical first step in completely replacing sysvinit with runit, and it also cures 90% of the perceived "ills" of sysvinit. I would not be the slightest bit ashamed to permanently run sysvinit's PID1 and early boot, combined with runit as a process supervisor.
This document doesn't cover actually initting with runit (PID1 and early boot, also called stage 1), running daemons as users other than root, or using s6 instead of runit. Those subjects will be covered in later documentation.
You can skip this section if you're not this documentation's maintainer or partial author.
This document has been constructed from the ground up for clarity, understandability and consistency. It is a styled XHTML document, so each element type has its own style, whose appearance can be changed as desired by editing its definition in this doc's head.
Please do not change this document's format to Markdown, Asciidoc or any other simple markup language: doing so would lose metadata and decrease clarity. Please don't convert it to regular HTML either: its current XHTML format makes it much easier to check for well-formedness and browser portability using xmlchecker.py.
And for gosh sakes, never use a "WYSIWYG" web authoring tool on this document. Doing so would change the structural format of this document from simple, rigorous and portable to complicated, sloppy, non-portable and ultimately non-maintainable.
By spending 20 minutes looking at the styles in this document's <head/>, and seeing how they're used in this document, you'll understand just how simple it is to maintain this document the right way.
The best tool for authoring this document is the Bluefish Editor with Zencoding, with quality control via xmlchecker.py. The best graphics editor is Inkscape.
Do yourself and your fellow Devuan people a favor and never have two people editing this document at once. Just because Git allows this doesn't make it optimal: unless they're very careful, multiple authors will step all over each other's work, in spite of Git's attempts to reconcile the differences.
The purpose of the basic installation is to get all of runit's executables in the right place on the computer. Note that different people and distros argue about what is the "right place."
CAUTION:
Almost all the steps in this document should be done while logged in as user root. Be root unless told specifically not to be.
Unlike sysvinit, which uses execution order to start needed processes before the processes that depend on them, or systemd and upstart, which build dependency trees, runit and similar process supervisors put tests in their run scripts so a dependent process doesn't start unless the processes it needs are already running. The runit way has both advantages and disadvantages.
One disadvantage is that the runit way can make the run scripts behave like a field of mousetraps, each throwing marbles to set off other mousetraps. It sounds like a mess, but in practice it usually isn't. Another disadvantage is that you have to code tests into your run scripts. Well yeah, that adds a few lines of code, but I don't think I've ever seen a runit run script go over 20 lines, so what the heck?
The advantage is that with runit, you decide what a needed process being "ready" means. You don't need to trust a dbus message saying it's ready, hoping the daemon's author chose the right time to contact dbus. You don't need to wait for a daemon to "put itself in the background" and return, hoping the daemon's author chose the right time to put the daemon in the background. You create the test.
For instance, if you want to test tcp/ip connectivity on your LAN, you could ping the address of an always-up machine on the LAN. If you want to make sure your network device is configured properly, you could use the output of ip link, ip addr, and ip route to do so. If you want to make sure you have both tcp/ip connectivity and DNS over the Internet, you could use the nc command to see if google.com responds on port 80.
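The first two kinds of test can be written as small shell functions. This is a hypothetical sketch (the function names are made up for illustration): lan_host_answers pings one host, while has_default_route and link_is_up read the output of ip route show and ip -o link show on stdin, so they can be tried on saved output. Each returns 0 when its condition holds, which is what a run script's if needs.

```shell
#!/bin/sh
# Hypothetical readiness checks for runit run scripts.
# Each returns 0 when the condition holds, nonzero otherwise.

# LAN test: does an always-up LAN host answer one ping within 2 seconds?
# Usage: lan_host_answers 192.168.1.1
lan_host_answers() {
    ping -c 1 -W 2 "$1" > /dev/null 2>&1
}

# Route test: does "ip route show" output (on stdin) contain a default route?
# Usage: ip route show | has_default_route
has_default_route() {
    grep -q '^default '
}

# Link test: is interface $1 flagged UP in "ip -o link show" output on stdin?
# Matches "UP" as a whole flag, so "LOWER_UP" alone doesn't count.
# Usage: ip -o link show | link_is_up eth0
link_is_up() {
    grep -Eq "^[0-9]+: $1(@[^:]*)?: <([^>]*,)?UP[,>]"
}
```

In a run script, a gate such as `ip route show | has_default_route && netisup.sh 8.8.8.8 53` would combine a local configuration check with a real connectivity check.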
What you're going to do is create a shellscript called netisup.sh, which tests an IP address and port combination, returning 0 if that port responds at that IP address, and returning 1 if it doesn't (or if nothing at that address is reachable). This gives you a wide variety of tests, and unlike the ping command, it works even from virtual machine guests whose NAT networking doesn't pass ping (ICMP) traffic to the Internet. The following is the code comprising netisup.sh:
#!/bin/sh
# netisup.sh HOST PORT
# Exit 0 if something answers on TCP port PORT at HOST within 2 seconds,
# nonzero otherwise.
nc -w2 -z "$1" "$2" 2> /dev/null
exit $?
Be sure to make it executable (chmod 755 netisup.sh) and put it on the executable path. I recommend /usr/local/bin, unless you keep a directory of executables on the path inside your data tree, where it would survive complete reinstalls.
The following are some examples:
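Typical invocations look like the commented lines at the top of this sketch; the addresses are placeholders, except 8.8.8.8 (Google's public DNS server), which this document also uses later. The all_up function is a hypothetical helper, not part of runit: it takes a checker command (normally netisup.sh) followed by HOST:PORT pairs and fails as soon as one pair doesn't answer. The HOST:PORT split assumes no IPv6 literals.

```shell
#!/bin/sh
# Hypothetical examples (placeholder addresses):
#   netisup.sh 192.168.1.1 22    # LAN host answering on ssh
#   netisup.sh 8.8.8.8 53        # Internet reachable (Google public DNS, TCP 53)
#   netisup.sh google.com 80     # Internet reachable and DNS resolving

# all_up CHECKER HOST:PORT...   (hypothetical helper)
# Runs "CHECKER HOST PORT" for each pair; returns 0 only if every pair
# answers, 1 as soon as one doesn't.
all_up() {
    checker=$1
    shift
    for hostport in "$@"; do
        host=${hostport%:*}     # everything before the last colon
        port=${hostport##*:}    # everything after the last colon
        "$checker" "$host" "$port" || return 1
    done
    return 0
}

# Usage: all_up netisup.sh 192.168.1.1:22 8.8.8.8:53 google.com:80 && echo "all up"
```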
You don't have to read this section, but if you don't read this section and then have to troubleshoot, you're dead meat till you read this section.
The supervision part of runit is a process tree. At the very top of the tree is the runsvdir program, which iterates through the link directory (/etc/svlnk in this case), looking for symlinks to directories.
For each directory symlink found, it runs the runsv program on that link. So now you have runsvdir as the direct parent of zero or more runsv programs.
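The selection rule can be modeled in a few lines of shell. This is only a simplified illustration, not runsvdir's actual code: per the runsvdir man page, it starts a runsv for each subdirectory or symlink to a directory in the link directory, skipping names that start with a dot. The function name list_supervised is hypothetical.

```shell
#!/bin/sh
# list_supervised LNKDIR   (hypothetical, simplified model of runsvdir's scan)
# Print the names of entries in LNKDIR that runsvdir would start a runsv for:
# directories or symlinks to directories, ignoring dot names.
list_supervised() {
    for entry in "$1"/*; do            # "*" doesn't match dot names
        [ -d "$entry" ] || continue    # -d follows symlinks; skips files and dangling links
        basename "$entry"
    done
}

# Usage: list_supervised /etc/svlnk
```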
Each runsv program that runs executes the run script in its directory, which does a few preliminaries and then replaces itself with the daemon to be run, via a shellscript exec statement.
If a runsv program finds a subdirectory called log within its directory, it also runs the run script inside that log directory, creating a second daemon that takes care of all logging. runsv connects the main daemon's stdout to the log daemon's stdin through a pipe.
The following is a process hierarchy representing who runs what:
16822 30290  \_ runsvdir /etc/svlnk
30290 30291      \_ runsv sshd
30291 26671      |   \_ svlogd -ttv /etc/sv/sshd/log/main
30291 26672      |   \_ /usr/sbin/sshd -D
30290 29431      \_ runsv ntpd
29431 29432          \_ svlogd -ttv /etc/sv/ntpd/log/main
29431 29433          \_ /usr/sbin/ntpd -d
29433 29436              \_ /usr/sbin/ntpd -d
NOTE:
In the preceding, one ntpd forks the other one. This is a function of ntpd, not of runit.
Things aren't quite as simple as a quick read of the preceding section would suggest. There are some details.
The runsvdir program keeps rescanning the link directory every few seconds. After the first post-boot scan, almost every directory link already has a runsv, and runsvdir leaves those alone. It starts a new runsv for each new link, and starts a replacement runsv if one has died. Restarting the daemon itself is runsv's job, not runsvdir's: if the daemon exits, runsv reruns its run script, unless the admin has run sv down on it, in which case runsv leaves it down.
When runsvdir starts, or when a new directory link is made in the link directory, runsvdir starts runsv, which first starts the directory's log service if one exists, and then starts the daemon itself. This way, the log is running in time to catch the daemon's first output.
When runsvdir stops, or when a directory link is deleted in the link directory, that directory's daemon and log are stopped within a few seconds. This is the wrong way to stop a daemon. The right way is as follows:
sv down ntpd ntpd/log
The preceding shuts down the daemon and its log, but leaves its directory's runsv still running. To kill the runsv, so that this daemon will not be run on reboot, perform the following additional command:
rm /etc/svlnk/ntpd
To restore this daemon so it and its log start now and will start on future reboots, perform the following command:
ln -s /etc/sv/ntpd /etc/svlnk/ntpd
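The two procedures above can be wrapped in a pair of small shell functions. This is a hypothetical sketch: svc_enable, svc_disable, and the SRCDIR, LNKDIR and SV variables are invented for illustration (SV exists only so the sv command can be swapped for a stub when testing). svc_disable follows the recommended order: sv down the daemon and its log first, then remove the link.

```shell
#!/bin/sh
# Hypothetical helpers for this document's layout:
# service directories in /etc/sv, directory links in /etc/svlnk.
SRCDIR=${SRCDIR:-/etc/sv}
LNKDIR=${LNKDIR:-/etc/svlnk}
SV=${SV:-sv}

# Stop the daemon, then its log, then remove the link so runsvdir
# stops that runsv and the service won't start at boot.
svc_disable() {
    "$SV" down "$LNKDIR/$1" "$LNKDIR/$1/log"
    rm -f "$LNKDIR/$1"
}

# Link the service directory back in; runsvdir then starts runsv,
# which starts the log and then the daemon.
svc_enable() {
    if [ ! -d "$SRCDIR/$1" ]; then
        echo "No such service directory: $SRCDIR/$1" >&2
        return 1
    fi
    [ -L "$LNKDIR/$1" ] || ln -s "$SRCDIR/$1" "$LNKDIR/$1"
}

# Usage: svc_disable ntpd; svc_enable ntpd
```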
You use sv up and sv down to start and stop daemons and their logs. For instance, the following command stops the ntpd daemon but leaves its logger running:
sv down ntpd
Often this is what you want, because a running log consumes almost no resources and carries almost no other disadvantages. If you want to shut down the daemon and its log, use the following command to shut down the daemon before the log, so the log catches everything:
sv down ntpd ntpd/log
When bringing it back up, start the log first so the log catches the very beginning of daemon startup:
sv up ntpd/log ntpd
Always remember: when using the sv command to bring daemons and their logs up and down, you must address both the daemon and the log explicitly. But when services are started at boot, by runsvdir starting, or by a new directory symlink in the link directory, the log and the daemon come up as a package deal, log first.
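One catch with bare names like ntpd: sv looks them up in the directory named by the SVDIR environment variable, or in its compiled-in default (/service in upstream runit) when SVDIR is unset. With this document's layout you'll want SVDIR=/etc/svlnk, or full paths under /etc/svlnk. The sketch below sets SVDIR and wraps the two orderings in hypothetical functions svc_down and svc_up; SV is again a made-up variable so sv can be stubbed.

```shell
#!/bin/sh
# Let sv resolve bare names like "ntpd" against this document's link
# directory instead of its compiled-in default.
SVDIR=/etc/svlnk
export SVDIR
SV=${SV:-sv}

# Daemon first, then log, so the log catches everything.
svc_down() { "$SV" down "$1" "$1/log"; }

# Log first, then daemon, so the log catches startup.
svc_up() { "$SV" up "$1/log" "$1"; }

# Usage: svc_down ntpd; svc_up ntpd; sv status ntpd ntpd/log
```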
Runit keeps a heck of a lot of persistent state info in the following three locations, assuming the daemon and its directory are both called mydaemond:
/etc/sv/mydaemond/supervise
/etc/sv/mydaemond/log/supervise
/etc/sv/mydaemond/log/main/lock
This persistent state information can cause wildly intermittent symptoms, head-scratching behavior, and occasionally long, drawn out troubleshooting. Whenever things start getting weird, you need to get rid of all sources of persistence by deleting the lock file and both the supervise trees, after turning off the daemon and its log.
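Before smashing anything, it can help to see which of those state paths actually exist: the supervise directory, log/supervise, and the svlogd lock file log/main/lock. The function below is a hypothetical sketch that assumes this document's /etc/sv layout; it only lists paths and never deletes anything.

```shell
#!/bin/sh
# show_state NAME [SRCDIR]   (hypothetical helper)
# Print whichever of runit's persistent-state paths exist for service NAME.
show_state() {
    dir=${2:-/etc/sv}/$1
    for path in "$dir/supervise" "$dir/log/supervise" "$dir/log/main/lock"; do
        if [ -e "$path" ]; then
            echo "$path"
        fi
    done
}

# Usage: show_state ntpd
```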
Depending on how much persistent state impinges on troubleshooting you need to do, things might go faster if you have a shellscript (call it reset_mydaemon.sh), to get rid of all the state and restart the daemon. The following seems to be a pretty good script that handles errors and gets timings right every time:
#!/bin/sh
# reset_mydaemon.sh: down a runit service and its log, wipe all of its
# persistent state, then bring it back up.
daemonname=$1

# Test syntax
if test -z "$daemonname"; then
    echo "Syntax is reset_mydaemon.sh daemon_name" >&2
    exit 1
fi

# Directory names
srcdir=/etc/sv
lnkdir=/etc/svlnk
symlink=$lnkdir/$daemonname

# Test for wrong/no such daemon name
if test ! -d "$srcdir/$daemonname"; then
    echo "Bad daemon name $daemonname" >&2
    exit 1
fi

# Down service and any log, then remove the symlink so runsvdir stops runsv
echo
echo "Downing service and any log"
sv down "$symlink" "$symlink/log"
echo "Removing $symlink to take down runsv"
rm -f "$symlink"
while ps axo pid,cmd | grep -q "runsv $daemonname\$"; do
    echo "Waiting for runsv to terminate..."
    sleep 1
done

# Remove everything keeping persistent state
echo "Removing all persistent state"
rm -rf "$srcdir/$daemonname/log/supervise"
rm -rf "$srcdir/$daemonname/supervise"
rm -f "$srcdir/$daemonname/log/main/lock"

# Start up the service
echo
echo "Replacing $symlink to run runsv"
ln -s "$srcdir/$daemonname" "$symlink"
until ps axo pid,cmd | grep -q "runsv $daemonname\$"; do
    echo "Waiting for runsv to come online..."
    sleep 1
done
sleep 1

# Show results
echo "Here's what's running: PPID, PID and CMD"
ps axfo ppid,pid,cmd | grep -v grep | \
    grep -e runsvdir -e "$daemonname"
A surprise persisting state issue can add hours to your troubleshooting. This State Smasher Shellscript isn't perfect or risk free, but personally, on anything but an important production machine, I'd use it early and often.
As a proof of concept, we'll move the SSH daemon, sshd, from sysvinit to runit. By the end of this section, the SSH daemon will be supervised by runit. As time goes on, you can move other important daemons to runit. The beauty of running them from runit is:
You don't want sshd started twice (once by sysvinit and once by runit), so you must stop sysvinit from starting it.
WARNING!
Back up /etc/init.d/ssh right now. You'll consult the sshd command in this script when you create your runit run script.
Be careful. If you kill all instances of sshd, you won't be able to get back into this machine. So (almost) disable sshd by placing the following two lines immediately below the shebang (#!/bin/sh) of /etc/init.d/ssh:
/usr/sbin/sshd -p 54345
exit 0
This is much easier. Disable sshd by placing the following line immediately below the shebang (#!/bin/sh) of /etc/init.d/ssh:
exit 0
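If you'd rather not hand-edit the file, the edit can be scripted. This is a hypothetical helper, not a Devuan tool: it copies the script to FILE.orig (which also gives you the backup recommended above), then writes the shebang, an exit 0 line, and the rest of the original back into FILE. Run it only once, or the second run will overwrite the pristine backup.

```shell
#!/bin/sh
# disable_initscript FILE   (hypothetical helper)
# Back FILE up to FILE.orig, then insert "exit 0" right below its shebang
# so the init script does nothing. Writing into the existing FILE keeps
# its permissions. Run once only: a second run overwrites FILE.orig.
disable_initscript() {
    cp -p "$1" "$1.orig" || return 1
    {
        head -n 1 "$1.orig"
        echo "exit 0"
        tail -n +2 "$1.orig"
    } > "$1"
}

# Usage (as root): disable_initscript /etc/init.d/ssh
```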
Obviously there are more idiomatic Devuan ways to disable sshd. Just be sure that whatever method you use prevents sysvinit from starting an sshd on port 22 at boot time, and make sure no sysvinit-started sshd is listening on port 22 before you install sshd in runit.
#! /bin/sh
exec 2>&1
echo Checking for network up before running sshd
if netisup.sh 8.8.8.8 53 ; then
    mkdir -p /var/run/sshd
    chmod 0755 /var/run/sshd
    echo Executing sshd
    exec /usr/sbin/sshd -D
    rmdir /var/run/sshd
fi
echo sshd daemon failed to run
sleep 1
myuid@jessie:~$ ps axo ppid,pid,stat,time,cmd | grep sshd | grep -v grep
 4691  4692 S+   00:00:00 runsv sshd
 4692  4693 S+   00:00:00 /usr/sbin/sshd -D
myuid@jessie:~$
Note that the PID of runsv is the parent PID of sshd: runsv is supervising sshd.
You can skip this subsection if the final step of the preceding subsection indicated everything was functioning. Otherwise, troubleshoot.
First, here are a few generic tips when troubleshooting any process supervisor, including runit:
The sshd run script looks like the following:
#! /bin/sh
exec 2>&1
echo Checking for network up before running sshd
if netisup.sh 8.8.8.8 53 ; then
    mkdir -p /var/run/sshd
    chmod 0755 /var/run/sshd
    echo Executing sshd
    exec /usr/sbin/sshd -D
    rmdir /var/run/sshd
fi
echo sshd daemon failed to run
sleep 1
There are four parts:
Let's discuss the three easy ones first. The shebang begins every shellscript, including this one. The redirect, exec 2>&1, sends everything written to stderr (file descriptor 2) to stdout (file descriptor 1). This is important because runsv pipes the run script's stdout into the log service, so the redirect makes sure stderr output gets logged too.
The sleep at the end pauses one second so that, if the network test fails and sshd isn't started, runsv doesn't instantly rerun the script. This may be unnecessary.
Now let's discuss the if statement, which consists of three things:
The actual if is testing if the network is up. You want the network up before sshd. This is a process dependency.
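The exec /usr/sbin/sshd -D statement replaces the shell running the run script with /usr/sbin/sshd -D in the same process (same PID), so runsv ends up supervising sshd directly. The -D option keeps sshd in the foreground instead of detaching, which is what a supervisor needs. If the exec succeeds, the remainder of the run script is never executed, so the line containing rmdir never gets done.
The scaffolding creates directory /var/run/sshd, which sshd requires in order to run. The rmdir after the exec is meant to remove /var/run/sshd if the exec fails; if the exec succeeds, the directory is left intact because the rmdir line never gets executed. Be aware, though, that common shells such as dash and bash exit a non-interactive script when exec fails, so in practice the rmdir rarely if ever runs. It's harmless either way.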
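You can watch exec keep the PID and skip the rest of the script with a one-line experiment. This is just a demonstration, unrelated to runit itself: a child shell prints its PID, then execs a second shell that prints its own PID.

```shell
#!/bin/sh
# exec demo: both numbers printed are the same PID, because exec replaces
# the process instead of creating a new one; "never printed" never appears
# because nothing after a successful exec runs.
sh -c 'echo "$$"; exec sh -c "echo \$\$"; echo "never printed"'
```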
#!/bin/sh
exec 2>&1
# svlogd needs its log directory to exist
mkdir -p /etc/sv/sshd/log/main
exec svlogd -ttv /etc/sv/sshd/log/main
You can skip this subsection if sshd and sshd logging appear to work. Otherwise, troubleshoot.
First, here are a few generic tips when troubleshooting any process supervisor, including runit:
Throughout this document, we've started runit by typing the following on a terminal logged in as root:
/usr/sbin/runsvdir /etc/svlnk
There's a reason for that. We needed to be able to turn on and off runsvdir, even if we had to use Ctrl+C to do it. For debugging, we also had to view the output of runsvdir in real time.
But now everything works, so it's time for runsvdir to run upon reboot. This would have been very simple, except that paths during boot might not be complete. So the first step is to put the directory containing all runit executables into the path, using the following /usr/local/bin/runsvdir.sh shellscript:
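#!/bin/sh
# runsvdir.sh LNKDIR: make sure the runit executables are on $PATH,
# then replace this shell with runsvdir watching LNKDIR.
lnkdir=$1
cmddir=/usr/local/command
case ":$PATH:" in
    *":$cmddir:"*) ;;                     # already on the path
    *) export PATH="$PATH:$cmddir" ;;     # append it
esac
exec /usr/local/bin/runsvdir "$lnkdir"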
Basically, if $cmddir is not on the $PATH, it's appended. And $cmddir is set to /usr/local/command because that's where we put that directory during installation. This shellscript guarantees that the runit executables will be on the $PATH during boot.
Now perform the following steps:
ps axfo ppid,pid,cmd | grep -v grep | \
    grep -e sshd -e runsvdir
What you've done is install runit and move one daemon (sshd) from sysvinit to runit's process supervisor, thereby proving the concept. In fact, a computer that early-boots with sysvinit and relies on runit to supervise its daemons is a powerful setup in its own right, without changing PID1 and the early boot.
Better yet, if your eventual goal is to init completely from runit, by transferring your daemons from sysvinit to runit you've done about half the job.
This document is just a beginning. It set up neither a strictly FHS (Filesystem Hierarchy Standard) compliant layout nor the traditional /command and /service symlinks. For some distros, organizations and admins, this is unacceptable. It can be worked around, but doing so would make installation a little more complicated, so I decided not to do it.
Obviously, nothing in this document did anything to replace sysvinit's PID1 and early boot with those from runit. That will require quite a bit of documentation.
Last but not least, this document is for runit. The s6 supervisor, and the s6/s6-rc combination init system, need to be documented similarly to runit. I did runit first because I use it every day and am familiar with it.