Netboot CentOS 7 on ATAoE

This writeup covers the process of configuring CentOS 7 to boot seamlessly from an ATAoE target.

This is our setup:

  • TFTP server with a MAC-specific boot menu presenting the kernel and initrd. See http://www.syslinux.org/wiki/index.php/PXELINUX
  • Server exporting an LVM Logical Volume with a CentOS 7 install on it using vblade.
  • Computer to boot from the network (this computer has no hard drive and uses the exported LV as its disk).

Required components:

  • A kernel that supports AoE and your network card (we are using a custom-built 3.17.4 kernel). If you are building your own kernel, the AoE driver is in Device Drivers —> Block Devices and is called ATA over Ethernet Support; the driver for your NIC is in Device Drivers —> Network device support —> Ethernet driver support, somewhere under the appropriate manufacturer.
  • A custom dracut module to bring up the network and discover AoE targets at boot time.
  • Packages:
    • vblade (This is only required on the server exporting the LV).
    • Dracut (dracut, dracut-network, dracut-tools)

Required Steps:

1. Install CentOS 7 onto an LV on the boot server. Since this is not in the scope of this tutorial, I will leave this part up to you. There is plenty of documentation out there (we installed it under KVM first, and exported the resulting disk as an AoE target).
2. Once you have it installed, you need to enter the environment that you deployed in step 1 to install a dracut AoE module and generate an initrd.
3. Next, make sure that you have a kernel that supports AoE and your network card. It is important that both of those kernel modules get built into your initrd, or this will not work.
4. At this point (hopefully) we are in the CentOS 7 environment, whether chrooted, virtual machine, or something else; we can now install packages:

1. yum install dracut-network dracut-tools dracut
2. cd /usr/lib/dracut/modules.d/ && ls

You’ll notice that there are a bunch of folders in here, all starting with two digits. The digits signify the order in which the modules are loaded. To make sure all of the prerequisite modules are loaded, we went with 95aoe for our module directory.

3. mkdir 95aoe && cd 95aoe
4. There are 3 files that we need to create — module-setup.sh, parse-aoe.sh, and aoe-up.sh

1. Let’s start with module-setup.sh, since this is the script that will pull into the initrd all of the pieces we need for AoE to work.

#!/bin/bash
# -*- mode: shell-script; indent-tabs-mode: nil; sh-basic-offset: 4; -*-
# ex: ts=8 sw=4 sts=4 et filetype=sh

check() {
        for i in mknod ip rm bash grep sed awk seq echo mkdir; do
                type -P $i >/dev/null || return 1
        done

        return 0
}

depends() {
        echo network
        return 0
}

installkernel() {
        instmods aoe
}

install() {
        inst_multiple mknod ip rm bash grep sed awk seq echo mkdir

        inst "$moddir/aoe-up.sh" "/sbin/aoe-up"
        inst_hook cmdline 98 "$moddir/parse-aoe.sh"
        dracut_need_initqueue
}

2. Next is parse-aoe.sh. This script will load the modules and queue up our main script to be run by init.

#!/bin/sh

modprobe aoe
udevadm settle --timeout=30
/sbin/initqueue --settled --unique /sbin/aoe-up

3. Finally, we need aoe-up.sh, which is what is actually run by init to bring up the network interfaces and discover the AoE device being exported. It’s worth noting that I am setting the MTU to 9000 on each interface, which won’t necessarily be supported on your system. If you are unsure, remove the line “ip link set dev ${INTERFACES[$i]} mtu 9000”:

#!/bin/bash

PATH=/usr/sbin:/usr/bin:/sbin:/bin

exec >>/run/initramfs/loginit.pipe 2>>/run/initramfs/loginit.pipe

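# /dev/etherd/discover is the AoE discovery control node (char device 152,3);
# writing to it (see the echo below) triggers discovery of AoE targets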
mkdir -p /dev/etherd
rm -f /dev/etherd/discover
mknod /dev/etherd/discover c 152 3

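# gather the names of all broadcast-capable network interfaces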
INTERFACES=(`ip link |grep BROADCAST |awk -F ":" '{print $2}' |sed 's/ //g' |sed 's/\n/ /g'`)
TOTAL_INTERFACES=${#INTERFACES[@]}

for i in `seq 0 $(($TOTAL_INTERFACES-1))`; do
    ip link set dev ${INTERFACES[$i]} mtu 9000
    ip link set ${INTERFACES[$i]} up
    wait_for_if_up ${INTERFACES[$i]}
done

ip link show

echo > /dev/etherd/discover

5. Now that we have written our AoE dracut module, we need to rebuild the initrd. Before we do this, however, we need to make sure that dracut will pull in the modules we need (aoe and your NIC module). There are a few different ways to do this, but here are two options:

1. modprobe aoe && modprobe NIC_MODULE && dracut --force /boot/initramfs-KERNEL_VER.img KERNEL_VER
2. dracut --force --add-drivers aoe --add-drivers NIC_MODULE /boot/initramfs-KERNEL_VER.img KERNEL_VER
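Either way, you can confirm that the drivers actually landed in the image with lsinitrd (shipped with dracut); substitute your real kernel version and NIC module name:

lsinitrd /boot/initramfs-KERNEL_VER.img | grep -E 'aoe|NIC_MODULE'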

6. Now that we have our new initrd, we need to transfer the kernel and initrd to our TFTP server (a sample PXELINUX entry follows these steps):

1. scp /boot/initramfs-KERNEL_VER.img /boot/vmlinuz-KERNEL_VER TFTP_SERVER:/path/to/tftp
2. Make sure that permissions are set correctly (should be chmod 644)
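For reference, a PXELINUX menu entry for an AoE root might look like the sketch below; the root= device (shelf 0, slot 1, partition 1) and ip=dhcp are assumptions to adjust for your export and network:

LABEL centos7-aoe
    KERNEL vmlinuz-KERNEL_VER
    APPEND initrd=initramfs-KERNEL_VER.img root=/dev/etherd/e0.1p1 ip=dhcp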

7. Before we shut down our CentOS 7 VM, there is one last bit of configuration to adjust. Since we require the network to be active to do anything with our FS, we need to make sure that on shutdown the network isn’t brought down before the FS is unmounted. In CentOS 6, this was as easy as running `chkconfig --level 0123456 on`. CentOS 7 uses systemd, so that solution will not work. Fortunately the fix is simple (and probably works in CentOS 6 too): modify /etc/fstab, adding the option _netdev to each mount point that is backed by AoE (e.g. change defaults to defaults,_netdev, as in the sample line below).
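A hypothetical fstab entry (the AoE device name and filesystem type are assumptions; match your own layout):

/dev/etherd/e0.1p1   /   xfs   defaults,_netdev   0 0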
8. Now we need to get vblade (http://sourceforge.net/projects/aoetools/files/vblade/) and put it on the system exporting the LV with CentOS 7 on it. If you don’t want to install it onto your system, you can just make it and run it from where you extracted it. To export the disk, just run:

1. vbladed -b 1024 -m MAC 0 1 INTERFACE /path/to/disk (for information on what each option does, check out the vblade man page, which is included with the package)
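For example (the client MAC, interface, and LV path here are illustrative; -m restricts access to the listed MAC, and shelf 0 slot 1 matches the root= device above):

vbladed -b 1024 -m 52:54:00:aa:bb:cc 0 1 eth0 /dev/vg0/centos7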

And that’s it! At this point I recommend running dracut on your netbooted hardware to make sure everything still loads (you shouldn’t have to explicitly add the AoE module or your NIC module anymore, so verify that this is true). Don’t forget to copy the newly created initrd to the TFTP server, and I recommend not overwriting your working initrd so that you can easily fall back to a known working state.

What took me days to get working should now only take you an hour or so! I tried to be as detailed as I could, and I don’t think I left anything out, but if you have any issues, please let me know and I’ll update this guide accordingly.

Show the virtual machine name in dstat instead of showing qemu

Do you run dstat to watch Linux KVM hypervisors, but wish process names showed virtual machine names?  Me too.

This patch does just that:

--- a/usr/bin/dstat	2009-11-24 01:30:11.000000000 -0800
+++ b/usr/bin/dstat	2014-11-07 10:20:09.719148833 -0800
@@ -1946,6 +1946,12 @@
         return os.path.basename(name)
     return name

+def index_containing_substring(the_list, substring):
+	for i, s in enumerate(the_list):
+		if substring in s:
+			return i
+	return -1
+
 def getnamebypid(pid, name):
     ret = None
     try:
@@ -1956,6 +1962,10 @@
         if ret.startswith('-'):
             ret = basename(cmdline[-2])
             if ret.startswith('-'): raise
+        if any("qemu" in s for s in cmdline):
+            idx = index_containing_substring(cmdline, '-name')
+            if idx >= 0:
+                ret = cmdline[idx+1]
         if not ret: raise
     except:
         ret = basename(name)
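After applying the patch, any dstat plugin that prints process names (such as --top-cpu) should show the string passed to qemu's -name option instead of "qemu". A quick check:

dstat --top-cpu --top-mem 5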

Forcing insserv to start sshd early

Many distributions use `insserv`-based dependency ordering at boot time.  After a bit of searching, I found very little actual documentation on the subject.  Here’s the process:

  1. Add override files to /etc/insserv/overrides/
  2. The files must contain ‘### BEGIN INIT INFO’ and ‘### END INIT INFO’, else insserv will ignore them.
  3. Some have indicated that you can override missing LSB fields with this method; however, it does require the Default-Start and Default-Stop options, even though you wouldn’t expect to need to override those.
  4. The name of the file in /etc/insserv/overrides must be equal to the name in /etc/init.d, *not* the name it “Provides:”.  In an ideal world the name would be the same as Provides, but in this case that isn’t always so (see the sample override below).
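A minimal override saved as /etc/insserv/overrides/ssh (the field values here are hypothetical; adjust Required-Start to taste) might look like:

### BEGIN INIT INFO
# Provides:          sshd
# Required-Start:    $remote_fs $syslog netplug
# Required-Stop:     $remote_fs $syslog
# Default-Start:     2 3 4 5
# Default-Stop:      0 1 6
# Short-Description: OpenBSD Secure Shell server
### END INIT INFO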

For my purpose, I created overrides for all of my services in rc2.d with this script.  Note that the overrides are just copies of the content from the /etc/init.d/ scripts:

cd /etc/rc2.d
# This is one long line; $f is filename, $p is the Provides value.
grep Provides * | cut -f1,3 -d: | tr -d : | while read f p; do perl -lne '$a++ if /BEGIN INIT INFO/; print if $a; $a-- if /END INIT INFO/' $f > /etc/insserv/overrides/$p;done

Note that the script writes the filename from the “Provides” field, so you may need to rename files where /etc/init.d/script doesn’t match the Provides field.  Notably, Debian Wheezy does not follow this for ssh: Provides is sshd, but the script is named ssh.

Next, I append sshd to the Required-Start line of all of my overrides:

cd /etc/insserv/overrides/
perl -i -lne 's/(Required-Start.*)$/$1 sshd/; print' *

This of course creates a cyclic dependency for ssh, so fix that one up by hand.  Feel free to apply any other boot-order preferences while you’re in the overrides directory.  For this case, ssh was made dependent on netplug.

Finally, run `insserv` and double-check that it did what you expected:

# cat /etc/init.d/.depend.start
TARGETS = rsyslog munin-node killprocs motd sysfsutils sudo netplug rsync ssh mysql openvpn ntp wd_keepalive apache2 bootlogs cron stop-readahead-fedora watchdog single rc.local rmnologin
INTERACTIVE =
netplug: rsyslog
rsync: rsyslog
ssh: rsyslog netplug
mysql: rsyslog ssh
ntp: rsyslog ssh
[...snip...]

Voilà!  Now I can ssh to the host far earlier, before services that can take a long time to start; this makes troubleshooting much easier when something goes wrong.  In my opinion, ssh should always start directly after the network comes up.

-Eric

Tightening CentOS/RHEL Security

While there is far more to hardening a server than this single example, this is an often overlooked security issue in many default installations of RHEL and RHEL-based distributions (CentOS, Scientific Linux, etc.)

CentOS and RHEL come with the isdn4k-utils and coolkey packages installed by default for graphical workstations.  Unfortunately, these packages create world-writable directories from which binaries and scripts may be executed.  While it is common to tighten /tmp, /var/tmp and /usr/tmp against execution attacks, these directories often go unnoticed.
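To spot world-writable directories (lacking the sticky bit) on the root filesystem, an audit along these lines works:

find / -xdev -type d -perm -0002 ! -perm -1000 -exec ls -ld {} \;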

If you do not use these utilities (and few servers do), they can be easily removed:

yum remove isdn4k-utils coolkey

Of course, if you are using these packages, then you should find a way to secure these mountpoints with the noexec mount option.  This can be done with a loopback filesystem mounted atop the offending mountpoints or with separate LVM volumes for each.
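For the loopback approach, a rough sketch (the file location, size, and target directory are illustrative):

# create and format a backing file sized for the directory's contents
dd if=/dev/zero of=/root/noexec-fs.img bs=1M count=512
mkfs.ext4 -F /root/noexec-fs.img
# mount it over the offending directory with restrictive options
mount -o loop,noexec,nosuid,nodev /root/noexec-fs.img /path/to/offending-dir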

Traditionally, /var does not contain executable code, so you could mount the entire /var mountpoint as noexec.  It’s a great security practice if you can support this; however, some packages expect to run their update scripts out of /var/tmp/, so be prepared to fix some broken package updates or installations.  When you do hit a package error, simply remount /var as executable:

mount -o remount,rw,exec /var

install the package, and then disable execution on the mountpoint:

mount -o remount,rw,noexec /var

I recommend nosuid and nodev mount options for these types of mount points as well to restrict less common attack vectors.

-Eric

Bypassing the link-local routing table

Linux can use multiple routing tables, which is convenient for providing different routes for specific networks based on many different metrics, such as the source address.  For example, to route traffic from 192.168.99.0/24 out a different default gateway (172.17.22.1), you could create a new table and route it as such:

# ip route add default via 172.17.22.1 dev eth7 table 100
# ip rule add from 192.168.99.0/24 lookup 100

Now imagine another scenario, where you wish to route traffic from 192.168.99.0/24 to an external network (the Internet), but 1.2.3.0/24 is (for some reason) link-local on your host.  That is, an address like 1.2.3.4 is directly assigned to an adapter on your host.  Linux tracks link-local connections through its ‘local’ routing table, and `ip rule` shows the preference order as:

# ip rule show
0:    from all lookup local 
32766:    from all lookup main 
32767:    from all lookup default

You might think that deleting the ‘local’ rule above and re-adding it with a higher preference value, then placing your new rule above it, would fix the problem, but I’ve tried it—and it doesn’t.  Searching around shows that others have had the same problem.

So what to do?  Use fwmark.

First, change local’s preference from 0 to 100:

ip rule del from all pref 0 lookup local
ip rule add from all pref 100 lookup local

Next, mark all traffic from 192.168.99.0/24 with some mark; we are using “1”.  Note that I am using the OUTPUT chain because 192.168.99.0/24 is local to this host.  You might want PREROUTING if this is a forwarding host.

iptables -t mangle -A OUTPUT -s 192.168.99.0/24 -j MARK --set-mark 1

And finally add the rule that routes it through table 100:

# ip rule add fwmark 1 pref 10 lookup 100
# ip rule show
10:    from all fwmark 0x1 lookup 100
100:    from all lookup local
32766:    from all lookup main
32767:    from all lookup default

# ip route flush cache

Now all locally generated traffic to 1.2.3.0/24 from 192.168.99.0/24 will head out 172.17.22.1 on eth7 through table 100, instead of being looked up in the ‘local’ table.
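You can sanity-check the new lookup without sending any packets; `ip route get` accepts both a source address and an fwmark (addresses here are illustrative):

# ip route get 1.2.3.4 from 192.168.99.10 mark 1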

Yay!

-Eric


Quickly fill a disk with random bits (without /dev/urandom)

When an encrypted medium is prepared for use, it is best practice to fill the disk end-to-end with random bits.  If the disk is not prepared with random bits, then an attacker can see which blocks have and have not been written, simply by running a block-by-block statistical analysis:  if the average 1/0 ratio is near 50%, it’s probably encrypted.  It may be even simpler than this for new disks, since they tend to ship with all zeros.

This is a well-known problem, and many will encourage you to use /dev/urandom to fill the disk.  Unfortunately, /dev/urandom is much slower than even rotational disks, let alone GB/sec RAID on SSDs:

root@geekdesk:~# dd if=/dev/urandom of=/dev/null bs=1M count=100
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 7.24238 s, 14.5 MB/s
root@geekdesk:~#

So how can we fill a block device with random bits, quickly?  The answer might be surprising:  we use /dev/zero—but write to the encrypted device.  Once the encrypted device is full, we erase the LUKS header with /dev/urandom.  The second step is of course slower, but we need only overwrite the first 1MB so it takes a fraction of a second.

Note that the password we are using (below) needn’t be remembered—in fact, you shouldn’t be able to remember it.  Use something long and random for a password, and keep it just long enough to erase the volume.  I use base64 from /dev/urandom for password generation:

# 256 random bits
dd if=/dev/urandom bs=1 count=32 | base64

Now format the volume and map it with luksOpen.  Note that we are not using a filesystem—this is all at the block layer:

root@geekdesk:~# cryptsetup luksFormat /dev/loop3
This will overwrite data on /dev/loop3 irrevocably.
Are you sure? (Type uppercase yes): YES
Enter LUKS passphrase: <random one-time-use password>
Verify passphrase:
root@geekdesk:~# cryptsetup luksOpen /dev/loop3 testdev
Enter passphrase for /dev/loop3: <same password as above>

root@geekdesk:~# dd if=/dev/zero of=/dev/mapper/testdev bs=1M
dd: writing `/dev/mapper/testdev': No space left on device
99+0 records in
98+0 records out
103804928 bytes (104 MB) copied, 1.21521 s, 85.4 MB/s

See, nearly 6x faster (the disk itself is most likely the 85MB/s bottleneck)!  This will save hours (or days) when preparing multi-terabyte volumes.  Now remove the device mapping, and urandom the first 1MB of the underlying device:

# This line is the same as "cryptsetup luksClose testdev"
root@geekdesk:~# dmsetup remove /dev/mapper/testdev
root@geekdesk:~# dd if=/dev/urandom of=/dev/loop3 bs=512 count=2056
2056+0 records in
2056+0 records out
1052672 bytes (1.1 MB) copied, 0.0952705 s, 11.0 MB/s

Note that we overwrote the first 2056 blocks from /dev/urandom.  2056 sectors (512 bytes each) is the default LUKS payload offset, but you can verify that you’ve overwritten the correct number of blocks using luksDump:

root@geekdesk:~# cryptsetup luksDump /dev/loop3
LUKS header information for /dev/loop3

Version:           1
Cipher name:       aes
Cipher mode:       cbc-essiv:sha256
Hash spec:         sha1
Payload offset:    2056
[...snip...]

Now your volume is prepared with random bits, and you may format it with any cryptographic block-device mechanism you prefer, safe knowing that an attacker cannot tell which blocks are empty, and which are in use (assuming the attacker has a single point-in-time copy of the block device).

I like LUKS since it is based on PKCS#5 (PBKDF2) and includes features such as multiple passphrase slots and passphrase changes (it never reveals the actual device key; your passphrase unlocks the real key), but other volume-encryption mechanisms exist—or you might export the volume via iSCSI/ATAoE/FCoE and use a proprietary block-layer encryption mechanism.

If someone can explain an attack against this mechanism, I would be glad to hear about it.  In this example we used AES in CBC mode so we are spreading the IV bits across the entire volume.  Conceivably one could write an AES-CTR mode tool with a random key to do the same thing and this may be a stronger mechanism.  (To my knowledge, the dm-crypt toolchain does not have a CTR mode, nor would you want one for general use).

The method above fails if an attacker can tell the difference between the original AES-CBC wipe with random bits (where all plaintext bits are set to zero) and the new encryption mechanism, with a different key, that will be used in production atop the prepared disk volume.  While there may be an attack against AES-CBC with all zeros (though I don’t think there is), AES-CTR mode by its definition would make this method more effective, since each block is independent of the next.  One might be able to argue that AES-CBC creates an AES-CTR mode implementation where the counter is a permutation of AES itself.  If that can be proven, then both methods are equally secure.

Either way, this is likely better (and definitely faster) than /dev/urandom for filling a disk, since /dev/urandom is a pseudo-random number generator.  Using /dev/urandom for terabytes of data may begin to develop a pattern once its effective entropy pool is spread too thin.  Even with seed-help from /dev/random, /dev/urandom might run out of steam.

In the end, random bits XORed with random bits still look like random bits when placed next to other random bits—but you’re welcome to debate this.  Yay for crypto!

ok, now back to work 🙂

-Eric

Edit: Mon Sep 17 19:43:22 PDT 2012
Come to think of it, you don’t even need a password at the luksFormat stage.  LUKS generates its own strong random bits for the actual block-cipher key.  The passphrase just unlocks that key.  For the purposes of wiping the disk with random bits, you can use “<enter><enter>” as your passphrase… just make sure you wipe the LUKS header in step 2 from /dev/urandom.
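Putting the whole wipe together non-interactively, a sketch (the device path is illustrative; the one-time key file lives in RAM and is discarded afterward):

# one-time random key; never stored on disk
dd if=/dev/urandom of=/dev/shm/wipe.key bs=32 count=1
cryptsetup luksFormat -q /dev/sdX /dev/shm/wipe.key
cryptsetup luksOpen --key-file /dev/shm/wipe.key /dev/sdX wipe
dd if=/dev/zero of=/dev/mapper/wipe bs=1M
cryptsetup luksClose wipe
rm /dev/shm/wipe.key
# erase the LUKS header (default payload offset, as above)
dd if=/dev/urandom of=/dev/sdX bs=512 count=2056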

Block Device Replication with rdiff

I’ve written a few articles on rdiff-backup, and if you need an increment history to go back in time, rdiff-backup is your tool. 

But what if you just want to replicate a large block device over the Internet? Well, then we turn to the utility that inspired rdiff-backup: rdiff.

For our example, we will assume you are using LVM to create device snapshots—but really, this could be any snapshot or SAN flash implementation. I’ve just written it for Linux’s LVM.

  • /dev/remote-vg0/source will be the device we are replicating from
  • /dev/local-vg0/dest will be the device we are replicating to
  • remotehost is the system that hosts /dev/remote-vg0/source
  • This script is being executed on the destination system.

# Define our source and destination
# (Note: spaces in these paths could break the script)
SOURCE=/dev/remote-vg0/source
DEST=/dev/local-vg0/dest
SSHUSER=root@remotehost

# Choose a size large enough for the remote write-activity during
# replication
SOURCE_SNAPSHOT_SIZE=4G

# Must be the same size as $DEST, because rdiff writes in sequential
# order (it thinks the destination is an empty file, so it re-writes
# everything.)
#
# See Feb 18, 2011 update notes below.  This can be much smaller now if you 
# use the patch below, since writes are avoided unless necessary.
#DEST_SNAPSHOT_SIZE=50G

# This is probably safe with the librsync patch discussed below
DEST_SNAPSHOT_SIZE=$SOURCE_SNAPSHOT_SIZE

# Enable compression
SSHOPTS='-C'

# 32k I/O buffers, and 16k blocksize.
RDIFF_OPT='-I 32768 -O 32768 -b 16384 -s'

SOURCE_NAME=`basename "$SOURCE"`
SOURCE_SNAP="`dirname $SOURCE`/$SOURCE_NAME-snap"
SOURCE_SNAP_NAME="$SOURCE_NAME-snap"

DEST_NAME=`basename "$DEST"`
DEST_SNAP="`dirname $DEST`/$DEST_NAME-snap"
DEST_SNAP_NAME="$DEST_NAME-snap"

# remove the previous snapshots, if any
ssh $SSHOPTS "$SSHUSER" "lvremove -f '$SOURCE_SNAP'"
lvremove -f "$DEST_SNAP"

# Snapshot the remote host:
ssh $SSHOPTS "$SSHUSER" "lvcreate -s -n '$SOURCE_SNAP_NAME' -L $SOURCE_SNAPSHOT_SIZE '$SOURCE'"

# Snapshot the local destination host:
lvcreate -s -n "$DEST_SNAP_NAME" -L $DEST_SNAPSHOT_SIZE "$DEST"

rdiff $RDIFF_OPT -- signature "$DEST_SNAP" - | \
  ssh $SSHOPTS "$SSHUSER" "rdiff $RDIFF_OPT -- delta - '$SOURCE_SNAP' -" | \
  rdiff $RDIFF_OPT -- patch "$DEST_SNAP" - "$DEST"

# Compare the volumes, if you like
ssh $SSHOPTS "$SSHUSER" "md5sum '$SOURCE_SNAP'"
md5sum $DEST

# cleanup, remove the snapshots.
ssh $SSHOPTS $SSHUSER "lvremove -f '$SOURCE_SNAP'"
lvremove -f "$DEST_SNAP"

This is a convenient single-pipe replication process; it uses librsync’s rolling checksums, so minimal bandwidth crosses the network that ssh traverses.

Executing this script yields something like this on a 50GB volume; note that the md5sums match perfectly.

  Logical volume "source-snap" created
  Logical volume "dest-snap" created
rdiff: signature statistics: signature[3276800 blocks, 16384 bytes per block]
rdiff: loadsig statistics: signature[3276800 blocks, 16384 bytes per block]
rdiff: delta statistics: literal[27842 cmds, 805715968 bytes, 83492 cmdbytes] copy[1462034 cmds, 52881375232 bytes, 372798 false, 10746386 cmdbytes]
rdiff: patch statistics: literal[27842 cmds, 805715968 bytes, 83492 cmdbytes] copy[1462034 cmds, 52881375232 bytes, 0 false, 10746386 cmdbytes]
7fddc578cdbf5f4e30b7f815e72acebd  /dev/local-vg0/dest
7fddc578cdbf5f4e30b7f815e72acebd  /dev/remote-vg0/source-snap
  Logical volume "source-snap" successfully removed
  Logical volume "dest-snap" successfully removed

Since rdiff does not know the destination is a snapshot of the basis file, it rewrites the whole thing. Indeed, it could simply seek past unchanged data instead of copying from the basis file, but the stock rdiff tool does not support this. If I write a patch, it will get posted here—and if you write a patch, please let me know! (see update below!)

Until then, keep free space in your volume group equal to double what you would normally need for a snapshot, and it should work great!

Update: Fri Feb 18 12:39:51 PST 2011 I just wrote a patch for rdiff (within the librsync package) that updates in place, by patching the file-stream-sink code in buf.c. Basically, it reads before writing: if the data read in is the same as what it would have written, it skips the write and advances the write pointer; otherwise, it writes as normal. Since this avoids writing to the snapshotted device except where necessary, much less snapshot backing store is required. This code passes all of the ‘make check’ tests that come with librsync, and I believe it to be stable. On my system, rdiff syncs are about 2x faster thanks to the much-reduced write overhead compared to the original implementation.

  • The patch is here
  • and the patched code, ready to compile, is here

-Eric

Digium Multi-link PPP Support: DAHDI T1 Bonding

Getting started

I have had quite the challenge recently using Digium’s multi-port T1 cards for link bonding. The plan is to have multiple links from multiple cards go to multiple locations and provide aggregate bandwidth and fault tolerance. That is, I want to be able to unplug any wire from any bonded link and—except for reduced bandwidth from the missing link—have the network continue to operate as if nothing happened: telephone conversations must continue and open TCP sockets must stay open.

Prior Work

I found this guide in my search for multi-link PPP; however, it was written for the Zaptel device driver, which was renamed to DAHDI due to trademark issues back in 2008. This means the Zaptel documentation is at least two years old, and DAHDI has come a long way since then.

Still, this guide is nearly sufficient to provide redundant links. The challenge one might experience in using this Zaptel guide on modern DAHDI drivers is the lack of detailed documentation. Thus, the motivation for this article.

Sangoma, a competing T1/E1/J1 card manufacturer, released a modified version of pppd to manage multi-link PPP under Linux in a more reliable way. From reading the code, it appears that Sangoma modified pppd so that it exits if it loses its multilink bundle—and uses a wrapper script, pppmon, to restart the daemon upon failure.

This is not ideal, since we would like the pppd daemon to keep the pppX interface up even if the multilink bundle drops in order to keep routes in place, as the Linux kernel will drop all routes through an interface when that interface is removed. As it turns out, this is a much more difficult challenge than it sounds.

The Case of the Missing Route

There are two complete-failure scenarios (that is, complete failure of the multilink bundle) that we would like to recover from seamlessly:

  • Both remote servers stay up, but all of the links in the bundle drop (i.e., unplug/replug).
  • One pppd daemon drops (perhaps the endpoint reboots), and the other side stays up.

The first does not need LCP negotiation, and existing PPP state can continue upon recovery. Simply using the pppd persist argument is sufficient here after removing Sangoma’s “exit-on-error” logic; however, this presents a challenge for the latter scenario:

If the side that remains available (i.e., stays powered on) keeps its state, then the remote side will attempt an LCP handshake when it reboots (or re-launches pppd). Since the surviving side assumes existing state, it gets confused by LCP request frames coming down the PPP link, and the link becomes inoperable.

Thinking, “well, why not just re-initialize the link whenever a multilink bundle fails,” I added new_phase(PHASE_INITIALIZE) at the point where the multilink bundle is lost; this is nearly the same as re-executing pppd, but it keeps the associated pppX interface—and its associated routes—alive and kicking. This worked well when the remote end reboots completely, but then the first failure scenario of unplug/replug does not recover: the DAHDI pppd plugin attempts to re-initialize the master channel carrying the PPP link and throws “device or resource busy” errors.

The “Solution”

It turns out the easiest “fix” for this is to write a wrapper shell script around pppd running with persistence off and in the foreground (pppd’s nopersist and nodetach options). You can background the scripts and manage them in a hand-crafted way, or perhaps with a SysV script; a minimal sketch follows. I added the config to the end of /etc/rc.local, using the “hub server” to assign IP addresses so that all of the clients dynamically pick up their addressing. This means any tech can plug a unit in and it will train up with the proper addressing no matter which port the T1 lines are plugged into. I also used the watchdog daemon to reboot the PPP router if it hasn’t gotten a ping response in X seconds, starting X*2 seconds after boot.
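A minimal sketch of that wrapper, assuming your DAHDI/multilink options live in a hypothetical /etc/ppp/options.t1 file:

#!/bin/bash
# restart pppd whenever it exits (link or bundle failure)
# nodetach keeps pppd in the foreground; nopersist makes it exit on
# failure so this loop can re-initialize everything from scratch
while true; do
    pppd nodetach nopersist file /etc/ppp/options.t1
    logger -t ppp-wrapper "pppd exited; restarting in 5 seconds"
    sleep 5
done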

This is perhaps not ideal since it drops routes when pppX is downed, but the watchdog will reboot the system if access is really lost.

Future Work

It would be great if someone were to patch the mainline pppd to support graceful recovery from the two failure scenarios listed above—without bringing the pppX interface down and losing the network routes. Email me if you’re curious and I will point you in the right direction. After a few hours poking at the pppd code, I suspect this change is not trivial, since multi-link PPP appears to be a hack bolted onto pppd at the moment.

-Eric