Digium Multi-link PPP Support: DAHDI T1 Bonding

Getting started

I have had quite the challenge recently using Digium’s multi-port T1 cards for link bonding. The plan is to have multiple links from multiple cards go to multiple locations and provide aggregate bandwidth and fault tolerance. This means I should be able to unplug any wire from any bonded link and, except for the bandwidth lost with that link, the network should continue to operate as if nothing happened: telephone conversations must continue and open TCP sockets must stay open.

Prior Work

I found this guide in my search for multi-link PPP. However, it was written for the Zaptel device driver, which was renamed to DAHDI due to trademark issues back in 2008. This means the Zaptel documentation is at least 2 years old, and DAHDI has come a long way since then.

Still, this guide is nearly sufficient to provide redundant links. The challenge in applying the Zaptel guide to modern DAHDI drivers is the lack of detailed documentation, which is the motivation for this article.

Sangoma, a competing T1/E1/J1 card manufacturer, released a modified version of pppd to manage multi-link PPP under Linux in a more reliable way. From reading the code, it appears that Sangoma modified pppd so that it exits if it loses its multilink bundle, and uses a wrapper script, pppmon, to restart the daemon upon failure.

This is not ideal. We would like pppd to keep the pppX interface up even if the multilink bundle drops, so that routes stay in place; the Linux kernel drops all routes through an interface when that interface is removed. As it turns out, this is a much more difficult challenge than it sounds.

The Case of the Missing Route

There are two complete-failure scenarios (that is, complete failure of the multilink bundle) that we would like to recover from seamlessly:

  • Both remote servers stay up, but all links in the bundle drop (i.e., unplug/replug).
  • One pppd drops (perhaps the endpoint reboots) while the other side stays up.

The first does not need LCP negotiation, and existing PPP state can continue upon recovery. Simply using the pppd persist argument is sufficient here after removing Sangoma’s “exit-on-error” logic. However, this presents a challenge for the latter scenario:

If the side that remains available (i.e., turned on) keeps its state, then the remote side will attempt an LCP handshake when it reboots (or re-launches pppd). Since the side that stayed up is assuming existing state, it gets confused by LCP request frames coming down the PPP link, and the link becomes inoperable.

Thinking, “well, why not just re-initialize the link whenever a multilink bundle fails,” I added new_phase(PHASE_INITIALIZE) at the point where the multilink bundle is lost. This is nearly the same as re-executing pppd, but it keeps the associated pppX interface, and its associated routes, alive and kicking. This worked well when the remote end reboots completely, but then the first failure scenario of unplug/replug does not recover: the DAHDI pppd plugin attempts to re-initialize the master channel carrying the PPP link and throws “device or resource busy” errors.

The “Solution”

It turns out the easiest “fix” for this is to write a wrapper shell script around pppd with no-persist and no-fork, so that pppd stays in the foreground, exits when the link fails, and is simply started again by the script. You can background the scripts and manage them by hand or perhaps with a SysV init script. I added the config to the end of /etc/rc.local, using the “hub server” to assign IP addresses so that all of the clients pick up their addresses dynamically. This means any tech can plug a unit in and it will train up with the proper addressing no matter which port the T1 lines are plugged into. I also used the watchdog daemon to reboot the PPP router if it hasn’t gotten a ping response in X seconds, X*2 seconds after having booted.
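For anyone who would rather not write the wrapper in shell, the restart loop can be sketched in a few lines of Python. This is only a sketch under assumptions, not the script I actually ran: the function name supervise, the restart delay, and the peer name t1-bundle in the comment are all made up for illustration, and I am assuming that the stock pppd options nodetach and nopersist are the equivalents of “no-fork” and “no-persist” above.

```python
import subprocess
import time


def supervise(cmd, restart_delay=5.0, max_runs=None,
              run=subprocess.call, sleep=time.sleep):
    """Run cmd in the foreground and start it again every time it exits.

    cmd           -- argument list for the command to supervise
    restart_delay -- seconds to wait between an exit and the next start
    max_runs      -- stop after this many runs (None means loop forever)
    run, sleep    -- injectable for testing; default to the real thing

    Returns the list of exit codes seen (only reachable when max_runs is set).
    """
    exit_codes = []
    runs = 0
    while max_runs is None or runs < max_runs:
        # Blocks until the child exits, which is why pppd must not detach.
        code = run(cmd)
        runs += 1
        if max_runs is not None:
            exit_codes.append(code)
        # Pause before restarting so a hard failure does not spin the CPU.
        if max_runs is None or runs < max_runs:
            sleep(restart_delay)
    return exit_codes


# Hypothetical invocation (peer name "t1-bundle" is an assumption):
#   supervise(["pppd", "call", "t1-bundle", "nodetach", "nopersist"])
```

The important part is that pppd runs in the foreground and exits on failure, so the loop always starts from a clean LCP negotiation. The price is the one already mentioned: the pppX interface, and its routes, disappear between runs.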

This is perhaps not ideal since it drops routes when pppX is downed, but the watchdog will reboot the system if access is really lost.
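The watchdog rule itself (no ping response in X seconds, checked only once the box has been up for X*2 seconds) boils down to a small decision. The sketch below only illustrates that rule; it is not how the watchdog daemon is implemented, the function and parameter names are invented for the example, and treating “X*2 seconds after having booted” as a startup grace period is one reading of the rule.

```python
def should_reboot(uptime_s, seconds_since_last_reply, x_s):
    """Decide whether the router should be rebooted.

    uptime_s                 -- seconds since the system booted
    seconds_since_last_reply -- age of the last ping reply, or None if
                                no reply has been seen since boot
    x_s                      -- the "X" timeout from the rule, in seconds
    """
    # Grace period: give the links 2*X seconds to train up after boot.
    if uptime_s < 2 * x_s:
        return False
    # Past the grace period with no reply at all: access is lost.
    if seconds_since_last_reply is None:
        return True
    # Otherwise reboot only if the last reply is older than X seconds.
    return seconds_since_last_reply > x_s
```

With X set to 60, for example, a unit that has been up for 90 seconds is never rebooted, while one that has been up for 10 minutes and last heard a reply 2 minutes ago is.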

Future Work

It would be great if someone were to patch the mainline pppd to support graceful recovery from the two failure scenarios listed above, without bringing the pppX interface down and losing the network routes. Email me if you’re curious and I will point you in the right direction. After a few hours poking at the pppd code, I suspect this change is not trivial, since multi-link PPP appears to be a hack into pppd at the moment.

-Eric