Sorting out backup internet #3: failover

With local recursive DNS and a 5G modem in place the next thing was to work on some sort of automatic failover when the primary FTTP connection failed. My wife works from home too and I sometimes travel so I wanted to make sure things didn’t require me to be around to kick them into switch the link in use.

First, let’s talk about what I didn’t do. One choice to try and ensure as seamless a failover as possible would be to get a VM somewhere out there. I’d then run Wireguard tunnels over both the FTTP + 5G links to the VM, and run some sort of routing protocol (RIP, OSPF?) over the links. Set preferences such that the FTTP is preferred, NAT v4 to the VM IP, and choose somewhere that gave me a v6 range I could just use directly.

This has the advantage that I’m actively checking link quality to the outside work, rather than just to the next hop. It also means, if the failover detection is fast enough, that existing sessions stay up rather than needing re-established.

The downsides are increased complexity, adding another point of potential failure (the VM + provider), the impact on connection quality (even with a decent endpoint it’s an extra hop and latency), and finally the increased cost involved.

I can cope with having to reconnect my SSH sessions in the event of a failure, and I’d rather be sure I can make full use of the FTTP connection, so I didn’t go this route. I chose to rely on local link failure detection to provide the signal for failover, and a set of policy routing on top of that to make things a bit more seamless.

Local link failure turns out to be fairly easy. My FTTP is a PPPoE configuration, so in /etc/ppp/peers/aquiss I have:

lcp-echo-interval 1
lcp-echo-failure 5
lcp-echo-adaptive

Which gives me a failover of ~ 5s if the link goes down.

I’m operating the 5G modem in “bridge” rather than “router” mode, which means I get the actual IP from the 5G network via DHCP. The DHCP lease the modem hands out is under a minute, and in the event of a network failure it only hands out a 192.168.254.x IP to talk to its web interface. As the 5G modem is the last resort path I choose not to do anything special with this, but the information is at least there if I need it.

To allow both interfaces to be up and the FTTP to be preferred I’m simply using route metrics. For the PPP configuration that’s:

defaultroute-metric 100

and for the 5G modem I have:

iface sfp.31 inet dhcp
    metric 1000
    vlan-raw-device sfp

There’s a wrinkle in that pppd will not replace an existing default route, so I’ve created /etc/ppp/ip-up.d/default-route to ensure it’s added:

#!/bin/bash

[ "$PPP_IFACE" = "pppoe-wan" ] || exit 0

# Ensure we add a default route; pppd will not do so if we have
# a lower pref route out the 5G modem
ip route add default dev pppoe-wan metric 100 || true

Additionally, in /etc/dhcp/dhclient.conf I’ve disabled asking for any server details (DNS, NTP, etc) - I have internal setups for the servers I want, and don’t want to be trying to select things over the 5G link by default.

However, what I do want is to be able to access the 5G modem web interface and explicitly route some traffic out that link (e.g. so I can add it to my smokeping tests). For that I need some source based routing.

First step, add a 5g table to /etc/iproute2/rt_tables:

16  5g

Then I ended up with the following in /etc/dhcp/dhclient-exit-hooks.d/modem-interface-route, which is more complex than I’d like but seems to do what I want:

#!/bin/sh

case "$reason" in
    BOUND|RENEW|REBIND|REBOOT)
        # Check if we've actually changed IP address
        if [ -z "$old_ip_address" ] ||
           [ "$old_ip_address" != "$new_ip_address" ] ||
           [ "$reason" = "BOUND" ] || [ "$reason" = "REBOOT" ]; then
            if [ ! -z "$old_ip_address" ]; then
                ip rule del from $old_ip_address lookup 5g
            fi
            ip rule add from $new_ip_address lookup 5g

            ip route add default dev sfp.31 table 5g || true
            ip route add 192.168.254.1 dev sfp.31 2>/dev/null || true
        fi
    ;;

    EXPIRE)
        if [ ! -z "$old_ip_address" ]; then
            ip rule del from $old_ip_address lookup 5g
        fi
    ;;

    *)
    ;;
esac

What does all that aim to do? We want to ensure traffic directed to the 5G WAN address goes out the 5G modem, so I can SSH into it even when the main link is up. So we add a rule directing traffic from that IP to hit the 5g routing table, and a default route in that table which uses the 5G link. There’s no configuration for the FTTP connection in that table, so if the 5G link is down the traffic gets dropped, which is what we want. We also configure 192.168.254.1 to go out the link to the modem, as that’s where the web interface lives.

I also have a curl callout (curl --interface sfp.31 … to ensure it goes out the 5G link) after the routes are configured to set dynamic DNS with Mythic Beasts, which helps with knowing where to connect back to. I seem to see IP address changes on the 5G link every couple of days at least.

Additionally, I have an entry in the interfaces configuration carving out the top set of the netblock my smokeping server is in:

    up ip rule add from 192.0.2.224/27 lookup 5g

My smokeping /etc/smokeping/config.d/Probes file then looks like:

*** Probes ***

+ FPing

binary = /usr/bin/fping

++ FPingNormal

++ FPing5G

sourceaddress = 192.0.2.225

+ FPing6

binary = /usr/bin/fping

which allows me to use probe = FPing5G for targets to test them over the 5G link.

That mostly covers the functionality I want for a backup link. There’s one piece that isn’t quite solved, however, IPv6, which can wait for another post.