IP Failover (proof of concept)
Essentially the idea is that if your primary web server goes down, your backup server automatically takes over (failover). When your primary server is back online, it takes over again (failback). It sounds simple, but it can be very complicated and expensive to eliminate all the single points of failure – you could need multiple redundant routers, switches, servers, power supplies, UPS and storage all on separate networks.
The simple setup I am proposing uses a minimum of two servers; the primary and the backup, both on separate networks.
An alternative to IP failover (where the IP is changed) is IP takeover (where the backup server actually takes on the IP address of the failed primary), but for this to work, the IP addresses need to have the same router, so would need to be at the same web host. That’s fine if the primary server failure issue is local hardware, but if it’s a power failure in the datacentre or a problem with the network, a switch, a router or any connection problems, transit or peering, then both servers would have the same problem, so it makes more sense to make them geographically disparate on entirely separate networks.
It should be noted that IP failover services are provided by sites like dnsmadeeasy.com and zoneedit.com – this would be much simpler to set up, but of course you pay for the privilege.
There are two options for the backup server. Either it just displays a “service currently not available” type page (simple to set up) or it is a complete mirror of the primary server. For my purposes it is a status page, but it is perfectly possible to replicate the primary server if necessary – you could run rsync (preferably through an SSH tunnel with a key pair), something along the lines of “rsync -avz –delete -e “ssh -i /root/rsync/mirror-rsync-key” /home/website user@server.com:/home/website” run every few minutes by cron – with MySQL replication for the databases, for example. Google has multiple articles and how-tos for both scenarios.
A possible issue with a replicated server is if users change files via FTP or change the database using a database script when the backup server is active, you would either have to prevent them doing so or make sure you mirror those changes back to the primary server when it comes back online. Unless of course you are using a different server for the database and shared storage for the web files.
This is how I propose to do it: there’s a script on the backup server that monitors the primary server (a heartbeat-type service, run as a PHP script every few minutes by cron). The script can optionally check with any other spare servers to see if they can contact the primary server (to make sure it is definitely down). If the primary server is definitely down, the backup server updates the DNS entries for the server domain(s) to point to itself (with a low TTL in case the primary comes back online).
It’s simplest if the primary server is also the primary DNS server and the backup server is the secondary DNS server, then if a user cannot connect to the primary server, it also cannot get the incorrect DNS records (assuming the whole server is down, not just the web service, which the heartbeat script should check).
Major caveat: some ISPs may ignore the TTL of DNS records and cache the wrong results for too long, but nothing can be done about that.
Updating the DNS records could be done using the nsupdate command, but there may be issues with permissions both to run the program and for the server to update the DNS records, so it’s simpler if the backup is running Virtualmin, which comes with a remote API which can be called from a PHP script.
This is the flowchart of what happens:
BACKUP checks PRIMARY is up every x mins and loads previous state from database:
> PRIMARY was up and is still up – do nothing
> PRIMARY was up and is now down:
> check with SPARE servers:
> cannot contact SPARES – internet probably down at BACKUP – do nothing
> at least 1 SPARE reports PRIMARY up – network issues – do nothing
> otherwise – update DNS and log to database
> PRIMARY was down and is still down – do nothing
> PRIMARY was down and is back up – update DNS and log to database
This is how the PHP script could update the DNS records:
<?php
//if primary down, failover IP addresses, remove www record then add new one:
$result1 = shell_exec("wget -O - --quiet --http-user=root --http-passwd=root_pass --no-check-certificate 'https://www.backupdomain.com:10000/virtual-server/remote.cgi?program=modify-dns&domain=primarydomain.com&remove-record=www.primarydomain.com. A'");
$result2 = shell_exec("wget -O - --quiet --http-user=root --http-passwd=root_pass --no-check-certificate 'https://www.backupdomain.com:10000/virtual-server/remote.cgi?program=modify-dns&domain=primarydomain.com&ttl=60&add-record=www.primarydomain.com. A 1.2.3.4'");
//echo $result2; //should end with: Exit status: 0 if successful
// if primary back online, reverse the process (though the primary as the primary DNS server should eventually update the record on the secondary DNS server anyway)
?>
I’ll post more actual examples when I get around to implementing this 🙂