Setting up high-availability failover mode

LAN model UCARP-based failover

Access Server comes with a built-in failover mode which can be deployed on a local area network. It is designed to allow one primary node to handle all the tasks, and if it fails, to let a secondary standby node come online automatically and take over the tasks from the failed node. This is done with a method called UCARP using VRRP heartbeat network packets. The two nodes work to keep a single virtual IP address online. Normally the primary node manages this alone but when it goes down the secondary takes over by becoming the new primary node.

Function description

Two servers will be deployed on a local network with private IP addresses, or with public IP addresses, in an environment where VRRP/UCARP traffic can travel from one node to the other without interference. Amazon AWS is ruled out because it blocks this traffic, so you cannot use this failover method there.

The primary node normally handles all traffic on a third, shared virtual IP address (an alias on an existing interface), while the other node, the failover node, stays dormant until it notices that the VRRP/UCARP heartbeat signal that the primary node sends out on the network has ceased. If this lasts more than a few seconds, the failover node takes over. This happens quickly, on the order of 5 to 10 seconds, maybe a bit more, depending on how fast your server is at starting the Access Server service and how much data is in the configuration, certificates, and user properties databases.

Clients will be momentarily interrupted by the failure of the primary node. They will notice that their encrypted sessions are not receiving any new data, and any data they send arrives at the failover node, which doesn't know what to do with it and discards it. After a timeout, usually 30 to 60 seconds, the client decides that the connection has failed and reconnects. It will try to authenticate with the session token from its previous connection to the primary node. The failover node should recognize this session token, accept it, and allow the client to reconnect automatically. Auto-login profiles do not need this, as they don't work with sessions and simply reconnect automatically. All in all, an interruption of about a minute is to be expected, and in almost all cases connectivity should restore automatically. The client connection profiles from one node will be accepted by the other, because the data is synchronized.

An exception regarding auto-reconnect is when the client does not use session tokens for a user-locked profile, or established its connection and received its session token only in the last 30 to 60 seconds before the primary node failed. In the latter case there was no time for the primary node to relay the session information to the failover node. And of course, if the session token has just expired, that also prevents an automatic reconnect. There may be other rare situations where an automatic reconnect fails, mainly because there are so many versions of OpenVPN clients out there, but on the whole, things work as just described.

When you change the configuration of the active node in an Access Server failover pair and use the Update Running Servers button to refresh the configuration in the running server, this only triggers a reload on the active Access Server and does not cause a failover event. Clients that are affected by the configuration change will be asked to disconnect and reconnect by themselves. Clients that are not affected by the changes will remain connected. Configuration changes on one node are automatically copied to the other one. For example, with the primary node online and the secondary node in its dormant standby state, the primary node dumps a copy of the configuration databases to a separate location on the secondary node's file system. When the primary node fails and the secondary node needs to come online, it loads that dump and then starts up. After a failover event it is therefore up to date with the latest information that the primary node was able to copy to it.

After a failover event, the roles are reversed. The secondary node is now the active node, holding the virtual IP and handling all the traffic. If the primary node suffered a failure and spontaneously rebooted, it will not automatically take over again. Instead it goes into a dormant standby state and accepts configuration database dumps from the now-active secondary node. We do this on purpose: if the primary node has failed, it could go into a reboot cycle due to hardware failure, and we don't want each reboot to cause another failover event. So when a failover event has occurred, you will have to intervene manually if you want the primary node to be the active master node again. To do so, ensure the primary node is running normally, and then restart the Access Server service on the secondary node, or reboot it. The primary will then take over again.

Platform compatibility

This method unfortunately does not work on all platforms. For example on Amazon AWS, broadcast UCARP/VRRP traffic is simply filtered away, so this model cannot be used on Amazon AWS.

  • Physical servers should work just fine on physical networks.
  • Microsoft Hyper-V and VMware ESXi are supported, but you may need to enable "MAC spoofing" or "Promiscuous mode".
  • Other virtualized platforms should also work as long as it's a local network where broadcast UCARP/VRRP is possible.
  • Amazon AWS is not supported, because the heartbeat signal is filtered away on their networks.
  • If one node is in a different network from the other node, this failover model can almost certainly not be used.
  • If multiple UCARP/VRRP failover pairs are present in the same network, you must adjust the VHID to be unique.

That last point requires further explanation. The VHID is a number that is sent along in the heartbeat signal that goes onto the local network. The secondary node monitors this heartbeat signal. If there are multiple UCARP/VRRP systems online at the same time in the same network, multiple such heartbeat signals can be seen. So that the secondary node knows which signal belongs to its own primary node, each pair's heartbeat must carry a number that is unique on that network. By default on an Access Server failover pair setup this number is 94. You can adjust the VHID on the command line to ensure that each failover pair running in the same LAN recognizes its partner node properly.
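Before picking a VHID it can help to see which ones are already in use on your network. The following is a minimal sketch, not part of Access Server: the helper name list_vrids is our own, and it assumes tcpdump prints advertisements in the same text form as the example output in the Troubleshooting section, where the VHID appears as "vrid".

```shell
# list_vrids (hypothetical helper): read tcpdump text output on stdin and
# print each distinct VHID ("vrid" in tcpdump's output), one per line,
# in numeric order.
list_vrids() {
    sed -n 's/.*vrid \([0-9][0-9]*\).*/\1/p' | sort -un
}
```

As root you could feed it a short capture, for example: timeout 10 tcpdump -lni any vrrp | list_vrids. Any number it prints is already taken by some UCARP/VRRP system on that network.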

First steps in setting up the primary node

This part is the same as setting up a normal OpenVPN Access Server installation on a private network. You will need a supported Linux operating system with a private static IP address. We have separate technical documentation on how to set a static IP address on a Linux installation, if you need it. Some networks work with a DHCP server with a static IP address assignment for DHCP clients, and if you have that configured and working, then that is also acceptable.

Since you will be running the Access Server failover pair inside of a private network, if you want people from the Internet to reach it, you will need to set up port forwarding in the gateway system on this network that leads to the Internet. For initial testing you can forward ports TCP 443, TCP 943, and UDP 1194 to the static IP address of your primary node. This way you can set up your Access Server and get it reachable and working from the outside. Later, you should direct the port forwards to the virtual IP chosen for your failover setup instead.

You would ideally have a DNS record set up that points to the public IP address of your Internet gateway system that leads to the Access Server, and you would have this configured in the Access Server's Network Settings page in the host name or IP address field. This field contains the address clients will try to connect to. A DNS name allows for easy updates if the public IP of the server ever changes in the future, and it also makes it possible for a proper SSL certificate to be installed.
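How you configure port forwarding depends entirely on your gateway. As an illustration only, the sketch below assumes a Linux gateway that uses iptables and whose Internet-facing interface is called eth0; both are assumptions, and the helper name print_forward_rules is our own. It only prints the DNAT rules for the three ports mentioned above so you can review them; it does not apply anything.

```shell
# print_forward_rules TARGET_IP [WAN_IF]   (hypothetical helper)
# Print, but do not apply, iptables DNAT rules that forward TCP 443,
# TCP 943 and UDP 1194 to TARGET_IP. WAN_IF defaults to eth0, which is
# an assumption about the gateway and may differ on your system.
print_forward_rules() {
    target=$1
    wan_if=${2:-eth0}
    for spec in tcp:443 tcp:943 udp:1194; do
        proto=${spec%%:*}   # part before the colon
        port=${spec##*:}    # part after the colon
        echo "iptables -t nat -A PREROUTING -i $wan_if -p $proto --dport $port -j DNAT --to-destination $target:$port"
    done
}
```

Calling print_forward_rules 192.168.70.1 gives rules for initial testing against the primary node's example address used later in this guide; calling it again with the shared virtual IP gives the rules you would switch to once failover is configured. Depending on the gateway's firewall policy, matching FORWARD rules may also be needed.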

You will need the program rsync present on your primary node. Install it:

apt-get update
apt-get install rsync

The program rsync is used to transfer configuration backups, user certificates, and user properties, from the primary node to the secondary node. In the event of a failover, the secondary node loads these backups and goes online and takes over the tasks from the failed node with this up-to-date information.

Preparing the failover node for use

We are going to assume you have a server already set up as the primary node, as described in the section above.

To set up the secondary node, simply do a new deployment of Access Server. It doesn't matter whether you use an appliance, a virtual image, or a manual installation on Linux. You do not need to configure all the settings of the Access Server; just get it to the point where the Access Server package is installed and you can get to the command line. Next, set up a static IP address for this node as well, just like on the primary node, but with a different IP address. You do not need to do port forwarding to this node. Get root permissions on the server you are going to use as the secondary node and run the following destructive command on it to clear all its settings and prepare it for use as a secondary node.

Prepare the secondary node for its role as a failover system:

ovpn-init --secondary

You will have to confirm this step manually by typing the word DELETE, to confirm that you want to wipe this server's settings and set it up as a failover node. This step wipes all of this particular node's settings, so if this is a production node that contains data you want to keep, do not demote it to a failover role but set up a new failover node instead. If you want to automate this command completely so that it doesn't ask for confirmation, you can add the parameters --batch and --force to it.

You will need the program rsync present on your secondary node. Install it:

apt-get update
apt-get install rsync
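Since both nodes need rsync, and the next section relies on the SSH tools as well, a quick check for missing programs can save some confusion later. This is a small sketch with a helper name of our own (missing_tools); it only looks for the commands in the PATH and installs nothing.

```shell
# missing_tools NAME...   (hypothetical helper)
# Print each named command that cannot be found in PATH.
# Returns non-zero if at least one is missing.
missing_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            rc=1
        fi
    done
    return $rc
}
```

Running missing_tools rsync ssh ssh-keygen ssh-copy-id on each node should print nothing if everything this guide uses is present.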

Set up bi-directional SSH access

Currently the Access Server needs root level access to the partner node in order to configure things and to keep the settings updated. There are two ways to go about this. One is to use passwordless SSH keys, which are automated and fairly secure. The other is to enable root user login directly through SSH with a password, but this is not considered secure. We are therefore going to focus here on the passwordless SSH key setup.

We are going to make a number of assumptions in this guide and you should adjust for your situation as necessary:

  • We are assuming that you cannot login with the username root via an SSH connection.
  • We assume that you do have the ability to login through SSH with a user other than root, and that with the use of the command 'sudo su', you can gain root privileges.
  • In our guide we assume that this non-root user is called simply sshuser.
  • We assume that 192.168.70.1 is your primary node's IP address.
  • We assume that 192.168.70.2 is your secondary node's IP address.
  • We assume that 192.168.70.3 is the shared virtual IP that your failover pair will work to keep online at all times.
  • We assume that you are logged on through SSH and have obtained root privileges on both nodes.
  • All commands below are assumed to be run as the root user.

Log on to both nodes and run these commands on both nodes:

mkdir -p ~/.ssh
chmod 700 ~/.ssh
cd ~/.ssh
ssh-keygen -t rsa -f id_rsa -P ""
cat id_rsa.pub >> authorized_keys
chmod 600 authorized_keys

This creates SSH access keys that require no password to log in. But they need to be transferred to their partner node and put into the correct place, so that the nodes know when and how to use them for direct SSH access without the need to log in with credentials.

On the primary node, copy the key to the secondary node:

/usr/bin/ssh-copy-id -i ~/.ssh/id_rsa.pub sshuser@192.168.70.2

And vice-versa, on the secondary node, copy the key to the primary node:

/usr/bin/ssh-copy-id -i ~/.ssh/id_rsa.pub sshuser@192.168.70.1

You will likely have to confirm that you want to make a connection for the purpose of copying the SSH access key to its partner node. You will have to enter the password of the user sshuser to complete the transfer.

Once this copy process is done, the keys are in the wrong place: they are in the authorized_keys file of sshuser rather than that of root. Run this command on both nodes to put the SSH access keys in the correct place for root access:

cat /home/sshuser/.ssh/authorized_keys >> /root/.ssh/authorized_keys
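Note that running the command above more than once appends the same keys again. That is harmless, but if you prefer to keep the file tidy, the following sketch appends only keys that are not yet present. The helper name merge_keys is our own, and the sketch assumes GNU grep as found on typical Linux installations.

```shell
# merge_keys SRC DST   (hypothetical helper)
# Append to DST only those lines of SRC that DST does not already contain,
# so running it twice does not duplicate keys.
merge_keys() {
    src=$1
    dst=$2
    touch "$dst"
    chmod 600 "$dst"
    # -F fixed strings, -x whole-line match, -f patterns from DST,
    # -v keep only the SRC lines that are NOT already in DST
    new=$(grep -vxFf "$dst" "$src" || true)
    if [ -n "$new" ]; then
        printf '%s\n' "$new" >> "$dst"
    fi
}
```

With the paths used in this guide, that would be: merge_keys /home/sshuser/.ssh/authorized_keys /root/.ssh/authorized_keys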

To test that this is working, try to establish an SSH connection from the primary node to the secondary node by typing only:

ssh root@192.168.70.2

If this works, that means the passwordless SSH key setup has succeeded. You should test the other direction as well, from secondary node to primary node.
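If you want to repeat this test without an interactive session, for instance from a script, you can tell the SSH client never to prompt. The sketch below uses the standard OpenSSH options BatchMode and ConnectTimeout; the helper name check_passwordless_ssh and the SSH_BIN override (useful only for dry runs) are our own. It assumes the partner's host key has already been accepted, which the manual test above takes care of.

```shell
# check_passwordless_ssh HOST   (hypothetical helper)
# Succeed only if root@HOST lets us in with our key, without any prompt.
# BatchMode=yes makes ssh fail instead of asking for a password.
# SSH_BIN is an optional override of the ssh program, for dry runs.
check_passwordless_ssh() {
    "${SSH_BIN:-ssh}" -o BatchMode=yes -o ConnectTimeout=5 "root@$1" true
}
```

On the primary node you would run check_passwordless_ssh 192.168.70.2 && echo OK, and on the secondary node the same with 192.168.70.1.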

Configure the failover function

Log on to your primary node's admin UI web interface, and go to the failover page. Switch on the LAN model (UCARP-based failover) option and then enter the shared virtual IP that you want both nodes to try to keep online at all times, and enter the IP address of your primary node and your failover node. Assuming you used the passwordless SSH key setup described in the section above, you do not need to alter any of the other values. Now select the Validate option and let the Access Server check the connection. If all is well you should see a good result. You can then use the Commit and Restart button to commit the changes.

Once the changes have been committed, the primary node's Access Server service will automatically restart itself and go online as the primary node in failover mode. It will bring online the virtual shared IP address (192.168.70.3 in our example) and offer its services there. The secondary node will go into standby mode and no longer offer a web service or VPN service at its configured static IP address. It will simply stand by and wait for a failure of the primary node. If the primary node fails, it will automatically take over the role of the primary node, go online, offer a web service and VPN service, and handle incoming connections just like the failed node would have.

You should now update your port forwarding settings to ensure that they point to the shared virtual IP address (192.168.70.3 in our example). Your failover setup is now functional. You may test it by, for example, shutting down the primary node and checking whether your failover node now becomes the primary node. You can follow the state changes in the /var/log/openvpnas.log and /var/log/openvpnas-node.log files, and you can also open the public address of your Access Server's web interface to check that it still responds once the primary node has been shut down.
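Another way to see which node is currently active is to check which one holds the shared virtual IP. The following sketch parses the one-line output format of ip -o -4 addr show; the helper name holds_ip is our own, and we are assuming the virtual IP shows up in that listing as an additional address on an existing interface.

```shell
# holds_ip ADDRESS   (hypothetical helper)
# Read `ip -o -4 addr show` output on stdin and succeed if ADDRESS is
# assigned to any local interface. Field 4 of each line is ADDRESS/PREFIX.
holds_ip() {
    awk -v ip="$1" '{ split($4, a, "/"); if (a[1] == ip) found = 1 }
                    END { exit !found }'
}
```

On either node, ip -o -4 addr show | holds_ip 192.168.70.3 && echo "this node is active" should print the message only on the node that is currently handling traffic.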

Finally, you should look into the licensing status of your servers.

Activate license on LAN model failover pair

Licensing in this model is of a special courtesy type. Each node needs its own license key. This failover model will only ever allow one of the two nodes to be actively handling VPN tunnel connections, while the other node is in standby mode. In this situation, we require that the primary node has a valid purchased license key, as with a normal Access Server setup, but for the failover node we will, on request and at our discretion, provide a matching failover license key that is suitable only for the failover node in a LAN model UCARP-based failover pair. Simply submit a support ticket to us requesting a failover license key, and provide the primary key that we should match it against. Our reasoning here is simple: in this failover model it is impossible to have both nodes actively handling connections at the same time. Therefore we are not giving away free licenses. We are only enabling you to run a failover setup by providing a failover license that stays unused on a dormant server until the primary node becomes unavailable and the failover node needs to take over.

License activation must be done differently on a failover pair. We recommend that you log on through SSH to the node by its static individual IP address (and not the shared virtual IP address) where you wish to activate a license key, obtain root privileges, and then do the license key activation on the command line with this command:

/usr/local/openvpn_as/scripts/liman activate "LICE-NSEK-EYIN-HERE"

You can verify the result with this command:

/usr/local/openvpn_as/scripts/liman info

Troubleshooting

If you experience the situation where both nodes simultaneously try to be a MASTER node, or primary node, then your nodes may simply not be able to communicate with each other using VRRP heartbeat signals. There is a way to find out for certain if this is the case. An active primary node will send out VRRP packets onto the network; a secondary node in standby mode will not. If you stop the Access Server service on the secondary node, the primary node should be online as the primary node, be the MASTER in the network, and be sending out VRRP packets that are visible to the secondary node. So for testing purposes, stop the Access Server service on the secondary node and use tcpdump to see whether the VRRP packets arrive at the secondary node.

Stop the Access Server service on the secondary node:

service openvpnas stop

Install tcpdump on the secondary node:

apt-get update
apt-get install tcpdump

Use tcpdump to look for VRRP packets:

tcpdump -eni any vrrp

Example output:

18:15:53.000605 M 00:00:5e:00:00:5f ethertype IPv4 (0x0800), length 72: 192.168.70.1 > 224.0.0.18: VRRPv2, Advertisement, vrid 94, prio 0, authtype none, intvl 1s, length 36
18:15:54.000718 M 00:00:5e:00:00:5f ethertype IPv4 (0x0800), length 72: 192.168.70.1 > 224.0.0.18: VRRPv2, Advertisement, vrid 94, prio 0, authtype none, intvl 1s, length 36
18:15:55.000802 M 00:00:5e:00:00:5f ethertype IPv4 (0x0800), length 72: 192.168.70.1 > 224.0.0.18: VRRPv2, Advertisement, vrid 94, prio 0, authtype none, intvl 1s, length 36

If you do not see VRRP packets arriving, there is a very good chance that your network equipment is blocking them. In that case you should try to find a way to resolve that. If your network is incapable of passing these VRRP packets, then unfortunately you cannot use the LAN model UCARP-based failover model of the OpenVPN Access Server product.
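When reading a longer capture it can be easier to reduce it to who is sending heartbeats for which VHID. The sketch below does that for tcpdump text output in the form shown in the example above; the helper name advert_sources is our own and the parsing relies on that output format.

```shell
# advert_sources   (hypothetical helper)
# Read tcpdump text output on stdin and print the distinct
# "VHID source-IP" pairs seen in VRRP advertisements.
advert_sources() {
    awk '{
        src = ""; vrid = ""
        for (i = 1; i <= NF; i++) {
            # the address just before the last ">" on the line is the sender
            if ($i == ">") src = $(i - 1)
            # the field after "vrid" is the VHID, followed by a comma
            if ($i == "vrid") { vrid = $(i + 1); sub(/,$/, "", vrid) }
        }
        if (src != "" && vrid != "") print vrid, src
    }' | sort -u
}
```

For example, timeout 10 tcpdump -eni any vrrp | advert_sources run as root on the secondary node. With the service stopped there, you would expect a single line with your VHID and the primary node's IP address. No output means no heartbeat arrived, and more than one source address for the same VHID may indicate that two systems are both acting as MASTER or that another pair is using the same VHID.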