- Nah, it's totally doable. With HAProxy.
A typical hardware choice would be something like F5's BIG-IP series. Unfortunately, a hardware solution does not give us the flexibility we need: we constantly want to change things like pool members, create new pools, manage IP quotas, and kick unwanted or abusive traffic off our servers. Plus, if you plan to switch entirely from HTTP to HTTPS, performance also becomes an issue.
So we decided to go for it and implement our own solution. We chose HAProxy and did not regret it. The reasons are manifold, but mostly we were looking for open source, speed, and scalability.
## Installation

To start, get your favourite Linux flavour and install HAProxy. We use Red Hat 7.1 on an "old" server with two Intel Xeon X5560 CPUs @ 2.8 GHz, resulting in 16 logical cores. The disks are ordinary (non-SSD) drives, and the server has six 1-Gigabit network interfaces on board.
Once installed, the HAProxy configuration is changed to something like this:
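The original configuration is not reproduced here, but a minimal setup along these lines might look as follows (a sketch, not our production config; the server names, addresses, and health-check path are assumptions):

```haproxy
global
    daemon
    maxconn 100000

defaults
    mode http
    timeout connect 5s
    timeout client  30s
    timeout server  30s

frontend www
    bind 0.0.0.0:80
    default_backend webservers

backend webservers
    balance roundrobin
    option httpchk GET /health
    server web1 10.0.0.11:8080 check
    server web2 10.0.0.12:8080 check
```

Reload the service and HAProxy starts distributing requests across the pool members, dropping any server that fails its health check.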
You have just created a high-performance load balancer for willhaben.at. For us, however, it did not end here. We had to tackle some demanding problems before we could even think of going into production with our solution. But enough boring installation procedures; here comes the really interesting part.
## Bandwidth

Obviously, a single 1GBit link is not sufficient. Enthusiastically, we decided to go with 10GBit interfaces. However, we had not anticipated how ridiculously expensive these things are, especially when you need four of them plus a fiber-linked switch behind them. So, no 10GBit solution.
Instead, we use four of our six 1Gb interfaces to create an 802.3ad-conformant network bond.
The configuration from /etc/sysconfig/network-scripts/ifcfg-bond0 looks like this:
BONDING_OPTS="miimon=100 mode=4 xmit_hash_policy=2"
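Only the bonding options are shown above; a fuller ifcfg-bond0, together with one of the four slave interfaces, might look like this (a sketch assuming standard Red Hat conventions; the device names and addresses are illustrative):

```ini
# /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
TYPE=Bond
BONDING_MASTER=yes
BOOTPROTO=none
ONBOOT=yes
IPADDR=10.0.0.10
NETMASK=255.255.255.0
BONDING_OPTS="miimon=100 mode=4 xmit_hash_policy=2"

# /etc/sysconfig/network-scripts/ifcfg-eth0 (repeat for eth1..eth3)
DEVICE=eth0
BOOTPROTO=none
ONBOOT=yes
MASTER=bond0
SLAVE=yes
```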
The BONDING_OPTS line is the key part: mode 4 selects 802.3ad, and xmit_hash_policy defines the algorithm that decides which interface to use for outgoing traffic. The policy computes a hash per connection and takes it modulo the number of interfaces (4 in our case). By default, the fastest hash type is used, and that is layer 2 only. We had to switch this to layer 2+3 (xmit_hash_policy=2) because MAC addresses do not change across VLANs: we would always see the MAC of the switch/router and therefore always end up on the same interface. The same policy also needs to be configured on the switch for incoming traffic.
With this setup we achieve both network redundancy and a decent traffic distribution:
What? We are getting more than 1Gb/s already? Nice catch. But note that this is the overall traffic. When the balancer uploads an image to the client, for example, it first needs to download it from the server. And since all interfaces are full duplex, data can be sent and received at the same time.
Alright. First problem solved. We have enough bandwidth, yay!
## Redundancy

Now that we have a high-bandwidth load balancer, we don't want it to be a single point of failure for the whole platform. We need redundancy. Basically, there are two ways to achieve it:
- Two active nodes: In this case we need a public IP address for each node and publish both via DNS records. The advantage is that we get additional load balancing. A disadvantage is that we have to start considering client-side behaviour when debugging a problem.
- One active node with hot standby: Here, we simply clone the existing machine and if the master fails, the slave takes over.
We implemented the master/slave design with keepalived. This daemon uses VRRP as a protocol to synchronise IP addresses between hosts. If the master goes down, all IPs are simply transferred to the slave. So what we have so far is something like this:
The keepalived configuration is pretty straightforward:
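A minimal configuration matching the setup described here might look like this (a sketch; the interface name, router ID, priorities, and addresses are assumptions):

```
vrrp_script check_integrity {
    script "/etc/keepalived/check_integrity.sh"
    interval 2      # run the health check every 2 seconds
    fall 2          # two consecutive failures -> enter FAULT state
}

vrrp_instance VI_1 {
    state MASTER            # BACKUP on the slave node
    interface bond0
    virtual_router_id 51
    priority 100            # lower value on the slave
    advert_int 1
    virtual_ipaddress {
        203.0.113.10/24
        203.0.113.11/24
    }
    track_script {
        check_integrity
    }
}
```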
In a nutshell: all IP addresses from the virtual_ipaddress section are transferred to the slave if the script check_integrity.sh fails. The only thing left to do is add net.ipv4.ip_nonlocal_bind=1 to /etc/sysctl.conf. This allows a service to bind to a socket (an IP+port combination) whose IP does not yet exist on the host, which is necessary to start HAProxy on the slave before it holds its assigned IPs.
And that's another job well done. Redundancy: Check!
## Connection Management

The last issue we want to tackle here is connection management.
When a client communicates with a server via TCP, the client uses its IP address and a (random) source port to connect to the server's IP address on its destination port (e.g. 80 for HTTP). Since each such quadruple (source IP, source port, destination IP, destination port) can only be used once, there is a limit on how many concurrent connections are possible. Let us consider incoming connections to the load balancer.
The load balancer's destination IP (www.willhaben.at) and port (80) are fixed, and the client IP is fixed as well. That leaves only the client's source port to vary. Since port numbers are 16-bit fields, the maximum number of connections a single client can produce is 2^16, or 64Ki. That is more than enough for a single machine.
Internally, however, these requests need to be forwarded to the appropriate server. Because the load balancer now acts as the client, the same 64Ki limit applies, with one important difference: it now applies to all connections combined. Reaching this limit is called port exhaustion, something we desperately want to avoid.
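To put numbers on this, here is a quick back-of-the-envelope check (the ephemeral port range shown is a common Linux default; your net.ipv4.ip_local_port_range may differ):

```python
# A TCP connection is identified by (src_ip, src_port, dst_ip, dst_port).
# With src_ip, dst_ip and dst_port fixed, only src_port can vary.
PORT_BITS = 16
total_ports = 2 ** PORT_BITS
print(total_ports)  # 65536 possible source ports in theory

# In practice, Linux only hands out ephemeral ports from a range,
# commonly 32768-60999 by default, so the real limit is even lower.
ephemeral_ports = 60999 - 32768 + 1
print(ephemeral_ports)  # 28232 usable source ports per (dst_ip, dst_port)
```

So a single source IP talking to a single server IP and port runs out of ports well before 64Ki, which is exactly why we split the balancer as described next.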
We do this by splitting our balancer processes in two parts:
- the actual frontend, where the public IPs are bound and SSL offloading happens, and
- multiple backend processes, where the pools are managed and requests are forwarded to the actual servers.
Both components are connected via the built-in proxy protocol v2. Taking the example from above, the frontend configuration now looks like this:
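Again a sketch rather than our exact configuration (the ports, addresses, and certificate path are assumptions): the frontend hands connections to local backend processes via the PROXY protocol (send-proxy-v2), and each backend process accepts it with accept-proxy.

```haproxy
# Frontend process: binds the public IPs and offloads SSL
frontend www
    bind 0.0.0.0:80
    bind 0.0.0.0:443 ssl crt /etc/haproxy/willhaben.pem
    default_backend be_processes

backend be_processes
    # hand over to the local backend processes via PROXY protocol v2
    server be1 127.0.0.1:10001 send-proxy-v2
    server be2 127.0.0.1:10002 send-proxy-v2

# Backend process (a separate HAProxy process):
# accepts the PROXY protocol and talks plain HTTP to the servers
frontend from_frontend
    bind 127.0.0.1:10001 accept-proxy
    default_backend webservers

backend webservers
    balance roundrobin
    server web1 10.0.0.11:8080 check
    server web2 10.0.0.12:8080 check
```

Each backend process gets its own localhost port, so every process brings its own fresh set of source ports for the server-side connections.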
Depending on the amount of connections, we can just spawn additional backend processes.
The separate frontend also has the advantage of separating SSL offloading from the backend connections. You may have noticed the additional bind on port 443: all HTTPS connections on this IP are offloaded by the frontend as well, and the result is a plain HTTP connection to one of the backends.
Well, third time's the charm and that's essentially it.
## Summary

The three chapters presented here are a very coarse-grained overview of what we actually did. We had to fight multiple sub-problems before we had a really reliable solution, but describing them in detail is out of scope and probably not interesting for the majority. To wrap it up, here are some numbers you might find interesting.
- We run the frontend as one process spread across 8 cores.
- All our traffic is SSL-encrypted and offloaded by HAProxy.
- For 1 Gbit/s of SSL traffic, the frontend cores are 60% loaded on average.
- We use 2 backend processes for server connections.
- Each backend process has 4 localhost binds (yes, that's a trick we did not mention, sorry).
- We have around 10k-20k connections per second.
- We have around 150k established connections between frontend and backend.
- We manage 120k established client connections on peak times.
- We manage around 40k concurrent SSL connections.
- We have around 600 SSL handshakes per second.
- We use http keep-alive for clients and frontend-backend connections.
- We use http-server-close on the server side, and a connection pool to restrict concurrency.
So, yes! You can absolutely replace a hardware load balancer with a software solution. All you need are some old servers, some time, and a few people who know their TCP.