September 10, 2015

Running Austria's Most Visited Web Site on a Software Load Balancer? Impossible!

- Nah, it's totally doable. With HAProxy.


When a website reaches a certain size, it becomes necessary to distribute load between multiple server instances. Incoming requests are spread via a load balancer, which, in turn, represents the most critical part of the network infrastructure. Among the best-known hardware products is probably F5's BIG-IP series. Unfortunately, a hardware solution does not provide the flexibility we need: we constantly want to change things like pool members, create new pools, manage IP quotas and kick unwanted or abusive traffic off our servers. Plus, if you plan to switch from HTTP to HTTPS entirely, performance also becomes an issue.

So we decided to go for it and implement our own solution. We chose HAProxy and did not regret it. The reasons are manifold but mostly we were looking for open source, speed, and scalability.


To start, get your favourite Linux flavour and install HAProxy. We use Red Hat 7.1 on an "old" server with two Intel Xeon X5560 CPUs at 2.8 GHz, resulting in 16 logical cores. Disks are ordinary (non-SSD) disks and the server has six 1-Gigabit network cards on board.

Once installed, the HAProxy configuration is changed to something like this:

listen http-frontend
    bind    #
    balance roundrobin
    server  web1
    server  web2
    server  web3
    server  web4

You have just created a high-performance load balancer. For us, however, it did not end here. We had to tackle some demanding problems before we could even think of going into production with our solution. But enough boring installation procedures; here comes the really interesting part.


Traffic Volume
On a typical day, our platform dishes out 800-900 Megabits per second, not counting traffic generated by mobile devices. On good days, we are scratching the 1Gb/s mark.

Obviously, a single 1-Gigabit link is not sufficient. Enthusiastically, we decided to go with 10-Gigabit interfaces. However, we hadn't anticipated how ridiculously expensive these things are, especially when you need four of them plus a fiber-linked switch behind them. So, no 10-Gigabit solution.

Instead, we use four of our six 1Gb interfaces to create an 802.3ad-conformant network bond.
The key line from /etc/sysconfig/network-scripts/ifcfg-bond0 looks like this:


BONDING_OPTS="miimon=100 mode=4 xmit_hash_policy=2" 

This line is the key part: it sets mode 4 (802.3ad) and defines the algorithm that decides which interface to use for outgoing traffic. The xmit_hash_policy computes a hash for each connection and takes it modulo the number of interfaces (4 in our case). By default, the fastest hash type is used, which looks at layer 2 only. We had to switch this to layer 2+3 (policy 2) because MAC addresses on VLANs do not change: we would always see the MAC of the switch/router and therefore always end up on the same interface. The same setting also needs to be applied on the switch for incoming traffic.
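To see why layer 2 alone fails behind a router, here is a simplified model of the hash-then-modulo idea in Python. This is a sketch, not the exact kernel formula; the MAC and IP byte values are made up for illustration.

```python
# Simplified model of the bonding driver's xmit_hash_policy, NOT the
# exact kernel formula: XOR some header bytes, then take the result
# modulo the number of bonded interfaces.

N_SLAVES = 4  # our bond has four 1-Gigabit links

def layer2_hash(src_mac_low: int, dst_mac_low: int) -> int:
    # default policy: MAC addresses only
    return (src_mac_low ^ dst_mac_low) % N_SLAVES

def layer23_hash(src_mac_low: int, dst_mac_low: int,
                 src_ip_low: int, dst_ip_low: int) -> int:
    # layer 2+3: mix the IP addresses in as well
    return (src_mac_low ^ dst_mac_low ^ src_ip_low ^ dst_ip_low) % N_SLAVES

# Behind a router, the peer MAC never changes, so layer 2 alone
# yields the same slave for every single flow:
OUR_MAC, ROUTER_MAC = 0x5e, 0xa1
print(layer2_hash(OUR_MAC, ROUTER_MAC))

# With layer 2+3, flows to different IPs can land on different slaves:
print(layer23_hash(OUR_MAC, ROUTER_MAC, 10, 20))
print(layer23_hash(OUR_MAC, ROUTER_MAC, 10, 21))
```

The point is only that the IP fields add entropy the MAC fields lack once all traffic funnels through one router.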

With this setup we achieve both network redundancy and a decent traffic distribution:
iptraf-ng 1.1.4
┌ Iface ──────────Activity ─────────┐
│ eth0         19635.38 kBps        │
│ eth1         37331.79 kBps        │
│ eth2         20382.10 kBps        │
│ eth3         39757.07 kBps        │
│ bond0          113.56 MBps        │

What? We are getting more than 1Gb/s already? Nice catch. But note that this is overall traffic. When the balancer uploads an image to the client, for example, it needs to download it from the server first. And since all interfaces are full duplex, a byte can be sent and received at the same time.
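The back-of-the-envelope arithmetic, using the numbers from above and assuming roughly symmetric pass-through proxying (every byte served to a client was first fetched from a server):

```python
# A pass-through balancer that serves X Mb/s to clients also pulls
# roughly X Mb/s from the backend servers, so the bond carries ~2X
# in total.

client_traffic = 900                    # Mb/s towards clients (peak day, see above)
total_on_bond = 2 * client_traffic      # server-side download + client-side upload

# Each 1 Gb/s link is full duplex: 1000 Mb/s per direction.
links = 4
capacity_per_direction = links * 1000   # Mb/s the bond can move each way

print(total_on_bond)                    # 1800
print(capacity_per_direction)           # 4000
```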

Alright, first problem solved. We have enough bandwidth, yay!



Redundancy

Now that we have a high-bandwidth load balancer, we don't want it to be a single point of failure for the whole platform. We need redundancy. Basically, there are two possibilities to achieve it:
  1. Two active nodes: In this case we need a public IP address for each node and broadcast both via DNS records. The advantage is that we would get additional balancing. A disadvantage is that we need to start considering client-side mechanics when debugging a problem.
  2. One active node with hot standby: Here, we simply clone the existing machine and if the master fails, the slave takes over. 
We decided to go for solution 2 with a master/slave setup. Another consideration when designing a load balancer is that normal operation must be guaranteed in case of a failure. In other words, if one server breaks, the solution still needs to be able to handle the whole traffic.

We implemented the master/slave design with keepalived. This daemon uses VRRP as a protocol to synchronise IP addresses between hosts. If the master goes down, all IPs are simply transferred to the slave.

The keepalived configuration is pretty straight-forward:

vrrp_script check_haproxy {
        script "/secret/"           #make sure everything is fine
        interval 2                  #check every 2 seconds
        weight 2                    #add weight if OK
}

vrrp_instance HAproxy {
        state MASTER
        interface bond0
        virtual_router_id 10
        priority 101
        advert_int 1
        virtual_ipaddress {
                ..... dev bond0 label bond0:3 .....
        }
        track_script {
                check_haproxy
        }
}

In a nutshell: All IP addresses from the virtual_ipaddress section are transferred to the slave if the check script fails. The only thing left to do is to add net.ipv4.ip_nonlocal_bind=1 to /etc/sysctl.conf. It allows a service to bind to a socket (an IP+port combination) whose IP address is not yet assigned to the machine. This is necessary to start HAProxy on the slave while the slave does not hold its assigned IPs.
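You can see the effect of that sysctl with a few lines of Python. This is a sketch; 192.0.2.1 is a documentation-only address standing in for a virtual IP the host does not own.

```python
import errno
import socket

def try_nonlocal_bind(addr: str = "192.0.2.1", port: int = 0):
    """Try to bind to an IP this host does not own.
    Returns None on success, otherwise the errno.
    With net.ipv4.ip_nonlocal_bind=0 the kernel refuses with
    EADDRNOTAVAIL; with =1 the bind succeeds even though the
    address is not (yet) assigned -- which is exactly what lets
    the standby balancer start with the master's virtual IPs."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind((addr, port))
        return None
    except OSError as e:
        return e.errno
    finally:
        s.close()

print(try_nonlocal_bind())
```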

And that's another job well done. Redundancy: Check!

Connection Management

The last issue we want to tackle here is connection management.
When a client communicates with a server via TCP, the client uses its IP address and a (random) source port to connect to the server's IP address on its destination port (e.g. 80 for HTTP). Since each quadruple of (source IP, source port, destination IP, destination port) can only be used once, there is a limit on how many connections are possible. Let us consider incoming connections to the load balancer.
The load balancer's destination IP and port (80) are fixed. The client's IP is also fixed. That leaves only the client's source port to vary. Since port numbers are 16-bit fields, the maximum number of connections a single client can produce is 2^16, or 65,536. That is more than enough for a single machine.

Internally, however, these requests need to be forwarded to the appropriate server. Now the load balancer acts as the client, and the same 64k limit applies. The only difference is that now the limit applies to all connections from the balancer combined. When this limit is reached, it is called port exhaustion, something we desperately want to avoid.

We do this by splitting our balancer processes in two parts:
  • The actual frontend, where IPs are bound and SSL offloading happens, and
  • multiple backend processes where the pools are managed and the requests are forwarded to the actual server.
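The effect of this split can be sketched with simple arithmetic (the bind counts below match the numbers listed at the end of the article, but are used here purely for illustration):

```python
# Every TCP connection is identified by the quadruple
# (src IP, src port, dst IP, dst port). Between one fixed source IP
# and one fixed destination endpoint, only the 16-bit source port
# varies, capping concurrent connections at 2**16.

SOURCE_PORTS = 2 ** 16  # 65536

def max_connections(local_ips: int, remote_endpoints: int) -> int:
    """Upper bound on concurrent connections: each distinct
    quadruple can be used exactly once."""
    return local_ips * SOURCE_PORTS * remote_endpoints

# A single balancer process talking to a single backend endpoint:
print(max_connections(1, 1))        # 65536 -- the port exhaustion ceiling

# Frontend talking to 2 backend processes with 4 localhost binds
# each, i.e. 8 distinct destination endpoints:
print(max_connections(1, 2 * 4))    # 524288
```

Every additional backend endpoint adds a fresh 64k pool of quadruples, which is why spawning more backend processes (or binds) pushes the ceiling out.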

Both components are connected via the built-in proxy protocol v2. Taking the example from above, the frontend configuration now looks like this:

listen http-frontend
    bind ssl crt oursslcertificate.pem
    bind    #
    balance roundrobin
    server backend1 send-proxy-v2
    server backend2 send-proxy-v2

Depending on the number of connections, we can just spawn additional backend processes.
The separate frontend also has the advantage of separating SSL offloading from the backend connections. You may have noticed the additional bind on port 443. It means that all HTTPS connections on this IP are offloaded by the frontend as well. The result is a plain HTTP connection to one of the backends.
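For completeness, here is a hypothetical sketch of what one such backend process could look like. The listener name, the localhost bind and the server lines are placeholders, not our actual configuration; accept-proxy is HAProxy's bind option for accepting the proxy protocol the frontend sends.

```
listen http-backend-1
    bind 127.0.0.1:8001 accept-proxy   # accept proxy protocol from the frontend
    balance roundrobin
    server web1
    server web2
```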

Well, third time's the charm and that's essentially it.


The three chapters presented here are a very coarse-grained overview of what we actually did. We had to fight multiple sub-problems before we had a really reliable solution. Describing them in detail, however, is simply out of scope and not interesting for the majority. To wrap it up, here are some numbers you might find interesting.
  • We are running the frontend on 1 process spread across 8 cores.
  • All our traffic is SSL encrypted and offloaded by HAProxy.
  • For 1 Gbit/s of SSL traffic, the frontend cores are 60% loaded on average.
  • We use 2 backend processes for server connections.
  • Each backend process has 4 localhost binds (yes, that's a trick we did not mention, sorry).
  • We have around 10k-20k connections per second.
  • We have around 150k established connections between frontend and backend.
  • We manage 120k established client connections on peak times.
  • We manage around 40k concurrent SSL connections.
  • We have around 600 SSL handshakes per second.
  • We use http keep-alive for clients and frontend-backend connections.
  • We use http-server-close on the server side, and a connection pool to restrict concurrency.

So, yes! You can absolutely replace a hardware load balancer with a software solution. All you need are some old servers, some time and some guys that know their TCP.
