October 14, 2015

Resilient Software Design

Software design and architecture have to meet a lot of non-functional requirements: accessibility, certifications, testability and maintainability, just to name a few. In this article we will look at resilient software design, which addresses the requirement of resilience, of course, but also availability.

Availability is one of the most important aspects if your business depends on your website being up and running: if the website is down for one hour, you are not earning anything during that time. Formally, availability is defined as the relationship between MTTF (mean time to failure) and MTTR (mean time to recovery):

    Availability = MTTF / (MTTF + MTTR)

In former days it was quite easy: there was one monolithic system. If you wanted to achieve 99% uptime, you just had to make sure that this one system achieved exactly that availability rate. It meant about 87.6 hours of downtime a year, e.g. for maintenance, which is quite easy to handle.
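The yearly downtime implied by an availability rate is easy to compute; here is a small helper sketch (the class and method names are our own):

```java
// Sketch: convert an availability rate into hours of downtime per year.
public class Downtime {
    static double hoursPerYear(double availability) {
        // 365 days * 24 hours = 8760 hours in a (non-leap) year
        return (1.0 - availability) * 365 * 24;
    }

    public static void main(String[] args) {
        // 99% availability leaves 1% of the year, i.e. about 87.6 hours, of downtime
        System.out.printf("99%% uptime = %.1f hours of downtime per year%n",
                hoursPerYear(0.99));
    }
}
```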

Given modern requirements, however, a monolithic system is no longer suitable. Nowadays, IT landscapes incorporate a wide range of services, and switching to microservices increases the number of services even more.

So let’s have a quick look at the numbers, sticking to our 99% availability rate for the web server and adding two dependent services. The web server only works if every component is up and running.

As we can see in the illustration above, the overall availability of the web server would only be 97% (that is 262.8 hours of downtime a year!) if we stick to 99% for each subsystem. To actually achieve our goal of 99% availability for the web server, we must increase the availability of each subsystem. After doing the maths (0.99^(1/3) ≈ 0.9967), we can see that we need about 99.7% uptime for each subsystem (about 29.3 hours of downtime a year), for which we actually need a Tier 4 data center.
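The arithmetic behind these figures can be checked with a few lines (a sketch; class and method names are made up):

```java
// Sketch: availability arithmetic for a chain of dependent components.
public class CompoundAvailability {
    // Overall availability when every one of n components must be up.
    static double chain(double perComponent, int n) {
        return Math.pow(perComponent, n);
    }

    // Per-component availability required to hit an overall target.
    static double required(double overallTarget, int n) {
        return Math.pow(overallTarget, 1.0 / n);
    }

    public static void main(String[] args) {
        // Three components at 99% each: 0.99^3 ≈ 0.9703, i.e. only 97% overall
        System.out.printf("overall with three 99%% parts: %.4f%n", chain(0.99, 3));
        // To get 99% overall: 0.99^(1/3) ≈ 0.9967 per component
        System.out.printf("needed per component: %.4f%n", required(0.99, 3));
    }
}
```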

And remember: This example only works for three services. Applying the model to an environment with roughly 100 services (and not even considering load balancers, firewalls, switches, etc.) will yield a different picture altogether.

All of this shows that increasing the MTTF to the absolute maximum is essential. However, availability is not the only issue. If a system goes down, chances are high that there will be some sort of chain reaction: if a system is slow, all systems that depend on it slow down as well. Depending on your traffic, this will fill up your thread pools and make the system unusable. Furthermore, if a system goes down, other systems that are still requesting data from it may prevent it from starting up again. As you can imagine, this can lead to a failure of the overall system, which might take a long time to recover from and requires a lot of manual work to restart the services in the right order, respecting their dependencies. To prevent this scenario, there is another approach to system architecture: a resilient one.

Going resilient

In a resilient IT landscape it is accepted that errors occur. All components are designed to deal with situations in which, e.g., another service is unresponsive or down. Instead of maximizing the MTTF, we minimize the MTTR.
There are different principles for building such an IT landscape. In this article we just want to explain the ones that are most important to us:

Loose Coupling

Loose Coupling is not only useful in software design itself but also from a system point of view. For example, services that communicate via a queue are less likely to cause an overall failure when one of them fails. Asynchronous communication has a lot of benefits, e.g. you can deploy new versions of a service without any downtime for other services: all messages will simply be processed once the new service is up and running again. Other aspects like scalability come more or less for free (e.g. just start more instances listening on the same queue).
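As a minimal sketch of this decoupling, a consumer working off a backlog might look like this (using an in-process BlockingQueue as a stand-in for a real message broker; all names are ours):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class QueueDecoupling {
    // Drain whatever has piled up on the queue; the short poll timeout
    // lets the consumer notice when the backlog is empty.
    static int drain(BlockingQueue<String> queue) throws InterruptedException {
        int processed = 0;
        while (queue.poll(100, TimeUnit.MILLISECONDS) != null) {
            processed++; // process the message here
        }
        return processed;
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        // The producer keeps publishing even while the consumer is down,
        // e.g. during the deployment of a new version.
        queue.put("order-1");
        queue.put("order-2");
        // Once the consumer is back up, it simply works off the backlog.
        System.out.println("processed " + drain(queue) + " queued messages");
    }
}
```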


Bulkheads

Each system must be as independent of all other systems as possible. In the literature, this is often referred to as bulkheads or failure units. These units are designed to prevent a failure from cascading through the system. This can be achieved, e.g., by monitoring request times and by no longer requesting information from another system when its response times get too high. In that scenario, fallbacks are used so that the overall system remains in (limited) use.
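One way to implement such a failure unit is to put a deadline on every remote call and fall back when the dependency is too slow (a simplified sketch; in practice a library like Hystrix handles this, and the names and thresholds here are invented):

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

public class TimeoutGuard {
    // A dedicated pool per dependency acts as a bulkhead: a slow
    // dependency can only exhaust its own threads, not the caller's.
    static final ExecutorService POOL = Executors.newFixedThreadPool(2);

    // Call a remote dependency with a deadline; return the fallback
    // instead of blocking the caller when the dependency misbehaves.
    static <T> T callWithFallback(Supplier<T> remoteCall, long timeoutMillis, T fallback) {
        Future<T> future = POOL.submit(remoteCall::get);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | InterruptedException | ExecutionException e) {
            future.cancel(true); // stop waiting on the slow dependency
            return fallback;
        }
    }

    public static void main(String[] args) {
        String fast = callWithFallback(() -> "live data", 200, "cached data");
        Supplier<String> slowService = () -> {
            try { Thread.sleep(500); } catch (InterruptedException e) { }
            return "live data";
        };
        String slow = callWithFallback(slowService, 50, "cached data");
        System.out.println(fast + " / " + slow); // live data / cached data
        POOL.shutdownNow();
    }
}
```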


Redundancy

Redundant services are useful in many scenarios. They can improve scalability by splitting requests across more instances, or make a system more failure-proof by comparing the responses of two instances of the same service. To us, however, the most important use case is automatic failover: if one instance is down, another one can still handle the requests. Either way, you get a more robust IT landscape without a single point of failure.
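A minimal form of automatic failover is to try each replica in order and use the first healthy response (a sketch with invented names; real failover would also track instance health):

```java
import java.util.List;
import java.util.function.Supplier;

public class Failover {
    // Try each replica in order; the first healthy response wins.
    static <T> T firstHealthy(List<Supplier<T>> replicas) {
        RuntimeException last = null;
        for (Supplier<T> replica : replicas) {
            try {
                return replica.get();
            } catch (RuntimeException e) {
                last = e; // this instance is down, try the next one
            }
        }
        throw last != null ? last : new IllegalStateException("no replicas configured");
    }

    public static void main(String[] args) {
        Supplier<String> broken = () -> { throw new RuntimeException("instance down"); };
        Supplier<String> healthy = () -> "response from replica 2";
        // The failure of the first instance is invisible to the caller.
        System.out.println(firstHealthy(List.of(broken, healthy)));
    }
}
```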

All of these techniques happen far away from the end user. Even if some services keep working with a reduced feature set, this will eventually have an impact on, e.g., the website that uses them. Preventing a system from failing completely, while keeping it up and running with reduced functionality, is a concept called Graceful Degradation Of Service.

Graceful Degradation Of Service

On a website it is, in most cases, really easy to gracefully degrade the quality of a service. As an example, let’s assume that willhaben.at is a bunch of components, each of them with its own services running.

As you can see in the picture of our example, if some part of the autocomplete service fails, we can still serve most requests from the cache (for a detailed explanation, you can read the blog post about autocomplete here). So we keep providing the service to the user, just with lower quality. Even if the complete service fails, it won’t affect other parts of the site; we will just deactivate the whole feature automatically.

Another example would be a failure of the service for user-related actions (we assume one component for autocomplete and one for user-related actions): in that case, we can simply deactivate user-related functionality like logging in or adding new items to our site.
Either way, the core functionality, i.e. searching for items on our site, is not affected, so we can still serve a relatively high percentage of happy users who interact with the site and remain unaware of the problems.
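Put together, the degradation logic for a component like autocomplete can be sketched as follows (purely illustrative: the class, the cache and the health flag are our own simplifications):

```java
import java.util.List;
import java.util.Map;

public class AutocompleteFacade {
    // Hypothetical cache of popular prefixes, kept fresh while the service is healthy.
    private final Map<String, List<String>> cache;
    private final boolean serviceUp;

    AutocompleteFacade(Map<String, List<String>> cache, boolean serviceUp) {
        this.cache = cache;
        this.serviceUp = serviceUp;
    }

    List<String> suggest(String prefix) {
        if (serviceUp) {
            return queryService(prefix);  // full quality: live suggestions
        }
        if (cache.containsKey(prefix)) {
            return cache.get(prefix);     // degraded: serve possibly stale suggestions
        }
        return List.of();                 // feature off: empty list, the page still works
    }

    private List<String> queryService(String prefix) {
        return List.of(prefix + "...");   // stand-in for the real service call
    }
}
```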


Leaving the “all or nothing” approach to uptime behind and accepting that parts of the system, and parts of our website, can fail while others keep working fine leads to a more robust IT landscape, increased overall availability and a more stable system. Building resilient IT systems does not have to be rocket science after all. Starting with small parts, like using Hystrix, and adding more and more features of a resilient infrastructure will incrementally improve each service, the overall service interaction and the system architecture.

