October 14, 2015

Resilient Software Design

Software design and architecture have to meet many non-functional requirements: accessibility, certifications, testability and maintainability, to name just a few. In this article, we look at resilient software design, which addresses the requirement of resilience but also that of availability.

Availability is one of the most important aspects if your business depends on your website being up and running. If the website is down for one hour, you earn nothing during that time. Formally, availability is defined as a relationship between MTTF (mean time to failure) and MTTR (mean time to recovery):

Availability = MTTF / (MTTF + MTTR)

In the past, this was quite simple: there was one monolithic system. If you wanted to achieve 99% uptime, you just had to make sure that this one system achieved that availability rate. This allowed about 87.6 hours of downtime a year (1% of 8,760 hours), e.g. for maintenance, which is quite easy to handle.
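The relationship between MTTF, MTTR and yearly downtime can be checked with a few lines of Python. This is a minimal sketch using the standard definition MTTF / (MTTF + MTTR); the function names and the sample figures (a failure every 99 hours, one hour to recover) are assumptions chosen only to reproduce the 99% case.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time a system is up: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def downtime_hours_per_year(avail: float) -> float:
    """Expected yearly downtime for an availability between 0.0 and 1.0."""
    return (1.0 - avail) * HOURS_PER_YEAR


# Hypothetical figures: one failure every 99 hours, one hour to recover.
a = availability(mttf_hours=99, mttr_hours=1)
print(f"{a:.2%} available, {downtime_hours_per_year(a):.1f} h downtime per year")
```

The formula also shows that there are two levers: a longer MTTF or a shorter MTTR both raise availability.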

Given modern requirements, a single monolithic system is rarely suitable for current system environments. Nowadays, IT landscapes incorporate a wide range of services, and switching to microservices increases the number of services even more.

So let’s have a quick look at the numbers, sticking to our 99% availability rate for the web server and adding two services it depends on. The web server only works if every component is up and running.

As we can see in the illustration above, the overall availability of the web server would be only about 97% (roughly 260 hours of downtime a year!) if we stick to 99% for each subsystem, because the individual availabilities multiply: 0.99 × 0.99 × 0.99 ≈ 0.97. To actually achieve our goal of a 99% availability rate for the web server, we must increase the availability of each subsystem. After doing the maths (0.99 ^ (1/3) ≈ 0.9967), we can see that we need about 99.7% uptime for each subsystem (about 29.3 hours of downtime a year), for which we actually need a Tier 4 data center.

And remember: this example covers only three services. Applying the model to an environment with roughly 100 services (and not even considering load balancers, firewalls, switches, etc.) yields a different picture altogether.
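The arithmetic in this section can be reproduced with a short sketch. It assumes that all components fail independently and that the web server needs every one of them, so the availabilities simply multiply; the function names are made up for this illustration.

```python
import math

HOURS_PER_YEAR = 365 * 24  # 8760


def serial_availability(availabilities):
    """Availability of a chain in which every component must be up."""
    return math.prod(availabilities)


def required_per_component(target, n):
    """Availability each of n equally reliable components needs for the target."""
    return target ** (1.0 / n)


def downtime_hours(avail):
    """Expected yearly downtime for an availability between 0.0 and 1.0."""
    return (1.0 - avail) * HOURS_PER_YEAR


three = serial_availability([0.99] * 3)
needed = required_per_component(0.99, 3)
hundred = serial_availability([0.99] * 100)

print(f"3 services at 99% each:   {three:.4f} ({downtime_hours(three):.1f} h/year down)")
print(f"needed per service:       {needed:.4f} ({downtime_hours(needed):.1f} h/year down)")
print(f"100 services at 99% each: {hundred:.4f}")
```

With 100 such services in a row, the chain is up only a little more than a third of the time, which is why raising the availability of each component alone does not scale.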

All of this shows that increasing the MTTF to the absolute maximum is essential. However, availability is not the only issue. If a system goes down, chances are high that there will be some sort of chain reaction. If a system is slow, all other systems that depend on it slow down as well. Depending on your traffic, this fills up your thread pools and makes the system unusable. Furthermore, if a system goes down, other systems that are still requesting data from it may prevent it from starting up again. As you can imagine, this can lead to failure of the overall system. Recovering from such a failure may take a long time and require a lot of manual work, because the services have to be restarted in the right order with respect to their dependencies. To prevent this scenario, there is another approach to system architecture: a resilient one.

Going resilient

In a resilient IT landscape, it is accepted that errors occur. All components are designed to deal with situations in which, for example, another service is unresponsive or down. Instead of maximizing the MTTF, we minimize the MTTR.
There are different principles for building such an IT landscape. In this article we only explain the ones that are most important to us:

Loose Coupling

Loose coupling is not only useful in software design itself but also from a system point of view. For example, services that communicate through a queue are less likely to cause an overall failure when only one of them fails. Asynchronous communication has a lot of benefits, e.g. you can deploy a new version of a service without any downtime for other services. All messages are simply processed once the new version is up and running again. Other aspects like scalability come more or less for free (e.g. just start more instances listening on the same queue).
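The queue idea can be sketched in a few lines. Here Python's in-process `queue.Queue` stands in for a real message broker, which is an assumption made only to keep the example self-contained; the names `publish` and `drain` are hypothetical.

```python
import queue

# In-process stand-in for a message broker. A real setup would use a
# durable broker so that messages survive restarts.
orders = queue.Queue()


def publish(message):
    """The producer only needs the queue, not the consumer, to be available."""
    orders.put(message)


def drain(handler):
    """Process everything that accumulated while the consumer was away."""
    processed = []
    while True:
        try:
            message = orders.get_nowait()
        except queue.Empty:
            return processed
        processed.append(handler(message))
        orders.task_done()


# The consumer is "down" (e.g. being redeployed) while these messages arrive.
for i in range(3):
    publish({"order_id": i})

# The new version starts and works through the backlog.
results = drain(lambda m: f"handled order {m['order_id']}")
print(results)
```

The producer never notices that the consumer was unavailable; it only sees the queue.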


Each system must be as independent of all other systems as possible. In the literature, such independent parts are often called bulkheads or failure units. These units are designed to prevent a failure from cascading through the system. This can be achieved, for example, by monitoring request times and no longer requesting information from another system if its response times are too high. In this scenario, fallbacks are used so that the overall system can still be used, albeit in a limited way.
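Monitoring response times and cutting off a slow dependency is commonly implemented as a circuit breaker. The following is a hand-rolled sketch of that pattern, not the API of any particular library; the class name, the thresholds and the injected `clock` parameter are assumptions made for illustration.

```python
import time


class CircuitBreaker:
    """Stops calling a dependency after repeated slow or failed calls."""

    def __init__(self, max_failures=3, slow_after=0.5, reset_after=30.0,
                 clock=time.monotonic):
        self.max_failures = max_failures
        self.slow_after = slow_after      # seconds; slower calls count as failures
        self.reset_after = reset_after    # seconds before a trial call is allowed
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()         # open: do not touch the dependency
            self.opened_at = None         # half-open: allow one trial call
        started = self.clock()
        try:
            result = func()
        except Exception:
            self._record_failure()
            return fallback()
        if self.clock() - started > self.slow_after:
            self._record_failure()        # too slow, but the answer is still usable
        else:
            self.failures = 0
        return result

    def _record_failure(self):
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()


calls = []


def flaky_service():
    calls.append(1)
    raise TimeoutError("no response")


breaker = CircuitBreaker(max_failures=2)
answers = [breaker.call(flaky_service, fallback=lambda: "cached result")
           for _ in range(4)]
print(answers, "real calls made:", len(calls))
```

After two failures the breaker opens, so the third and fourth requests are answered by the fallback without touching the failing service. This also gives that service room to start up again.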


Redundant services are useful in many scenarios. They can improve scalability by splitting the requests between more instances, or make a system more failure-proof by comparing the responses of two instances of the same service. But to us, the most important use is automatic failover: if one instance is down, the other can still handle the requests. Either way, you get a more robust IT landscape without a single point of failure.
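Automatic failover can be sketched on the client side as "try the next instance if this one fails". This is only one possible place to implement it (a load balancer is another); the instances here are plain functions and all names are hypothetical.

```python
def call_with_failover(instances, request):
    """Try each redundant instance in order; return the first successful answer."""
    errors = []
    for instance in instances:
        try:
            return instance(request)
        except Exception as exc:  # a real client would catch narrower error types
            errors.append(exc)
    raise RuntimeError(f"all {len(instances)} instances failed: {errors}")


def primary(request):
    raise ConnectionError("primary is down")


def secondary(request):
    return f"secondary answered {request!r}"


response = call_with_failover([primary, secondary], "GET /search?q=bike")
print(response)
```

The caller only sees an error when every instance has failed.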

All of these techniques operate far away from the end user. But if some services work only with reduced functionality, this will eventually have an impact on, for example, the website that uses them. Preventing a system from failing completely and keeping it up and running with reduced functionality instead is a concept called Graceful Degradation Of Service.

Graceful Degradation Of Service

On a website, it is in most cases really easy to “gracefully degrade” the quality of a service. To give an example, let’s assume that willhaben.at consists of a number of components, each of which has its own services running.

The picture shows our example. If some part of the autocomplete service fails, we can still serve most requests out of the cache (for a detailed explanation you can read the blog post about autocomplete here). So we keep providing the service to the user, but with a lower service quality. Even if the complete service fails, it won’t affect other parts of the site. We will just deactivate the whole feature automatically.

Another example would be a failure of the service for user-related actions. In that case we can simply deactivate user-related functionality such as login or adding new items to our site.
Either way, the core functionality (i.e. searching for items on our site) is not affected, so we can still serve a relatively high percentage of happy users who interact with the site and remain unaware of the problems. We assume that we have one component for autocomplete and one for user-related actions.
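The two examples above can be sketched as follows. This is an illustrative sketch only, not how willhaben.at is actually built; the cache contents, the function names and the page structure are all assumptions.

```python
# Hypothetical cache of earlier autocomplete answers.
SUGGESTION_CACHE = {"bi": ["bike", "bicycle"]}


def suggestions(prefix, live_lookup):
    """Prefer the live service, fall back to the cache, else switch the feature off."""
    try:
        return live_lookup(prefix)
    except Exception:
        return SUGGESTION_CACHE.get(prefix)   # None means: hide autocomplete


def render_page(query, search, live_lookup, load_user):
    page = {"results": search(query)}         # core feature, always attempted
    page["suggestions"] = suggestions(query, live_lookup)
    try:
        page["user"] = load_user()
    except Exception:
        page["user"] = None                   # hide login and "add item"
    return page


def failing(*_args):
    raise ConnectionError("service unavailable")


page = render_page("bi", search=lambda q: [f"ad matching {q}"],
                   live_lookup=failing, load_user=failing)
print(page)
```

Both optional services are down in this run, yet the search results are still delivered, the suggestions come from the cache, and the user-related part is simply switched off.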


Leaving the “all or nothing” approach to uptime behind and accepting that parts of the system and parts of our website can fail while others are still working fine leads to a more robust IT landscape, increased overall availability and a more stable system. Building resilient IT systems does not have to be rocket science after all. Starting small, for example by using Hystrix, and then adding more and more features of a resilient infrastructure will incrementally improve each service, the overall service interaction and the system architecture.

