August 30, 2017

When Hystrix Cannot Save Us

Hystrix with Spring Boot in 2 Minutes

Today, frameworks like Spring Boot make creating web applications really easy. Let's say we want to be sure that our application remains responsive even if a third-party system does not answer our requests. Here's how we can solve this problem with Spring Boot, in a few code snippets coming directly from our payment component:
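The original snippet is not preserved in this copy; a minimal sketch of the relevant setup looks like this (class and package names are illustrative, not from our payment component):

```java
// Minimal Spring Boot application with Hystrix circuit breakers enabled.
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.cloud.netflix.hystrix.EnableHystrix;

@SpringBootApplication
@EnableHystrix // turns on circuit breaker support (Hystrix must be on the classpath)
public class PaymentApplication {
    public static void main(String[] args) {
        SpringApplication.run(PaymentApplication.class, args);
    }
}
```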

The EnableHystrix annotation does the following, according to its JavaDoc:
All it does is turn on circuit breakers and let the autoconfiguration find the Hystrix classes if they are available (i.e., you also need Hystrix on the classpath).
Apart from that, we only need to add the HystrixCommand annotation to the methods that we want to cancel in case of a timeout or error:
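The annotated method from the original post is not preserved; a sketch of the pattern, with method, type, and field names as assumptions, might look like this:

```java
// The payment call is wrapped in a Hystrix command. On timeout or error,
// Hystrix aborts the command and invokes the fallback method instead.
@HystrixCommand(fallbackMethod = "paymentFallback")
public PaymentResult executePayment(PaymentRequest request) {
    return restTemplate.postForObject(paymentProviderUrl, request, PaymentResult.class);
}

// Called by Hystrix when executePayment fails or times out.
public PaymentResult paymentFallback(PaymentRequest request) {
    return PaymentResult.unavailable();
}
```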

That’s all. The circuit breaker will work as expected. Unfortunately, this is not always the case. If we use some default Spring settings, things can go terribly wrong.

Our production problem

One weekend, our payment servers suddenly stopped working. The users always got an error page, and no payments were possible.

In the log, we found several Hystrix timeout errors like this:
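The log excerpt itself is not preserved in this copy; a Hystrix timeout typically surfaces as a HystrixRuntimeException, roughly like this (command name and wording illustrative, not our original log):

```
com.netflix.hystrix.exception.HystrixRuntimeException: executePayment timed-out and fallback failed.
        ...
```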

At first, it seemed that Hystrix had done its job. Our payment provider had a problem, and our clients still got an informative error page. However, this doesn't explain why the payment servers stopped working. Our excellent colleague in Operations took on that question. Hystrix’s thread pool was full of threads like this:
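The thread dump is not reproduced in this copy, but the telling part of each stuck thread's stack looked essentially like this (thread name illustrative, frames abbreviated):

```
"hystrix-PaymentService-1" java.lang.Thread.State: RUNNABLE
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.socketRead(SocketInputStream.java)
        at java.net.SocketInputStream.read(SocketInputStream.java)
        ...
```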

So, the Hystrix worker threads were hanging in a native socket read. That is BAD. Why? Let’s look at a quote from Oracle's concurrency documentation:

A thread sends an interrupt by invoking interrupt on the Thread object for the thread to be interrupted. For the interrupt mechanism to work correctly, the interrupted thread must support its own interruption.

The problem with the native socket read is that it does not support its own interruption, unless we set the SO_TIMEOUT option on the socket.

In general, a thread cannot be interrupted if:
  1. It never leaves the RUNNABLE state. A socket read without a timeout is one example; an endless loop can be another.
  2. It does not handle the InterruptedException correctly. Catching the exception and doing nothing is NOT correct handling; the thread should restore its interrupted flag and exit.
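The first case is easy to demonstrate with plain JDK sockets: a thread blocked in a socket read without SO_TIMEOUT simply ignores interrupt(), and only closing the socket releases it. A small self-contained sketch (names illustrative):

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;

public class SocketInterruptDemo {
    public static void main(String[] args) throws Exception {
        // A local server that accepts the connection but never sends any data.
        try (ServerSocket server = new ServerSocket(0)) {
            Socket client = new Socket("localhost", server.getLocalPort());
            Socket serverSide = server.accept(); // keep the server side open, send nothing

            Thread reader = new Thread(() -> {
                try {
                    client.getInputStream().read(); // blocks: no data, no SO_TIMEOUT
                } catch (IOException e) {
                    // the read only ends when the socket itself is closed
                }
            });
            reader.start();
            Thread.sleep(200);   // let the reader reach the blocking read
            reader.interrupt();  // has no effect on the native socket read
            reader.join(500);
            System.out.println("still blocked after interrupt: " + reader.isAlive());

            client.close();      // closing the socket is what finally unblocks the read
            reader.join(2000);
            System.out.println("alive after close: " + reader.isAlive());
            serverSide.close();
        }
    }
}
```

This is exactly the situation our Hystrix worker threads were in: stuck in a RUNNABLE native read that no interrupt could cancel.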

Actually, it seems we are not the first to encounter this problem. Most probably, Spring Boot does not define default socket connect and read timeouts for REST templates.

Reproducing the problem

I am usually not happy until I can reproduce a bug before fixing it. A quick Google search did not turn up the REST template's default timeouts, so I asked my already-mentioned operations colleague how we could reproduce something like this. He suggested iptables. After some research, I found a way to simulate a case where the connection to the third-party system can be established, but no content is ever sent back. Let's assume the third-party system has the IP and uses port 443 to accept requests. The following command instructs iptables to cause a read timeout while connections can still be created (Linux admin rights are necessary to issue this command):
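The original command is not preserved in this copy; one way to achieve the effect is to let the TCP handshake through but drop the data-carrying packets coming back from the provider (handshake segments carry no PSH flag, response data normally does). Here 203.0.113.10 is a documentation placeholder, not the provider's real address:

```shell
# Allow the connection to be established, but drop every incoming
# data segment from the provider, so the client read blocks forever.
iptables -A INPUT -s 203.0.113.10 -p tcp --sport 443 --tcp-flags PSH PSH -j DROP
```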

This does the trick. The rule can be removed later by running the same command with "-D" instead of "-A".

With this configuration, we successfully reproduced the problem in our test system. By simply issuing some payments, the Hystrix threadpool became exhausted in the test system, too.


The solution was to configure the REST template to use a request factory with correctly set timeouts. Let’s say that we want to configure the socket timeouts in the application.yml like this:
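Our original application.yml is not preserved here; a sketch with assumed property names and values might look like this:

```yaml
# Property names and values are assumptions; they are read by the
# custom request factory bean described below.
custom:
  rest:
    connect-timeout-ms: 2000
    read-timeout-ms: 2000
```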

Then, we also need the following bean definition:
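The original bean definition is not preserved in this copy; a configuration sketch follows. The property names (custom.rest.connect-timeout-ms, custom.rest.read-timeout-ms) are assumptions, while the bean method name customHttpRequestFactory matches the one mentioned in the text:

```java
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.http.client.ClientHttpRequestFactory;
import org.springframework.http.client.SimpleClientHttpRequestFactory;
import org.springframework.web.client.RestTemplate;

@Configuration
public class RestClientConfig {

    @Bean
    public ClientHttpRequestFactory customHttpRequestFactory(
            @Value("${custom.rest.connect-timeout-ms}") int connectTimeoutMs,
            @Value("${custom.rest.read-timeout-ms}") int readTimeoutMs) {
        SimpleClientHttpRequestFactory factory = new SimpleClientHttpRequestFactory();
        factory.setConnectTimeout(connectTimeoutMs); // fail fast if no connection
        factory.setReadTimeout(readTimeoutMs);       // sets SO_TIMEOUT on the underlying read
        return factory;
    }

    @Bean
    public RestTemplate restTemplate(ClientHttpRequestFactory customHttpRequestFactory) {
        return new RestTemplate(customHttpRequestFactory);
    }
}
```

With a read timeout in place, the blocking socket read becomes interruptible in practice: it ends with an exception instead of hanging forever, so Hystrix worker threads can no longer pile up.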

That's all. If you are interested in the Spring magic that injects the properties into the request factory, see the Stack Overflow post referenced in the JavaDoc of the customHttpRequestFactory method. After creating the factory, we inject it into the RestTemplate. After applying this fix, the Hystrix problem was no longer reproducible. Our payment component survived the connection problems in the test environment and, fortunately, in production as well.

Lessons learned

  • We should check the default settings of third-party libraries. For a framework like Spring, it is hard to choose defaults that suit the majority of users, so I do not blame the developers for not defining a default timeout for the sockets behind the REST connections. I would probably have chosen a deliberately short timeout, such as 1 second, and included a link to the timeout documentation in the resulting exception; that way, everyone would hit the exception during the test phase and could set a timeout that fits their project. Either way, every framework has to make decisions about defaults, and it is a good thing if we do not have to discover them during the investigation of a production bug.
  • We can even reproduce weird errors if we want to. This remains the most reliable way to fix bugs.
  • Understanding the limits of a framework or programming language is good. In Java, a thread that never leaves the RUNNABLE state cannot be interrupted, so even Hystrix is unable to do this.
  • We might want to write integration tests for third party libraries. Once upon a time, I wanted to test how thread pool executors behave if the executed tasks throw an error, so I created a unit test for it. We had an architect who suggested that I was writing useless tests, since the JDK class works anyway. Let’s assume that he was right (meanwhile, we know that even the JDK contains bugs). I just wanted to write a test to make sure that it would behave as I expected it to. It remains up for debate whether such tests are useful. I think they are, to a certain degree, of course.
  • Bug fixing is still easiest with a professional attitude that focuses on teamwork between operations and development. We should unite our knowledge; then we can solve virtually anything. My operations colleague and I cooperated, and it worked: the bug was solved.
