January 22, 2016

Micro Benchmarking with JMH

Developing software and algorithms normally becomes more difficult with the amount of data that needs to be handled. Processing thousands of data records on modern hardware is no problem at all. However, when millions or hundreds of millions of records need to be processed, it can be quite difficult to meet the expected runtime performance. A batch job that should be executed every hour, for example, has to finish within this exact time frame. So how can we efficiently improve runtime, even in large-data contexts? Vertical scaling has its limits, and horizontal scaling can take more effort than optimizing the batch job itself. Alternatively, let’s look at micro benchmarking as a means to optimize certain parts of a program.

Let’s assume we have two sorted lists with integers and we want all integers that are in the first list but not in the second one. According to set theory we want A \ B (blue area in the figure below).

Java code for this may look as follows:
public static List<Integer> subtractWithDefaultLambda(List<Integer> list1, final List<Integer> list2) {
  return list1.stream().filter(i -> !list2.contains(i)).collect(Collectors.toList());
}
The runtime measurement of an example program typically looks like:
long start = System.currentTimeMillis();
// ... execute the code that should be benchmarked ...
long elapsed = System.currentTimeMillis() - start;
System.out.println(" time = " + elapsed + "ms");
It is composed of these steps:
  1. Record the start time
  2. Execute the code that should be benchmarked
  3. Record the stop time
  4. Calculate the difference
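Put together, the four steps look roughly like this as a runnable sketch; the summation workload is just a placeholder of our choosing:

```java
public class NaiveTiming {
    // The code whose runtime we want to measure (step 2).
    static long workload() {
        long sum = 0;
        for (int i = 0; i < 10_000_000; i++) {
            sum += i;
        }
        return sum;
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();   // 1. record the start time
        long result = workload();                  // 2. execute the code under test
        long stop = System.currentTimeMillis();    // 3. record the stop time
        long elapsed = stop - start;               // 4. calculate the difference
        System.out.println("time = " + elapsed + "ms (result=" + result + ")");
    }
}
```

Printing `result` at the end is deliberate; as discussed below, a result that is never used may be optimized away entirely.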

Problems of runtime measurement

These lines of Java code have some issues:

To begin with, currentTimeMillis() may not really return values with millisecond precision. Depending on your operating system, the granularity of the returned value can range from 1 ms on a modern system to 55 ms on Windows 95 (http://mindprod.com/jgloss/time.html#ACCURACY). Whether this becomes a problem depends on the task at hand: if the task itself takes 2 minutes, this deviation is tolerable, but if the task only takes 2 ms, it renders the measurement useless. System.nanoTime() can provide better resolution on modern operating systems, but comes with similar restrictions concerning older systems. Yet other methods like getCurrentThreadCpuTime() have their own problems, such as limited operating system support.
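The granularity of currentTimeMillis() can be probed empirically. A rough sketch (the observed tick size will vary by operating system):

```java
public class ClockGranularity {
    // Busy-wait across two changes of the reported millisecond value and
    // return the size of one observable "tick" in milliseconds.
    static long measureTickMillis() {
        long t0 = System.currentTimeMillis();
        long t1;
        while ((t1 = System.currentTimeMillis()) == t0) { /* spin */ }
        long t2;
        while ((t2 = System.currentTimeMillis()) == t1) { /* spin */ }
        return t2 - t1;
    }

    public static void main(String[] args) {
        System.out.println("observed tick: " + measureTickMillis() + "ms");
    }
}
```

Any measurement shorter than this tick is essentially noise.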

Another issue with this code is that Java uses a JIT (just-in-time) compiler which optimizes code during execution. A process is normally slower at the beginning and gets faster and faster throughout its lifespan. Methods, for example, are only compiled after a certain number of calls has been made; the threshold depends on the JVM and even on its type (client or server). Furthermore, a typical JVM loads a class upon its first use. Class loading involves a lot of I/O, which is usually much slower than mere memory access. These two JVM issues can be addressed using “warm-up” – executing the code several times before performing the actual measurement. However, there are also other issues like on-stack replacement and deoptimization, but since JIT compilation is such an extensive topic, these issues won’t be covered at this point.
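A manual warm-up simply runs the workload a few times before timing it, giving the JIT compiler a chance to compile the hot methods. The iteration counts here are arbitrary:

```java
public class WarmupSketch {
    // An arbitrary workload standing in for the code under test.
    static long workload() {
        long sum = 0;
        for (int i = 0; i < 1_000_000; i++) {
            sum += i;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Warm-up: execute the code several times so the JIT can kick in.
        for (int i = 0; i < 20; i++) {
            workload();
        }

        // Only measure after the warm-up.
        long start = System.currentTimeMillis();
        long result = workload();
        long elapsed = System.currentTimeMillis() - start;
        System.out.println("time = " + elapsed + "ms, result = " + result);
    }
}
```

Picking good warm-up counts by hand is guesswork, which is exactly what a framework takes off your hands.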

JVMs also have a feature called dead-code elimination (DCE): code whose result is not used, and which thus has no side effects, is removed to avoid wasting CPU cycles. If you only calculate the result in a benchmark and do nothing at all with it, the calculation may never be executed, and the benchmark does not measure what it was supposed to. This problem can be solved e.g. by writing the result to the output stream. However, depending on how fast the method is and how much output is produced, the printing itself can have a huge performance impact on the measurement.
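A cheap variant of this idea is to fold every result into a checksum and print it only once at the end – a hand-rolled stand-in for what a benchmark harness’s result consumer does. The workload here is invented for illustration:

```java
public class KeepAlive {
    // An arbitrary computation standing in for the code under test.
    static long compute(long x) {
        return x * x + 1;
    }

    static long runBenchmarkLoop() {
        long checksum = 0;
        for (long i = 0; i < 1_000; i++) {
            // Folding every result into the checksum gives the loop an
            // observable side effect, so the JIT cannot discard it as dead code.
            checksum += compute(i);
        }
        return checksum;
    }

    public static void main(String[] args) {
        // Printing the checksum once keeps all results "alive" without
        // paying the I/O cost of printing inside the measured loop.
        System.out.println("checksum = " + runBenchmarkLoop());
    }
}
```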

Another important point to consider is caching. Nearly everything has caches – hard disks, CPUs, etc. Consequently, you have to bear in mind that performance outcomes can vary depending on whether these caches are “cold” (do not contain any data yet) or “warm” (all data already loaded). With this in mind, how wrong is our example?
long start = System.currentTimeMillis();
// ... execute the code that should be benchmarked ...
long elapsed = System.currentTimeMillis() - start;
System.out.println(" time = " + elapsed + "ms");
Solving all of these issues is a hard task, but the quality of the measurement can easily be improved by using a micro benchmarking framework like JMH. Writing a benchmark with it is as straightforward as this:
@Benchmark
public void subtractWithDefaultLambda(Blackhole hole) {
   hole.consume(IntegerArrayUtil.subtractWithDefaultLambda(getAllIds, getIdsToSubtract));
}
The @Benchmark annotation tells the library that this method should be benchmarked. There can be several annotated methods; all of them will be measured separately. The Blackhole class consumes the output of our code, which prevents the JIT compiler from applying dead-code elimination.

The output of the measurement (on my machine) will be:
# Run progress: 0,00% complete, ETA 00:00:06
# Fork: 1 of 1
# Warmup Iteration 1: 1,000 ops/s
# Warmup Iteration 2: 1,000 ops/s
# Warmup Iteration 3: 1,000 ops/s
Iteration 1: 1,000 ops/s
Iteration 2: 1,000 ops/s
Iteration 3: 1,000 ops/s
Result "subtractArrayList":
0,182 ±(99.9%) 0,004 ops/s [Average]
(min, avg, max) = (0,181, 0,182, 0,182), stdev = 0,001
CI (99.9%): [0,178, 0,186] (assumes normal distribution)

# Run complete. Total time: 00:00:34
Benchmark                Mode  Cnt  Score   Error  Units
Main.subtractArrayList  thrpt    3  0,182 ± 0,004  ops/s
The result contains not only the actual performance in operations per second but also the variance: our method executes about 0,182 times per second on average, with a minimum of 0,181 times per second.
That is a lot more information compared to our first example. Furthermore, our code was executed several times, and warm-up iterations were run beforehand to fill caches and address the JIT compiler issues.

Putting it into practice

With this benchmark in place, our program can now be optimized. A single measurement alone does not provide much value, so instead of measuring only one method, we implement different ways to fulfill our task:

Our first solution:
public static List<Integer> subtractWithDefaultLambda(List<Integer> list1, final List<Integer> list2) {
   return list1.stream().filter(i -> !list2.contains(i)).collect(Collectors.toList());
}
This works – but it is not the fastest option. Knowing that the lists are sorted, and that contains() on an ArrayList has a complexity of O(n), we can optimize the code to:
public static List<Integer> subtractArrayListBinarySearch(List<Integer> list1, final List<Integer> list2) {
   // binarySearch returns a negative index exactly when the element is absent
   return list1.stream().filter(i -> Collections.binarySearch(list2, i) < 0).collect(Collectors.toList());
}
Binary search has a complexity of O(log n), which is far better than O(n).
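Before comparing speed, it is worth checking that both variants return the same result. A quick self-contained equivalence sketch (the sample data is ours; both input lists are sorted, as the binary-search variant requires):

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

public class SubtractEquivalence {
    // Naive variant: contains() scans list2 linearly for every element.
    public static List<Integer> subtractWithDefaultLambda(List<Integer> list1, final List<Integer> list2) {
        return list1.stream().filter(i -> !list2.contains(i)).collect(Collectors.toList());
    }

    // Optimized variant: requires list2 to be sorted; binarySearch returns
    // a negative index exactly when the element is absent.
    public static List<Integer> subtractArrayListBinarySearch(List<Integer> list1, final List<Integer> list2) {
        return list1.stream().filter(i -> Collections.binarySearch(list2, i) < 0).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> a = Arrays.asList(1, 3, 5, 7, 9);
        List<Integer> b = Arrays.asList(3, 7);
        System.out.println(subtractWithDefaultLambda(a, b));      // [1, 5, 9]
        System.out.println(subtractArrayListBinarySearch(a, b));  // [1, 5, 9]
    }
}
```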
Let’s compare the results:
# Run complete. Total time: 00:00:42

Benchmark                            Mode  Cnt   Score    Error  Units
Main.subtractArrayListBinarySearch  thrpt    3  36,752 ± 61,920  ops/s
Main.subtractWithDefaultLambda      thrpt    3   0,179 ±  0,036  ops/s
By exploiting an attribute of our data (it is sorted), we improved the performance by a factor of roughly 200. Small changes like this are the main use case for micro benchmarks: making minor changes and observing the results.


Benchmarking and measuring the performance of code and algorithms has some pitfalls that have to be avoided to get a proper result. JIT compilation, caching and operating system limitations all have an impact on the results. Frameworks for micro benchmarking like JMH can therefore improve the quality of runtime measurements without adding further complexity to the code. They also provide more data and insights compared to “normal” hand-rolled solutions. However, keep in mind that optimizing some parts of your code may not result in better overall performance. Tuning algorithms and methods always has to be seen in the overall context of the application.

The code of these examples can be found in our Git repository.

