Monitoring metrics using Prometheus

Benjamin Hubert
willhaben Tech Blog
12 min read · Sep 11, 2018


Some time ago, I wrote about StatsD for monitoring performance metrics of applications. Back then, we found out that StatsD is based on some nice and simple ideas, which makes it a very lightweight component that will never ever influence the performance of your business application. On the other hand, we missed some features, such as a tagging mechanism for grouping metrics by specific parameters.

More than six months went by and we played around with some other tools for getting insights into our applications. We gathered some experience especially with Prometheus, so today I think it is time to share some of our new knowledge with you.

This post is quite long, so if you’re already familiar with Prometheus, you might want to find out why we don’t use its official client libraries and why we use Micrometer instead. You’re already familiar with Micrometer, too? Then you might be interested in some noteworthy practices. Or, of course, read on and see why we need all that here at willhaben.

Motivation

So, what is it all about? Behind the scenes, willhaben is more than just a website. A large number of services run tons of features. A team of operators ensures that all our services are up and running on a 24/7 basis, but that is only one part of the business: for us developers it’s also interesting to see whether these services do what we expect them to do and how these features perform.

“gray satellite disc on field” by Donald Giannatti on Unsplash

May I introduce…

Prometheus [ˌpʀoˈmeːtɔɪ̯s] is an open source systems monitoring and alerting toolkit. It brings a multi-dimensional data model with time series data identified by metric name and key/value pairs. To access this data, Prometheus comes with a flexible query language to leverage its dimensionality. (Source: https://prometheus.io/docs/introduction/overview/)

A list of client libraries offers easy integration for our applications in all major languages, but be warned: our experiences with the Java client libraries were not really satisfying. Read on for more information.

Small pieces to get an overview

What we want are clear statistics about our application’s performance. Prometheus provides a simple and fast way to get these insights and all it needs is just some detailed raw data. Our application has to provide this raw data. That might sound difficult, but all it needs is a simple set of four metric types:

  • Gauge
  • Counter
  • Histogram
  • Summary

A Gauge [geyj] is a simple numeric value that can go up and down and exposes a current state of the application. This could be the CPU load or the megabytes of memory currently used by the application. Depending on your application you might show the number of currently logged in users or in our case the number of active adverts.

Then there is the Counter, which is also a simple numeric value, but in contrast to a Gauge value, it cannot go down, so it’s a monotonically increasing value. At application startup, every Counter is initialized to 0, and it cannot be reset except at the next application startup. So, what could we do with a Counter? Of course, counting things! You might want to count the number of events processed by your application, the number of e-mails sent out, the number of errors in your log, the number of logins, the number of handled HTTP requests,… oh, I could count everything in the world.

Okay, you might be wondering now what benefit you have from knowing that since startup your application performed one action 2,492,352,286 times. Right, that’s not really helpful on its own. But Prometheus remembers the value at every point in time it scraped. With the current value, the value one minute ago, two minutes ago, three minutes ago, and so on, it can calculate how fast this value is increasing. In other words, you can, for example, derive the number of events per minute from a Counter.

Histogram and Summary enable us not only to count things but also to track observations, such as request durations or response sizes. In Histograms, events are collected into buckets, which helps us monitor service level agreements (SLAs). Behind the scenes, these metric types use simple Counters, as described above, so they can also be used to calculate average values, such as the performance of our application over the last x minutes.

I don’t want to dive too deep into this topic now. If you want to know more, check out the official Prometheus documentation about metrics and the great introduction to histograms and summaries.

Tagging the data

One important feature of Prometheus — and the feature I personally missed most with StatsD — is the ability to label metrics. This allows you to put additional information about the observed data into your metrics and with this lets you categorize your numbers.

So, let’s say you’re recording all the HTTP requests coming into your application. You measure their execution time and export this information for Prometheus. In the end, you’re able to tell how many requests came in and how long they took.

Now you have an average for your service, great! You may want to improve the performance of your service and know which of the HTTP calls take longer and which of them need some optimization. It would be great if you could filter your observations by some meta information — and that’s where labels (or tags) come in.

Every measurement can come with some key-value-pairs (the labels) providing meta information about the observed value. In our example, when recording HTTP requests, one observation could apply the following labels:

  • http_method: POST
  • http_status: 200
  • http_url: /upload

Additionally, Prometheus itself can be configured to add some more labels to our metrics, such as:

  • instance: server1
  • environment: production

When working with labels, keep a few things in mind: Prometheus is a monitoring toolkit. It’s not a logging framework. It stores average values and overall counts, but it is not intended for storing information about one single event. It can help you to find general issues in your code that affect many users, but it won’t help you to debug the problem of one single user.

Furthermore, it’s not intended for storing user-generated content, not even usernames or user IDs. What would happen if you passed a user’s ID as a Prometheus label? Prometheus would create a separate time series for every single user in your application. 💥 Boom.

Scraping all services

Ok, your service counts everything and provides information about the current load and usage. But how does this information get from your application to Prometheus? There are different ways to achieve that, but the simplest one is to provide an HTTP endpoint in your application that exposes all the data.

I already mentioned that Prometheus remembers all the values over time, so what it does to receive these values is to scrape them. That simply means it periodically (for example, every 15 seconds) accesses the HTTP endpoint of your application and indexes all the information for later use. By default, Prometheus retains this data for 15 days.
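To give a rough idea of what such an endpoint returns, here is a short, made-up excerpt in the Prometheus text format: one line per time series, with the labels in curly braces and the current value at the end.

```
# HELP http_requests_total Total number of handled HTTP requests
# TYPE http_requests_total counter
http_requests_total{http_method="POST",http_status="200",http_url="/upload"} 1027
# HELP jvm_memory_used_bytes Memory currently used by the JVM
# TYPE jvm_memory_used_bytes gauge
jvm_memory_used_bytes{area="heap"} 52428800
```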

Interlude with simpleclient

Now that we know the basics about Prometheus, let’s take a look at how all this is integrated into our applications. Here at willhaben, almost all applications are written in Java, so we checked out the official Prometheus client library, also known as client_java or simpleclient. But be warned! I don’t recommend it.

At first sight, simpleclient is a lightweight component that does not require any other dependencies, and it’s pretty easy to use. Sounds perfect to me. But after running one of our core applications with it in our test environment for a few weeks, I found some strange issues, especially with the default metrics. I saw metrics not following the official naming guidelines and, even worse, some of these metrics passed user-generated content as labels, which is a very bad idea, as described above.

beautiful facade by Dmitrij Paskevic on Unsplash

Micrometer

Luckily, there is a great alternative for Java applications out there: Micrometer, a vendor-neutral application-metrics facade supporting numerous monitoring systems, including Prometheus. So, what slf4j is for logging, Micrometer is for metrics. And this library is not a kid anymore, it has already joined the big players: starting with Spring Boot 2, Spring Boot Actuator provides dependency management and auto-configuration for Micrometer.

Almost all of our applications here at willhaben use Spring, the newer services use Spring Boot, and we even already have some applications on top of Spring Boot 2.

Thanks to auto-configuration, the integration into Spring Boot 2 is pretty easy. All you need to do is add two dependencies to your Maven project:
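A minimal sketch of what these two dependencies look like in the pom.xml, assuming Spring Boot’s dependency management supplies the versions:

```xml
<dependencies>
    <!-- Spring Boot Actuator: auto-configures Micrometer and the management endpoints -->
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-actuator</artifactId>
    </dependency>
    <!-- Micrometer registry that exposes the collected metrics in the Prometheus format -->
    <dependency>
        <groupId>io.micrometer</groupId>
        <artifactId>micrometer-registry-prometheus</artifactId>
    </dependency>
</dependencies>
```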

By default, this will already enable many metrics for monitoring, including…

  • the JVM including memory usage and buffer pools, garbage collection, threads and class loaders,
  • CPU load,
  • file descriptors,
  • log messages (counting errors, infos,…),
  • uptime statistics and
  • Tomcat performance

The Prometheus endpoint is not exposed over HTTP by default. So, one last thing needs to be done before we can start collecting metrics: we have to add Prometheus to our exposure list.
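A sketch of the corresponding application.yml entry, keeping the health and info endpoints exposed as well:

```yaml
management:
  endpoints:
    web:
      exposure:
        include: health, info, prometheus
```

With that in place, the metrics are served at /actuator/prometheus, which is the path Prometheus scrapes.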

As mentioned before, our application has to provide access to these metrics over HTTP, so when adding this to a non-web application, we have to add one additional dependency to enable the web server functionality:
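Assuming the standard servlet stack, that additional dependency is the web starter:

```xml
<dependency>
    <!-- Brings in an embedded web server so the /actuator/prometheus endpoint can be served -->
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-web</artifactId>
</dependency>
```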

One, two, three — our first custom counter

While the default metrics already provide a lot of interesting insights into our applications, we also want to monitor our own code and all the complex decisions our applications make all day long. Let’s say we have a task that processes events and we want to monitor its performance:

First, we could simply count all the interactions with our code:
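A minimal sketch using Micrometer’s static Metrics facade; the class, the method and the error branch are made up for illustration, while the metric name and the result label match the description below:

```java
import io.micrometer.core.instrument.Metrics;

public class EventProcessingTask {

    // Hypothetical worker method: processes one event and counts the outcome.
    public void processEvent(String payload) {
        try {
            // ... business logic for handling the event ...
            Metrics.counter("message.sent", "result", "success").increment();
        } catch (RuntimeException e) {
            Metrics.counter("message.sent", "result", "error").increment();
            throw e;
        }
    }
}
```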

Remember, we’re calling a facade here, so this call will inform all configured monitoring systems about this event. In our case, Micrometer will transform the metric name message.sent to the appropriate Prometheus name message_sent_count and will apply the label result with its value — success in this case.

Note: When using Prometheus, the labels sent with a metric key must always be the same, so in this case, every time your application increases the counter named `message.sent`, it has to send the additional label `result` with a value — and no other labels.

Ready, set, go! — our first custom timer

Let’s say you want to measure the time it takes to process some events in your application. This can be achieved very easily, too:
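A sketch of such a timer, again using the static facade; the class and method names are made up, while the metric name matches the output described below:

```java
import io.micrometer.core.instrument.Metrics;
import io.micrometer.core.instrument.Timer;

import java.util.List;

public class EventProcessingTask {

    // Timer registered via the static facade.
    private final Timer timer = Metrics.timer("mytask.duration");

    // Hypothetical worker method: records how long processing one batch of events takes.
    public void processEvents(List<String> events) {
        timer.record(() -> {
            // ... business logic for processing the events ...
        });
    }
}
```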

This observation will result in (at least) three different metrics, called mytask_duration_seconds_count, mytask_duration_seconds_sum and mytask_duration_seconds_max.

Configuring the buckets

To collect events in Histogram buckets, we could configure our metrics programmatically in our Java code, but since we are in a Spring Boot application, this can, of course, be configured very easily and dynamically for every metric in your application.yml:
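A sketch of what this might look like for the timer from above, using the Spring Boot 2 property names: percentiles-histogram publishes a full set of buckets, while sla adds explicit bucket boundaries you can build SLA queries on.

```yaml
management:
  metrics:
    distribution:
      percentiles-histogram:
        mytask.duration: true
      sla:
        mytask.duration: 100ms, 500ms, 1s
```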

When using Spring Boot 1, you have to enclose your metric name in square brackets:
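A sketch of the same configuration for Spring Boot 1 with micrometer-spring-legacy; treat the exact property layout as an assumption, the important part is the bracket syntax that keeps the dots in the metric name from being read as nested keys:

```yaml
management:
  metrics:
    distribution:
      percentiles-histogram[mytask.duration]: true
      sla[mytask.duration]: 100ms, 500ms, 1s
```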

photo by m0851 on Unsplash

Noteworthy practices

I don’t want to dive much deeper into the basics of Prometheus and Micrometer, because both of them provide very good documentation. If you have come to like them and want to integrate them into your own application, check out the official documentation of both projects (https://prometheus.io/docs/ and https://micrometer.io/).

What might be more interesting for you now is learning about how we use Micrometer and Prometheus here at willhaben. So, I want to share some noteworthy practices with you…

Enable interesting metrics

When you add Micrometer to Spring Boot 2, it will publish many interesting metrics about your application by default, but there are also some interesting metrics shipped with Micrometer that have to be enabled manually. For this reason, I suggest walking through the Micrometer repository and checking if there is interesting stuff for your application. In our case — since we use Hystrix a lot — the HystrixMetricsBinder is one such example.

To activate Micrometer metrics for Hystrix, we simply have to create an instance of the HystrixMetricsBinder. In Spring Boot, this is as simple as adding the following Bean:
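A minimal sketch of such a configuration class; HystrixMetricsBinder ships with micrometer-core:

```java
import io.micrometer.core.instrument.binder.hystrix.HystrixMetricsBinder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class HystrixMetricsConfiguration {

    // Spring Boot binds every MeterBinder bean to the auto-configured MeterRegistry.
    @Bean
    public HystrixMetricsBinder hystrixMetricsBinder() {
        return new HystrixMetricsBinder();
    }
}
```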

Once instantiated, it will be automatically registered in your global Micrometer registry.

Common labels

As explained above, every Prometheus metric can provide various labels (or tags) to describe the measurement. For example, a metric measuring the request duration of all your HTTP requests might come with the following labels:

  • http.verb: "GET"
  • http.status.code: "200"

Note: When defining labels in Micrometer, you should use a single dot as separator, for example "http.status.code". When publishing these labels to Prometheus, Micrometer will automatically convert them to snake case, such as "http_status_code".

Furthermore, the Prometheus scraper might add some information (if configured to do so) to tell you which server/instance this particular metric has been read from:

  • env: "prod"
  • instance: "192.168.1.1:9001"
  • job: "your-application"

We saw that it’s also a good practice to add some additional information about your application with every metric that is generated by your application, such as the application’s name and some details about the running build. With Micrometer, this can be configured in one single place and does not have to be set for every single metric.

For our Spring Boot applications here at willhaben, we simply add the following Configuration class:
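A sketch of such a configuration class; the property spring.application.name used for the application label is an assumption, any configuration property with a your-application-name default works the same way:

```java
import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.boot.actuate.autoconfigure.metrics.MeterRegistryCustomizer;
import org.springframework.boot.info.BuildProperties;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class MetricsCommonTagsConfiguration {

    // Adds the same set of labels to every metric published by this application.
    @Bean
    public MeterRegistryCustomizer<MeterRegistry> commonTags(
            @Value("${spring.application.name:your-application-name}") String applicationName,
            BuildProperties buildProperties) {
        return registry -> registry.config().commonTags(
                "application", applicationName,
                "build.group", buildProperties.getGroup(),
                "build.artifact", buildProperties.getArtifact(),
                "build.version", buildProperties.getVersion());
    }
}
```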

With this, you will get four additional Prometheus labels:

  • application — The name of your application. In this case, it’s read from the application’s configuration with a default value of your-application-name.
  • build.group — The group from your pom.xml.
  • build.artifact — The artifact from your pom.xml.
  • build.version — The version from your pom.xml. In our case here at willhaben, this value is overwritten by the build server, so that it also contains the build number in addition to the version.

This configuration class depends on a bean of type org.springframework.boot.info.BuildProperties. This bean is generated by activating the build-info goal of the spring-boot-maven-plugin in the pom.xml:
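The relevant part of the pom.xml looks roughly like this:

```xml
<build>
    <plugins>
        <plugin>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-maven-plugin</artifactId>
            <executions>
                <execution>
                    <goals>
                        <!-- Generates META-INF/build-info.properties, which backs the BuildProperties bean -->
                        <goal>build-info</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
```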

Reuse available Grafana dashboards

With Micrometer, our applications provide the same basic metrics as many other applications out there, and in many cases we are not the first who want to visualize them. Here at willhaben, we use Grafana for visualization, and Grafana supports exporting and importing full dashboards.

Dashboards can be shared at grafana.com, and there are already some nice ones for Micrometer metrics. Even if they might not immediately fit all of your needs, they are, in most cases, a good starting point.

Display metrics on a screen

Here at willhaben, every scrum team has its own wall-mounted Smart TV screen that the team can use for their work. With Grafana, we can build special dashboards, providing up-to-date information about the health of our applications. Red colors indicate bad states and motivate us to fix these issues :-)

We are also already thinking about integration with Jira, Bitbucket, our build server, and maybe even other development tools, so that we can also display information about the progress of the current sprint, status of a project, open pull requests, bad build states and, hopefully, much more. Let’s see what the future brings :-)

Conclusion

Micrometer, together with Prometheus, is a great tool and has already helped a lot in monitoring the performance of some of our most accessed applications. In particular, it helps us to see all the complex decisions our applications are making, to know which parts of an application demand performance improvements and, all in all, to detect issues with our applications at an early stage.

From a developer’s perspective, we can now specify directly in our code which metrics should be measured and how they should be indexed. We don’t have to parse this information out of huge log files anymore, which often ended up in broken statistics and a lot of overhead — quite the opposite: we could remove log output and reduce the size of our log files while getting more insights into our applications, thanks to Prometheus.

Besides that, we haven’t seen any performance impact so far from integrating Micrometer into our applications. Even though this technology is far more complex than StatsD, it has never influenced our daily business. Furthermore, it turned out that Prometheus does a great job at storing metrics: querying our data is amazingly fast and requires much less disk space than expected.

Micrometer definitely made it onto our roadmaps for many of our services here at willhaben. We also want to improve our Grafana boards and take a closer look at its alerting feature. I’m pretty sure we can share more about this topic with you soon.
