Monitoring metrics using StatsD

Benjamin Hubert
willhaben Tech Blog
7 min read, Sep 20, 2017


With more than 4.5 million public ads and an incredible number of user interactions, we have a huge number of events to process each day: new incoming content, user registrations, messages, notifications. In short, all the events that drive our business of bringing buyers and sellers together.

For this, we run dozens of microservices behind the scenes: small applications that do nothing but process events and don’t provide any user interface. Monitoring such services is more than just checking whether they are up and running, and more than being notified when memory consumption grows. Usually we also want to know what our application is doing and how it is performing.

The traditional way

Of course, all of our services write thousands of log lines every second. These lines already contain all the information we need, so we read, parse, count and index them using our ELK stack and services like Sumo Logic. This results in data that can be used to create statistics and numbers that we can visualize in nice graphs and dashboards.

This solution has some small issues:

  • Log files are growing with every metric we want to measure.
  • Metrics need to be parsed from log lines.
  • Log lines can interfere with other log lines (e.g., similar output for different metrics).
  • Changes in log lines can break statistics.
  • The parser needs to be configured.
  • Every log line comes with some overhead — text that is not relevant for every metric.

The StatsD way

This is where StatsD comes into play. Back in 2008, Flickr posted about Counting & Timing and shared their ideas for measuring things. Two years later, the developers at Etsy picked up the idea and released their first implementation of the original StatsD daemon, a simple network daemon that listens for statistics data such as counters and timers, aggregates it, and forwards it to one or more backend services.

➊ But first things first. With StatsD everything starts in your own application. Let’s say you have a function and you want to monitor how often this function is being called. With StatsD you simply send out a UDP packet containing some arbitrary unique key and the instruction to increment the counter for it.

➋ Next, some running StatsD service receives this data, aggregates it over time, and pushes it to some server that is designed to store time-based data. How this data is being sent depends on the metrics server you use here.

➌ Now that you have all your data in your metrics server you can analyze it, build graphs upon it, show them in nice dashboards and monitor them live at runtime.
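To make step ➋ a bit more tangible, here is a minimal sketch of what "aggregating over time" means for counters. It is plain Java, not the real StatsD daemon: the class name CounterAggregator is hypothetical, only counter lines of the form key:value|c are handled (no timers, gauges or sample rates), and flush() would be called by a timer once per flush interval before the totals are pushed to the metrics server.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of the aggregation step, not Etsy's implementation.
public class CounterAggregator {
    private final Map<String, Long> counters = new TreeMap<>();

    /** Accepts one line such as "ads.created:1|c" and adds it to the running total. */
    public synchronized void accept(String line) {
        int colon = line.lastIndexOf(':');
        int pipe = line.indexOf('|', colon + 1);
        if (colon < 1 || pipe < 0 || !line.substring(pipe + 1).equals("c")) {
            return; // ignore everything that is not a plain counter line
        }
        try {
            long delta = Long.parseLong(line.substring(colon + 1, pipe));
            counters.merge(line.substring(0, colon), delta, Long::sum);
        } catch (NumberFormatException e) {
            // malformed value: drop the line, never fail
        }
    }

    /** Returns the totals collected since the last flush and resets them. */
    public synchronized Map<String, Long> flush() {
        Map<String, Long> snapshot = new TreeMap<>(counters);
        counters.clear();
        return snapshot;
    }

    public static void main(String[] args) {
        CounterAggregator aggregator = new CounterAggregator();
        aggregator.accept("ads.created:1|c");
        aggregator.accept("ads.created:1|c");
        aggregator.accept("messages.sent:5|c");
        System.out.println(aggregator.flush()); // {ads.created=2, messages.sent=5}
        System.out.println(aggregator.flush()); // {}
    }
}
```

The application only ever sends tiny increments. The summing happens in this separate process, which is why the metrics server receives one value per key and interval instead of one per event.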

Tiny footprint

Every measurement influences our application, so the calls for storing statistical data must be as cheap as possible. Ideally, they do not affect the execution time of our processes at all. On top of that, they should never stop the execution or cause additional problems. We want to monitor our business, but our monitoring should never influence our business. StatsD is designed to meet these criteria.

  • It’s a pretty simple service — the original Etsy code for the StatsD server was just 127 lines long.
  • The protocol is very simple and text-based.
  • The data is aggregated and stored outside of your application and will not affect your performance.
  • Data is sent using UDP, which means you just fire off the information and don’t have to care whether it’s consumed or not.
  • Your application sends the data as required, so it can be indexed directly — without further parsing.

Because of this simplicity, you could talk to StatsD without any upstream dependencies. Even if you use a client library for StatsD, it will be pretty small and will most likely come without any other third-party code.
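As a rough illustration of how little is needed, the following sketch talks to StatsD using nothing but the JDK. The class name TinyStatsD, the host, the port 8125 (the usual StatsD default) and the key ads.created are assumptions for the example. A real client would reuse one socket and send from a background thread instead of opening a socket per call.

```java
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;

// Hypothetical dependency-free sketch; not a replacement for a real client library.
public class TinyStatsD {

    /** Builds the text line for "increment this counter by one". */
    public static String incrementLine(String key) {
        return key + ":1|c";
    }

    /** Sends one line via UDP. Returns false instead of throwing on any failure. */
    public static boolean send(String host, int port, String line) {
        byte[] payload = line.getBytes(StandardCharsets.UTF_8);
        try (DatagramSocket socket = new DatagramSocket()) {
            InetAddress address = InetAddress.getByName(host);
            socket.send(new DatagramPacket(payload, payload.length, address, port));
            return true;
        } catch (Exception e) {
            return false; // monitoring must never break the business logic
        }
    }

    public static void main(String[] args) {
        // Fire and forget: nobody has to listen on the other side.
        send("localhost", 8125, incrementLine("ads.created"));
    }
}
```

Because UDP is connectionless, the send succeeds whether or not a StatsD daemon is running, which is exactly the "don't care if it's consumed" property described above.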

Integration with our application

Enough theory, let’s check out how all this works in practice. On Etsy’s wiki page you can find a list of client implementations for many different programming languages. Many of the services here at willhaben are Spring Boot Java applications. To integrate StatsD with our service, we simply used timgroup’s java-statsd-client, a lightweight Java implementation with no further dependencies.

This library comes with a so-called NonBlockingStatsDClient. It’s a good idea to create one single instance for your application with some modifiable configuration parameters:
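A minimal sketch of such a setup, assuming a Spring Boot application with timgroup’s java-statsd-client on the classpath. The class name, the property names (statsd.prefix, statsd.host, statsd.port) and their defaults are hypothetical; only the NonBlockingStatsDClient constructor and its stop() method come from the library.

```java
import com.timgroup.statsd.NonBlockingStatsDClient;
import com.timgroup.statsd.StatsDClient;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class StatsDConfiguration {

    // One shared client for the whole application.
    // The property names and default values below are assumptions for this sketch.
    @Bean(destroyMethod = "stop")
    public StatsDClient statsDClient(
            @Value("${statsd.prefix:myapp}") String prefix,
            @Value("${statsd.host:localhost}") String host,
            @Value("${statsd.port:8125}") int port) {
        // The prefix is prepended to every key this client sends.
        return new NonBlockingStatsDClient(prefix, host, port);
    }
}
```

Declaring stop() as the destroy method lets Spring shut the client down cleanly when the application context closes.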

Now that we have our instance, we can inject it into one of our service classes and use it directly:
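For illustration, a service could look like the sketch below. It assumes a StatsDClient bean is available for injection. The class AdImportService, its method and the key ads.imported are made up for this example; incrementCounter is part of the timgroup client.

```java
import com.timgroup.statsd.StatsDClient;
import org.springframework.stereotype.Service;

// Hypothetical service; only the StatsDClient call is taken from the library.
@Service
public class AdImportService {

    private final StatsDClient statsDClient;

    public AdImportService(StatsDClient statsDClient) {
        this.statsDClient = statsDClient;
    }

    public void importAd(String adId) {
        // Count every call; the client sends this in the background via UDP.
        statsDClient.incrementCounter("ads.imported");

        // ... the actual business logic would go here ...
    }
}
```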

This is the simplest use case for recording statistics. It tells StatsD to increment the counter for the given item by one. As already mentioned, the StatsD protocol is text-based, so you can easily sniff for your packets or write a simple UDP server that prints them. You should now see a UDP packet with the following content:
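The exact content depends on your prefix and key. Assuming a hypothetical prefix myapp and the key ads.imported, the packet would read myapp.ads.imported:1|c. A simple UDP server for printing such packets can be sketched with plain JDK classes; the class name UdpPrinter and the port 8125 are assumptions.

```java
import java.io.IOException;
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.nio.charset.StandardCharsets;

// Hypothetical debugging helper: prints every UDP packet it receives.
public class UdpPrinter {

    /** Turns the payload of a received datagram into text. */
    public static String decode(DatagramPacket packet) {
        return new String(packet.getData(), packet.getOffset(), packet.getLength(),
                StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Listen on the port your StatsD client sends to (8125 is assumed here).
        try (DatagramSocket socket = new DatagramSocket(8125)) {
            byte[] buffer = new byte[1500];
            while (true) {
                DatagramPacket packet = new DatagramPacket(buffer, buffer.length);
                socket.receive(packet); // blocks until a packet arrives
                System.out.println(decode(packet));
            }
        }
    }
}
```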

What can I measure?

So, this was a very simple example, but that’s pretty much everything you need to know here! Let’s take a quick look at some other methods provided by the StatsD client:

  • incrementCounter(String aspect) … Increments a counter (as we did above).
  • decrementCounter(String aspect) … Does the opposite.
  • count(String aspect, long delta) … This allows you to adjust the counter by a given delta.
  • recordGaugeValue(String aspect, double value) … Records a fixed value; for example, you could record the current memory usage.
  • recordExecutionTime(String aspect, long timeInMs) … This lets you monitor the performance of your code.

Check out the interface to see all the available functions. Even though this is the Java client implementation, it should look pretty similar in your favorite language.
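On the wire, these methods differ mainly in the type suffix of the message: c for counters, g for gauges and ms for timers. The sketch below builds those lines by hand to show the pattern. It is not the client’s code: the class StatsDLines and the keys are hypothetical, the prefix is left out, and gauges are limited to whole numbers to keep it short.

```java
// Hypothetical helper that mirrors the StatsD line format for three metric types.
public class StatsDLines {

    /** Counter: adjusts the counter by delta (use -1 to decrement). */
    public static String count(String aspect, long delta) {
        return aspect + ":" + delta + "|c";
    }

    /** Gauge: records a fixed value. */
    public static String gauge(String aspect, long value) {
        return aspect + ":" + value + "|g";
    }

    /** Timer: records a duration in milliseconds. */
    public static String timing(String aspect, long timeInMs) {
        return aspect + ":" + timeInMs + "|ms";
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        // ... the code you want to measure would run here ...
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;

        System.out.println(timing("rest.getAd.time", elapsedMs));
        System.out.println(count("ads.created", -1));     // ads.created:-1|c
        System.out.println(gauge("memory.used.mb", 512)); // memory.used.mb:512|g
    }
}
```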

Collecting metrics like a pro

Now that we have seen the basics, let’s take a deeper look at how we could improve collecting metrics. I’m sure, in your project, you have some classes where it would make sense to collect metrics for every method call. For example, here at willhaben, we had some REST services with dozens of methods. We didn’t want to add the StatsD call to every single method, so we used AspectJ to surround them:
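Such an aspect could look roughly like the following sketch. It assumes Spring AOP with AspectJ annotations (for example via spring-boot-starter-aop) and an injectable StatsDClient bean. The package com.example.web, the aspect’s class name and the key layout are hypothetical.

```java
import com.timgroup.statsd.StatsDClient;
import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.Around;
import org.aspectj.lang.annotation.Aspect;
import org.springframework.stereotype.Component;

@Aspect
@Component
public class ControllerMetricsAspect {

    private final StatsDClient statsDClient;

    public ControllerMetricsAspect(StatsDClient statsDClient) {
        this.statsDClient = statsDClient;
    }

    // Matches every public method of RandomController (package name is assumed).
    @Around("execution(public * com.example.web.RandomController.*(..))")
    public Object recordCall(ProceedingJoinPoint joinPoint) throws Throwable {
        String aspect = "RandomController." + joinPoint.getSignature().getName();
        long start = System.currentTimeMillis();
        try {
            return joinPoint.proceed(); // run the actual controller method
        } finally {
            // Recorded for successful and failed calls alike.
            statsDClient.incrementCounter(aspect + ".calls");
            statsDClient.recordExecutionTime(aspect + ".time",
                    System.currentTimeMillis() - start);
        }
    }
}
```

Putting the recording into the finally block means a call is counted and timed even when the method throws, and the exception still propagates unchanged.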

Now, every method call in RandomController gets recorded. It has been working like a charm for some weeks now in our application.

Creating charts

In our project, we used Datadog to collect and analyze all the data. It’s a hosted cloud service with some nice features, but I don’t want to promote this product too much now. I have also seen some downsides and missed some features — features that I really like when building dashboards in Grafana.

So what’s next? We’re thinking about setting up Graphite in our datacenter, and maybe we should also take a look at Prometheus. Let’s see what the future brings.

What’s not so nice…

As we saw, StatsD comes with many great ideas, but despite all the simplicity, we also found some missing features. For example, sometimes you might want to group metrics by some parameters. Let’s say you collected the response times for all the processed calls to one single REST method; maybe some of these calls failed with some HTTP error, and especially those calls took much longer and now mess up your average.

In this particular case, it would be great if we could also send some details together with our metrics, such as http.status=200. When drawing the graphs, we could filter them and calculate the average over the successful calls only.

In fact, this is one of the nice features we already found in Datadog. They provide their own enhanced version of StatsD, and this client allows us to send tags with every message:
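On the wire, DogStatsD (Datadog’s StatsD dialect) appends the tags to the regular message after a |# marker, separated by commas, with key and value joined by a colon. The sketch below only builds that text line in plain Java rather than calling Datadog’s client; the class name TaggedLines and the metric key are hypothetical.

```java
// Hypothetical helper that builds a DogStatsD-style timer line with tags.
public class TaggedLines {

    /** Timer line, optionally followed by "|#tag1,tag2,...". */
    public static String timing(String aspect, long timeInMs, String... tags) {
        String line = aspect + ":" + timeInMs + "|ms";
        return tags.length == 0 ? line : line + "|#" + String.join(",", tags);
    }

    public static void main(String[] args) {
        // A successful call and a failed one, distinguishable by their tag.
        System.out.println(timing("rest.getAd.time", 42, "http.status:200"));
        // rest.getAd.time:42|ms|#http.status:200
        System.out.println(timing("rest.getAd.time", 3100, "http.status:500"));
        // rest.getAd.time:3100|ms|#http.status:500
    }
}
```

With the status attached to each measurement, a dashboard can average only the lines tagged http.status:200 and keep the slow failures out of the picture.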

Prometheus comes with a similar feature, but sadly tagging is not part of the original StatsD implementation and is not included in the timgroup StatsD client.

When should we use StatsD and when not?

StatsD is a very simple tool that comes with almost no overhead, so it’s easy to integrate and does not affect the rest of your application. It’s a great tool for collecting technical metrics that give you insight into your application’s performance, such as:

  • How often is method X called?
  • What’s the current health of your application?
  • How long does it take to process request Y?

In this way, it’s a great tool for monitoring details in your application, details that might help when trying to find performance issues or debugging your application in general. It’s a tool for people who have deep insights into the details of their code — so it’s a tool for the developers.

Your operations team might also benefit from some of the collected metrics, but in most cases they’ll measure things one level above your application. For your management, the metrics collected with StatsD will most likely be far too detailed.

One big advantage of StatsD is that you, as a developer, decide what data you need and which metrics are indexed. You specify that in your code, in the same repository, so the application code and the metrics share the same git history. The metrics you want to collect are specified explicitly: you don’t have the overhead of defining patterns for grepping log entries, and you don’t have to worry about the format of your log output.

Finally, it should be mentioned that StatsD will never replace your log files. In some cases, you’ll still need to find out what happened exactly, to read through the history of some events, and in these cases the aggregated data from StatsD is not sufficient.
