May 15, 2017

From Batch to Stream Processing

At willhaben, a lot of data is passed around between components. At peak times, we handle over 100,000 events every minute. Saving this huge amount of data does not create bottlenecks on modern hardware but carrying out proper (real-time) analysis of data on this scale can become a challenge.

These events sum up to terabytes of data, and getting real-time statistics from them can be very valuable for monitoring both IT infrastructure and business processes. Big Data comes to mind, a buzzword that covers a lot of topics. One of the concepts often seen in this context is stream processing. It is a very extensive topic, so this post will only touch on some of the ideas behind streams and stream processing without going into detail. There are several frameworks for stream processing, such as Apache Spark, Apache Flink, or Kafka Streams, each with its own advantages and disadvantages; a comparison between them is beyond the scope of this post. Here we will focus on Kafka Streams, as it is the simplest one and covers our use cases. We will first cover the basic concepts and then look at the processing pipeline implemented at willhaben.

Prepare your Architecture for Stream Processing

Stream processing can only be done if there is a continuous data stream. As shown in the figure below, at willhaben several systems and clients, such as Android phones or desktop clients, continuously publish events on an event bus. Depending on your current architecture and systems, getting to a system architecture that enables event-based communication can be quite a challenge and can take years. Legacy systems often keep a lot of business logic in stored procedures, and that logic has to be moved into application code before events can be published on an event bus. Unclean component design and software architecture can also make it hard to fire proper events. Furthermore, data quality is king: streaming analysis is worth nothing if the data that comes in is wrong. This is a constant challenge and is never fully solved. Your architecture also needs to fulfill certain delivery guarantees, such as exactly-once delivery, or provide buffering if there is system downtime.

Modern micro-service architectures and already decoupled components make it quite easy to start stream processing. Depending on your system, the preparation step to implement the first streaming job can take 99% of your time. As this is a very complex topic it cannot be covered here. Let’s go on and have a look at the differences between batch and stream processing.

Stream Processing

All the data floating around can be seen as an infinite stream. Instead of running batch jobs every X minutes on static files or a fixed data set to compute metrics, the calculations happen on the fly and continuously. A data stream may be the events flowing in and out of a Kafka topic. A stream processor can deal with that stream directly: it neither stores the events in a separate database nor needs to query the data first. This is very useful for aggregations (how many events have been received in the last 24 hours?) or for filtering data, for example for fraud or anomaly detection. Due to in-memory aggregation, streaming jobs can handle a huge load of data and are horizontally scalable. The figure below shows the main differences between batch and stream processing. For us, the most important one is the low latency: we get our results faster. Now let's explain the most important concept of stream processing: windows.
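To make the "count without storing first" idea concrete, here is a minimal, hypothetical sketch (not any framework's API): a bounded in-memory structure that answers "how many events arrived in the last N seconds?" while evicting everything older, so memory stays proportional to the window, not to the full history.

```python
from collections import deque

class RollingCount:
    """Counts events inside a trailing time window.

    Only timestamps still inside the window are kept, so old events
    are dropped instead of being persisted to a database.
    Timestamps are plain epoch seconds for simplicity.
    """

    def __init__(self, window):
        self.window = window      # window length in seconds
        self.events = deque()     # timestamps, oldest first

    def record(self, ts):
        self.events.append(ts)
        self._evict(ts)

    def count(self, now):
        self._evict(now)
        return len(self.events)

    def _evict(self, now):
        # Drop every timestamp that has fallen out of the window.
        while self.events and self.events[0] <= now - self.window:
            self.events.popleft()
```

With a 10-second window, events at t=0, 5, and 9 all count at t=9; by t=12 the event at t=0 has been evicted and only two remain.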


Windows are a very important concept for stream processing. A window adds a time component to the analysis to answer questions like “How many events have been sent in the last hour?”

Different stream processing frameworks provide support for different window types. The two most important ones are the tumbling window and the sliding window. A tumbling window is non-overlapping, which means that one event can only be part of exactly one tumbling window. The figure below illustrates the concept of a 20-second tumbling window over one minute. It has fixed, half-open time slots of 0 to 20, 20 to 40, and 40 to 60 seconds, so an event at exactly 20 seconds falls into the second window.

Each window contains a start time stamp, an end time stamp, and the records that fall into it. In comparison, a sliding window moves forward continuously, and an event can be part of more than one window at a time.

Stream processing search events to create a top list

Let’s bring this all together and build a top list of keywords that are searched for at willhaben. Our pipeline looks like this:

Users search for keywords and trigger events in the search component that are sent to an event bus. The stream processor listens directly to these events and generates a tumbling window. These windows are also published on an event bus, and a dedicated service writes them to the database. Each record contains the keyword, its search counts, and the window's time stamps.
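As a hedged illustration, such a record could be modeled like this; the field names mirror the columns used in the SQL query later in the post, but the actual schema at willhaben may differ.

```python
from dataclasses import dataclass

@dataclass
class SearchFrequencyRecord:
    # Field names follow the columns of the search_frequency_record
    # table queried later; illustrative only.
    keyword: str
    search_count: int        # how often the keyword was searched in this window
    total_search_hits: int   # how many results those searches returned
    window_start: str        # boundaries of the tumbling window
    window_end: str
```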

Why not save all the searches and just group, count, and sort them? Because of the scale. Saving the search events, which add up to a couple of hundred million a month, is not the problem; Big Data stores handle these volumes really well. What they cannot handle well are operations like group and count: either they do not provide these features at all, they are not suitable for online applications, or they are made for background batch processing. Furthermore, most stream processing frameworks cannot sort either. These limitations stem from the nature of the framework design.

If you want a real-time top list of your keywords, stream processing can be a very important part of the pipeline. In our case, we use Kafka Streams to aggregate the top keywords every hour. Instead of storing millions of records per day, we just store the top 5,000 keywords every hour, which sums up to 120,000 records a day, a volume a Postgres database can handle quite well. Using these records, we now have a historical store of the top keywords and can use them for analysis. Want the top keywords for the last 24 hours? It's as easy as this SQL query:
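The hourly roll-up itself is conceptually simple. This is a hedged sketch, not the actual Kafka Streams job: given all keyword events that fell into one tumbling window, keep only the N most frequent keywords and discard the rest.

```python
from collections import Counter

def top_keywords(events, n=5000):
    """Reduce one window's worth of raw keyword events to its top-N list.

    events: iterable of keyword strings from a single tumbling window.
    Returns [(keyword, search_count), ...] sorted by count, descending.
    """
    counts = Counter(events)
    return counts.most_common(n)
```

Only this small result per window is persisted, which is what keeps the daily volume at 5,000 records per hour instead of the full event stream.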

select keyword, sum(search_count), sum(total_search_hits)
from search_frequency_record
where window_end > '2017-05-01 14:36:53.235'
group by keyword
order by sum(search_count) desc
limit 5000

No Big Data storage or a heavy Spark cluster installation needed.


Streaming jobs require a new way of thinking about processing data. Instead of running heavy batch jobs and shuffling a lot of data, streaming jobs provide a way to achieve near real-time data transformations and analytics. This can be an advantage from a technical point of view, such as better resource allocation due to a lower but more stable demand on resources, and from a business point of view, as in being able to react to events without delay. Stream processing is not only for Big Data environments; small-scale environments can also profit from implementing these concepts.
