Modernizing willhaben’s Data Analytics Infrastructure

Stephan Weinwurm
willhaben Tech Blog
5 min read · Apr 9, 2021


Photo by Alexander Sinn on Unsplash

Background

At willhaben, some 421 million ads have been published since the company was founded in 2006. Around 7 million searches are executed daily on willhaben by about 6 million active monthly users, which makes us the largest internet marketplace in Austria.

Until a year ago, our entire reporting and analytics infrastructure was based on a nightly backup-restore of the main production database. This process was augmented with some additional basic click-stream data, aggregated using Apache Spark and stored in Cassandra. In addition, overall traffic data was collected via AT Internet (formerly known as XiTi). This setup worked well for years and helped willhaben make many important decisions along the way.

However, as usage grows, so does the demand for more detailed analyses based on ever-growing amounts of data, pushing the existing analytical infrastructure to its limits. For example, billions of search events could only be analyzed at a superficial level; we could merely fantasize about taking a deep dive into the interesting patterns hidden in this pile of data. More detailed ad hoc analysis required special knowledge of Apache Spark, along with slow iterations of coding, deploying, and running batch jobs, which created a barrier that stopped most people from getting their hands dirty digging into the data.

Furthermore, with progress made on splitting the monolithic setup into smaller, more manageable services following domain-driven design patterns, the simple backup-restore process of the production database will soon no longer be possible.

This led us to think about the future of willhaben’s data and analytics infrastructure. Since we work in close cooperation with Adevinta, one of our two parent companies, which has built a generic data exchange platform, we wanted to leverage the available infrastructure and break up willhaben’s data silo. This would also allow us to benefit from centrally developed global software components we could then integrate into our platform.

Photo by Aaron Burden on Unsplash

Data Warehouse: Snowflake

We started with a broad evaluation of state-of-the-art data engineering and analytics platforms and then focused our evaluation on Snowflake, a modern, up-and-coming cloud data warehouse. After running some proof-of-concept data pipelines, we were impressed with Snowflake's scalability, as well as with features such as Time Travel, which lets you query a table as it existed at any point in the past, and the separation of storage and compute, which allows compute to be billed on a pay-per-use basis.
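
As a small illustration of these two features, here is a hedged sketch in Snowflake SQL; the table and warehouse names are made up for this example:

    -- Time Travel: query the table as it looked one hour ago
    -- (table name is hypothetical).
    SELECT COUNT(*)
    FROM core.ad_events AT (OFFSET => -60 * 60);

    -- Separation of storage and compute: resizing a virtual warehouse changes
    -- compute capacity (and cost) without touching the stored data
    -- (warehouse name is hypothetical).
    ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'LARGE';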

Model Transformation: dbt

For the actual data transformations in the pipeline, we decided to go with dbt, a Python-based data transformation tool that represents the data pipeline as a series of SELECT statements. The tool determines the order in which models have to be updated and also provides features like data quality checks.
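
To give a flavor of what this looks like, below is a minimal, hypothetical dbt model; the model, source, and column names are invented for illustration, and a matching source definition in a sources YAML file is assumed:

    -- models/staging/stg_search_events.sql (hypothetical model)
    -- A dbt model is just a SELECT statement; dbt infers the build order
    -- from the ref()/source() calls and materializes the result.
    {{ config(materialized='incremental', unique_key='event_id') }}

    SELECT
        event_id,
        user_id,
        search_query,
        event_timestamp
    FROM {{ source('data_lake', 'raw_search_events') }}
    {% if is_incremental() %}
      -- On incremental runs, only pick up rows newer than what is already loaded.
      WHERE event_timestamp > (SELECT MAX(event_timestamp) FROM {{ this }})
    {% endif %}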

The data is modeled using the Data Vault 2.0 approach, a modern paradigm for modeling the enterprise data warehouse that allows for agile onboarding of new entities and data sources. To better support Data Vault 2.0, we worked hard to build generic dbt macros that simplify model building.
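
To illustrate the kind of boilerplate these macros generate, here is a sketch of a Data Vault 2.0 hub model; all names are hypothetical, and willhaben's actual macros may look quite different:

    -- models/raw_vault/hub_ad.sql (hypothetical)
    -- A hub holds one row per business key plus a hash key, load timestamp,
    -- and record source; this is the repetitive pattern a generic macro can generate.
    SELECT DISTINCT
        MD5(CAST(ad_id AS VARCHAR)) AS hub_ad_hk,   -- surrogate hash key
        ad_id                       AS ad_bk,       -- business key
        load_dts,                                   -- load timestamp
        record_source
    FROM {{ ref('stg_ads') }}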

We also wanted to ensure high standards during development, so the models are tested with dtspec during the build process: it populates an empty database with test data and verifies the outcome of each transformation in dbt.

Photo by Nick Fewings on Unsplash

BI-Tool: Tableau

At the same time, willhaben’s data analytics team evaluated various BI tools for presenting data to users. As Adevinta already hosts and uses Tableau as its main BI tool, it was a relatively easy decision to move forward with Tableau as well. In general, the separation between the data warehouse and the BI tool allows us to switch to a different tool later if the need ever arises.

ELT-Pipeline: Argo

To store raw data and be able to replay our pipelines if we detect issues, we follow an ELT (Extract-Load-Transform) approach in which raw data is stored unaltered in S3 as it arrives, forming our data lake. On several occasions, when we discovered data quality issues in our pipelines, we were able to drop the affected tables in the data warehouse and do a full reload of all data from the lake. This is a strong property of ELT pipelines that follow a functional approach. A more in-depth look at the functional approach to building data pipelines can be found here.
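
As a hedged sketch of what such a full reload can look like in Snowflake (the stage and table names are hypothetical, and we assume a staging table with a single VARIANT column for the raw JSON):

    -- Because the raw files remain unaltered in S3, a table affected by a
    -- data quality issue can simply be rebuilt from the data lake.
    TRUNCATE TABLE staging.search_events_raw;     -- single VARIANT column

    COPY INTO staging.search_events_raw
    FROM @data_lake_stage/search_events/          -- external stage on S3
    FILE_FORMAT = (TYPE = 'JSON')
    FORCE = TRUE;                                 -- re-load files that were loaded before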

To orchestrate the various steps of the pipeline necessary for processing newly arrived data, we decided to use Argo as the orchestration engine, as it integrates naturally with Kubernetes, which we use across willhaben.

The main steps of the pipeline are:

  • Incrementally load new data from the data lake into the staging areas
  • Run data quality tests to ensure the data is in accordance with our expectations (see the sketch after this list)
  • Execute dbt to transform and insert the freshly arrived data into the core
  • Build the data marts from the core data, which serve as the basis for the Tableau dashboards
  • Refresh dashboards in Tableau to reflect the latest data
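
As an example of the second step, a data quality check can be expressed as a singular dbt test: a SELECT that returns the offending rows, so the test fails as soon as any row comes back. The file, table, and column names below are hypothetical:

    -- tests/assert_search_events_have_user_or_session.sql (hypothetical)
    -- Singular dbt test: the build fails if this query returns any rows.
    SELECT event_id
    FROM {{ ref('stg_search_events') }}
    WHERE user_id IS NULL
      AND session_id IS NULL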

MVP Setup

The pipeline currently runs once a night and loads approximately 20 million rows into the staging area. To monitor the pipeline, we have defined and are emitting operational metrics, such as table and data latency, which tell us whether models have not been built or the data in a table has become stale.
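
As an illustration, a data latency metric for a single table can be computed with a query along these lines (table and column names are hypothetical):

    -- How many hours old is the freshest row in a core table?
    SELECT
        'core.fact_search_events'                            AS table_name,
        DATEDIFF('hour', MAX(load_dts), CURRENT_TIMESTAMP()) AS data_latency_hours
    FROM core.fact_search_events;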

To better visualize our architecture, here is a high-level overview of the full pipeline:

MVP Setup for willhaben’s Data Analytics Infrastructure

We will publish new blog posts in the near future that dive into different aspects of the data warehouse ecosystem. Possible topics include our Data Vault 2.0 models, how we implemented unit tests for all models, data quality tests, and more. Please stay tuned.
