The problem domainAt willhaben we currently deal with around 120.000 new ads and around 200.000 ad changes per day. For this data we need to make sure that the data adheres to our quality standards and to our terms & conditions, e.g. the ad must be placed in the right category. Also we want our users to be fair to others and not reinsert an ad in order to be pushed to the top of the search result.
For this reason we've developed a system where the people in the quality assurance department can define rules which implement the current quality requirements. The rules can be created, changed and deleted at runtime, making the system very flexible for rule editors because no developers or deployments are needed for changing business rules.
Within the rules, a rule service provides access to the underlying datastore where we need to run realtime checks on the data we have. By realtime checks we understand queries that are able to return within milliseconds. The goal is to provide proper results at insertion time. Also, as rules change in the future, we need to be able to look at historical data from a different perspective in the future where we don't know the actual requirements for it yet.
Event Sourcing and CQRS in a nutshellEvent Sourcing is an architectural pattern which introduces the storage of state changes rather than state. A certain state in the data store must be computed out of a series of changes, also called events. Events are stored in an event store where an event handler picks them up and puts them to a read store. The read store typically has a read model optimized for the application.
Event Sourcing was the topic of Greg Youngs keynote at the conference Voxxed Days Vienna 2015. Having attended the talk, thoughts in my head changed. After almost 10 years of implementing the typical 3-tier approach in Java Enterprise Applications I was just fascinated by the content of Greg Youngs keynote. He emphasized that in a typical scenario, your application uses one database and it is used to store primarily state. For instance if you store user data and a user changes the name, the name will be updated and you loose the information that led to this state. He stated that it's much better to store all events that lead to a change of the state in a way that you can create state out of it. For that it is essential that those events are immutable and cannot be deleted. If a faulty event occurred, a compensating event has to be created. He brought the example of accountants who do the same. They have their journal of immutable invoices and documents which create a certain state on the accounts.
Having learned accounting in business school, this all became clear and obvious to me. Digging deeper into the topic of Event Sourcing, you'll quickly find that Martin Fowler approached this topic around 10 years ago. Also, in conjunction with Event Sourcing you'll find CQRS (Command Query Responsibility Segregation) which is basically about using a different way to store data than reading the data. The command part of CQRS is responsible for writing data whereas the query part is used to read data. This makes perfect sense in case you want to scale reading differently than writing.
System designHowever, as we started to redesign our product in terms of persistence layer and underlying datastore, we had the chance to refine our strategy on storing and reading data. We realized quickly that we can solve the requirement of having to index historical data (back to a certain number of days) in different ways with the Event Sourcing approach.
We decided to use a relational database (PostgreSQL) as event store with json data types to be flexible regarding the event data model. We treat the event store as single point of truth, thus the ACID properties of a relational database help us to implement that properly.
The event tables have the following schema:
We have one table (event_data) that stores the data which itself will be stored as json to remain flexible regarding changes on the client side. The table also has a foreign key to a table (event_definition) storing some meta data about the event type. The other table (event_data_indexation) shall keep track of the indexation state.
As read store, we decided to use Elasticsearch for now, knowing that we could add another read store to the system without having to change the architecture. Elasticsearch is an open source analytics and full text search engine. It is based on Apache Lucene and organizes its data in so called indexes. High performant search capabilities are provided through different interface mechanisms.
The event handler is a standalone application which reads events from PostgreSQL, transforms them to domain objects and indexes them to Elasticsearch. The application only reads from Elasticsearch, it does not read from the event tables.
With this kind of application design we get an eventual consistent approach for our data because an event is not immediately indexed after it is stored. Eventual consistency means that the data stored is not available for searching right after the write operation. This approach is also driven by Elasticsearch and it's bulk indexing API.
Current statusWe finished the first use case with this approach recently and are currently working on some refinements on the event handler regarding index management in the read store. So we currently handle one type of event with event sourcing and CQRS. We have chosen a - in willhaben terms - mid-sized use case to be able to get into that approach without being overburdened by our most complex use case.
There are already a couple of - let's call them - challenges on our radar which we will have to address soon.
- We need to consider storing event dependencies for fast retrieval of related events
- Validation of event data is more complicated then with a typical data storage, because the event store is normally not designed to run a data validation and the read store is eventual consistent.
- Special attention has to be paid to concurrent updates triggered by users. This is often solved with Optimistic or Pessimistic concurrency control but in an Event Sourcing approach, this is different.
We've used the replay mechanism already for a reason we did not actually plan. For a migration of the Elasticsearch cluster to new hardware we were able to easily configure our event handler to run a replay of all the events but with the new Elasticsearch cluster as target.
The business people are currently not fully happy with eventual consistency, so we expect changes in the design soon.