Spring Statemachine — The Heart of PayLivery

Published in willhaben Tech Blog, Mar 8, 2021

In August 2019, we here at willhaben started working on a new service called PayLivery. This service manages interactions between a potential buyer and a potential seller, from the first contact to the moment the item is paid for and received by the buyer.

We are in the process of migrating from a monolith to a microservice architecture, meaning that, besides our old monolith, there is an internal separate microservice for payments. We also have an external partner who provides the microservice for delivery tracking. In addition to that, our sister company provides a microservice that is a chat application similar to WhatsApp, which is used for buyer–seller communication.

willhaben is a marketplace for classified ads, so users can enter ads for their items and the ad lifecycle is tracked by the monolith.

The great thing is that there is already a lot of existing software that could help us significantly reduce our time to market. We decided to create an orchestration service that coordinates the communication among existing services. For this purpose, we tried the Spring Statemachine framework.

The rest of the blog post will summarize our experiences in this regard.

Spring Statemachine motivation

Spring Statemachine (SSM) is an implementation of the state machine design pattern offered by the Spring team. Each state machine consists of a set of states and transitions between them.

The first obvious question was this: the state machine is quite a simple design pattern, so do we really need a full-fledged framework? The answer will differ from situation to situation.

In our case, we’re modeling a business process that’s quite complex: it involves a phase during which the buyer and seller must agree on a price, followed by separate phases for payment and actual delivery. Each phase is complex in its own way.

SSM integrates well with Spring Security and Spring Web, and it’s easy to integrate with our observability platform (by implementing logging and Prometheus interceptors). There are a lot of great tutorials on SSM, and the documentation isn’t too bad, so instead of writing yet another tutorial on SSM, I’ll focus here on the problems that we encountered and possible solutions for these situations. I assume you, the reader, know basic state machine concepts like states, transitions, actions, and guards. If not, Google is your friend.

The dual-writes problem

The challenging part is keeping distributed transactions atomic. Imagine a situation where a user has to complete a step that involves several write (non-read) REST calls. If one of those calls fails after an earlier one has succeeded, it creates a data inconsistency that is quite difficult, and sometimes impossible, to fix. This might seem to be a rather unlikely event, but with increasing load and an evolving code base, it happens quite frequently.

The clean way to solve this problem is to allow only one REST call per action. Each transition in the state machine is then a single REST call, and each transition has to have an inverse transition that reverts its effect. In the state machine context, we can store a stack of inverse actions. Once an action has fired, its inverse action is pushed onto the top of the stack. If an exception is thrown, we can take the stack and apply the inverse actions one after another to return to a consistent state.
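The stack of inverse actions can be sketched in a few lines of plain Java. This is only an illustration of the idea, not our production code and not an SSM API: the class `CompensatingRunner`, the `Step` type, and the step names are hypothetical, and the "REST calls" are stand-in lambdas.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical sketch: runs steps in order and undoes the completed ones on failure. */
class CompensatingRunner {

    /** One forward call paired with the call that reverts it. */
    static final class Step {
        final String name;
        final Runnable action;
        final Runnable inverse;

        Step(String name, Runnable action, Runnable inverse) {
            this.name = name;
            this.action = action;
            this.inverse = inverse;
        }
    }

    /** Returns true if every step succeeded; otherwise compensates and returns false. */
    static boolean run(List<Step> steps) {
        Deque<Step> completed = new ArrayDeque<>(); // used as a stack
        for (Step step : steps) {
            try {
                step.action.run();
                completed.push(step); // the inverse is pushed only after the action succeeded
            } catch (RuntimeException e) {
                // Apply the inverse actions in reverse order of execution.
                while (!completed.isEmpty()) {
                    completed.pop().inverse.run();
                }
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        boolean ok = run(List.of(
            new Step("reserveItem", () -> log.add("reserve"), () -> log.add("release")),
            new Step("chargeBuyer", () -> log.add("charge"), () -> log.add("refund")),
            new Step("bookDelivery",
                () -> { throw new IllegalStateException("partner down"); },
                () -> { })
        ));
        System.out.println(ok + " " + log); // prints "false [reserve, charge, refund, release]"
    }
}
```

Note that the sketch assumes the inverse actions themselves never fail. In a real system they are remote calls too, so they would need retries or a manual fallback.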

The above-described solution is an implementation of the Saga pattern using an SSM. However, this solution is quite clumsy, as it requires an SSM with named states for each “transaction”. So, in the end, we decided not to use a generic solution and to solve each pair of dual writes ad hoc. There are better libraries for Sagas out there :).

The rollbackable states and action side effects

Due to technical issues or simple human error, our state machine can get stuck. This is quite a big issue, because the user cannot proceed with their transaction (which usually means they can’t pay or get a refund). A great solution would be to have an undo operation for each transition, as it’s hard to predict which one will fail. With such a feature, customer support could resolve issues without the help of developers.

SSM delegates persisting the state machine context to the underlying persistence framework. In our case, that’s Hibernate, which is great: SSM has only basic auditing support, but with Hibernate Envers we can get an overview of the whole transaction history and introduce the option of rolling back transactions.

In theory, it should now be possible to roll back all our actions. Unfortunately, some of the actions have a side effect: they render a page where users can insert data, and the insert is done in a separate transaction. As of this writing, we haven’t found a solution for this, but it should be technically possible, so it’s only a matter of time.

Scheduled transitions

SSM provides a simple mechanism for timed messages based on the ScheduledExecutorService, which is not bad. However, our use cases include actions involving money transfers, such as automatic refunds after several days, and we couldn’t afford to lose any of them during downtime. We wanted to allow retries and to persist the scheduled messages. Because we run PayLivery on a Kubernetes cluster, we also needed a way to synchronize scheduling among all pods.

We chose Quartz for this. It works really well for us, and we have had no significant problems with it in production.

One has to be careful, though, because it’s also possible to introduce retries within the state machine itself simply by creating a cycle (e.g., a state machine with states A, B, C and transitions A → B, B → C, and C → A forms a cycle, so all actions attached to those transitions can repeat). Which solution is better, Quartz or a cycle, depends on the situation. We decided to use as few cycles in the state machine as possible.
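To make the cycle variant concrete, here is a minimal plain-Java sketch of the A → B → C → A loop. It is an illustration under assumptions, not SSM code: the extra states `DONE` and `FAILED`, the `MAX_ATTEMPTS` limit, and the action names are hypothetical additions. The point it shows is that every pass through the cycle repeats the actions along the way, and that the cycle only terminates if a guard bounds it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

/** Hypothetical sketch of retrying by cycling through states A -> B -> C -> A. */
class RetryCycle {
    enum State { A, B, C, DONE, FAILED }

    static final int MAX_ATTEMPTS = 3; // assumed limit; without a guard the cycle could spin forever

    final List<String> actions = new ArrayList<>();
    State state = State.A;
    int attempts = 0;

    /** Runs until DONE or FAILED; callSucceeds decides whether attempt n (1-based) succeeds. */
    State run(IntPredicate callSucceeds) {
        boolean lastCallOk = false;
        while (state != State.DONE && state != State.FAILED) {
            switch (state) {
                case A:
                    actions.add("prepare"); // repeated on every pass through the cycle
                    state = State.B;
                    break;
                case B:
                    attempts++;
                    actions.add("call#" + attempts);
                    lastCallOk = callSucceeds.test(attempts);
                    state = State.C;
                    break;
                case C:
                    if (lastCallOk) {
                        state = State.DONE;
                    } else if (attempts < MAX_ATTEMPTS) {
                        state = State.A; // the C -> A transition that closes the cycle
                    } else {
                        state = State.FAILED;
                    }
                    break;
                default:
                    throw new IllegalStateException("unexpected state " + state);
            }
        }
        return state;
    }

    public static void main(String[] args) {
        RetryCycle machine = new RetryCycle();
        State end = machine.run(attempt -> attempt == 2); // the second attempt succeeds
        System.out.println(end + " " + machine.actions); // prints "DONE [prepare, call#1, prepare, call#2]"
    }
}
```

Unlike a persisted scheduler job, the attempt counter here lives in memory, so a restart in the middle of the cycle would lose it.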

The error state

This is a must-have once you make the application available to end users (either via a feature toggle or after the first release). One state must be dedicated to errors (call it “support contacted,” if you will). Once the state machine is in this state, a person (let’s call them a “customer support agent”) should be able to fix the damage caused by the error on their own (such as by issuing a refund). It’s a good idea to implement this state before implementing any others in the state machine.
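As a rough illustration of the idea, the following plain-Java sketch parks the machine in a dedicated error state whenever an action throws, and lets a support agent move it on manually from there. The class `OrderFlow`, its states and events, and the `resolve` method are hypothetical names chosen for this example; they are not our real model and not part of SSM.

```java
/** Hypothetical sketch: any failing action moves the machine to a dedicated error state. */
class OrderFlow {
    enum State { OFFER_ACCEPTED, PAID, DELIVERED, SUPPORT_CONTACTED }
    enum Event { PAY, DELIVER }

    private State state = State.OFFER_ACCEPTED;

    State state() {
        return state;
    }

    /** Applies an event; a failure in the action parks the machine in SUPPORT_CONTACTED. */
    void fire(Event event, Runnable action) {
        State target = next(state, event);
        if (target == null) {
            return; // the event is not accepted in the current state, so it is ignored
        }
        try {
            action.run();
            state = target;
        } catch (RuntimeException e) {
            state = State.SUPPORT_CONTACTED;
        }
    }

    /** Manual fix by a support agent; only allowed from the error state. */
    void resolve(State target) {
        if (state != State.SUPPORT_CONTACTED) {
            throw new IllegalStateException("resolve is only allowed from SUPPORT_CONTACTED");
        }
        state = target;
    }

    private static State next(State current, Event event) {
        if (current == State.OFFER_ACCEPTED && event == Event.PAY) return State.PAID;
        if (current == State.PAID && event == Event.DELIVER) return State.DELIVERED;
        return null;
    }

    public static void main(String[] args) {
        OrderFlow flow = new OrderFlow();
        flow.fire(Event.PAY, () -> { });
        flow.fire(Event.DELIVER, () -> { throw new IllegalStateException("tracking unavailable"); });
        System.out.println(flow.state()); // prints "SUPPORT_CONTACTED"
        flow.resolve(State.DELIVERED);
        System.out.println(flow.state()); // prints "DELIVERED"
    }
}
```

Because the error state is reachable from every transition, building it first means that every state added later already has somewhere safe to fail to.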

Conclusion

In my opinion, the two key aspects of software development are predictability and scalability.

Writing a scalable state machine from scratch is quite a difficult task, as the state machine is shared state that has to be synchronized among all your pods. SSM shields you from the low-level technical aspects of scaling it, which means more time can be spent on tasks that actually bring user value.

Predictability is a bit more difficult to justify. Though it will take some time to learn the framework, it also pushes developers to write better and more readable code, which matters even more if you want the developers to commit to a specific deadline.

Coming to our final question: why use SSM and not some other tool? Well, for one, it integrates well with the rest of the Spring ecosystem. It also suits our specific use case, which is fairly complex: since we are dealing with payment and delivery processes, we have more than 30 states and 100+ possible transitions. It’s really lightweight compared to BPMN solutions and is therefore easier to learn. At the same time, it’s a mature and reliable framework that is both easy and fun to use.