Securing the monolith

Marko Jurišić
willhaben Tech Blog
6 min read · Dec 17, 2020


Photo by David Boca on Unsplash

Security is hard. Extending legacy software is sometimes even harder. But doing both at once while meeting deadlines, maintaining code quality, and staying sane is the hardest of all.

The central part of our architecture is an almost 20-year-old Java monolith. Software years are like dog years, especially in today’s fast-changing world where so much can happen in just a few months.

One of the long-desired new features of our system was support for peer-to-peer transactions and wallet functionality. Since these involve real money, we had to invest additional effort into bringing the entire system’s security to a proven, industry-standard level.

Our main constraint was that the system as a whole should work as before, and that users who were already logged in must never be logged out, since some of them would probably not be able to log in again because of forgotten passwords and expired email addresses. There is an old German saying, “Everything stays better,” and we adopted it as our project’s motto.

From self-made security to industry standard

Our system’s security was not actually bad in the first place (e.g., the passwords were salted and hashed using a Blowfish-based algorithm), but it had some quirks, so the decision was made to move away from the in-house implementation and use an industry standard.

We evaluated several standard solutions and decided to go with Red Hat SSO, a supported version of the open source Keycloak project. Keycloak is actively developed by Red Hat and the open source community, but it does not offer patches for released versions; instead, fixes arrive by upgrading to the next version. By contrast, Red Hat SSO offers both regular security patches and technical support.

Monolith integration

The first steps of integration were easy. Setting up the development environments and configuring Keycloak for local development and Red Hat SSO for integration testing in a production-like environment was straightforward. We defined the first clients and roles, created some test users, and could already log in to our monolith using single sign-on. Success!

We had some non-standard behaviors in the monolith that we tried to challenge, but eventually we had to keep them, due to business constraints. This meant we had to find some workarounds to keep the exact same functionality.

One example is the registration form. By default, Keycloak asks the user for a password and a password confirmation on the registration page, whereas our registration process has just one password field (to make registration as straightforward and streamlined as possible). Enter the SPI (Service Provider Interface), a Keycloak concept that allows us to extend and modify different parts of the Keycloak system. In this case, we customized the user registration validation so that a single password field is sufficient.
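To make the idea concrete, here is a minimal, framework-free sketch of what "one password field is enough" means as a validation rule. It deliberately does not use Keycloak's SPI types; the class name, the field names `email` and `password`, the error keys, and the minimum length are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: plain Java, not Keycloak's actual SPI interfaces.
final class SinglePasswordRegistrationValidator {

    // Assumed minimum length, purely illustrative.
    static final int MIN_PASSWORD_LENGTH = 8;

    /** Returns validation error keys; an empty list means the form is valid. */
    static List<String> validate(Map<String, String> form) {
        List<String> errors = new ArrayList<>();
        String email = form.getOrDefault("email", "").trim();
        String password = form.getOrDefault("password", "");

        if (email.isEmpty() || !email.contains("@")) {
            errors.add("invalidEmail");
        }
        if (password.length() < MIN_PASSWORD_LENGTH) {
            errors.add("passwordTooShort");
        }
        // Deliberately no comparison against a confirmation field:
        // the registration form only has one password input.
        return errors;
    }

    public static void main(String[] args) {
        Map<String, String> form =
                Map.of("email", "user@example.com", "password", "correct horse");
        System.out.println(validate(form)); // prints []
    }
}
```

In a real deployment, a rule like this would live inside a custom provider registered with Keycloak rather than in a standalone class.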

Keycloak also provides a way to define themes for different client applications, so we could implement the login screen as it was before. The users should not notice any differences from the previous system: everything had to work in the same way as before, just better and more securely. In the end, exactly one user called our support line to ask if it was okay that he was forwarded to a single sign-on subdomain to log in. We were really happy that someone had finally noticed our months of hard work!

User migration

After the initial user migration from the legacy database to the new system, we had to continuously synchronize the data between the two systems. Kafka proved to be invaluable for this purpose, since it provides high-throughput, low-latency stream processing out of the box. We also wrote a few tools that continuously check for and log any inconsistencies, and we display the results on a Grafana dashboard.
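The kind of consistency check described here can be sketched in a few lines. The example below is an assumption-laden illustration, not the actual tooling: it treats each system as a simple map from user ID to email address and reports IDs that are missing on one side or whose emails differ. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical sketch: both systems are modeled as userId -> email maps.
final class UserConsistencyCheck {

    /** Compares both snapshots and returns one finding per inconsistent user ID. */
    static List<String> findInconsistencies(Map<String, String> legacy,
                                            Map<String, String> sso) {
        List<String> findings = new ArrayList<>();
        // Union of all IDs, sorted so that the report is deterministic.
        TreeSet<String> allIds = new TreeSet<>(legacy.keySet());
        allIds.addAll(sso.keySet());

        for (String id : allIds) {
            String legacyEmail = legacy.get(id);
            String ssoEmail = sso.get(id);
            if (legacyEmail == null) {
                findings.add(id + ": missing in legacy");
            } else if (ssoEmail == null) {
                findings.add(id + ": missing in SSO");
            } else if (!legacyEmail.equalsIgnoreCase(ssoEmail)) {
                findings.add(id + ": email mismatch");
            }
        }
        return findings;
    }

    public static void main(String[] args) {
        Map<String, String> legacy = Map.of("1", "a@example.com", "2", "b@example.com");
        Map<String, String> sso = Map.of("1", "a@example.com", "2", "c@example.com");
        findInconsistencies(legacy, sso).forEach(System.out::println);
    }
}
```

The size of the returned list is the kind of number that could be logged periodically and plotted on a dashboard.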

Challenges

One of the main problems was that we had to work on a live system that was being actively developed while we were integrating these new security features. Although we used Unleash feature toggles, there were some collisions in the code: we stepped on the toes of other teams a few times, and vice versa. It was practically impossible to release an MVP (minimum viable product) and build from there, because we had to be sure that no users would be logged out and that there would be no security breaches. I always said, “As long as we don’t read about our project in the newspaper, I consider it a success.”

Another problem was that the legacy project was so big that nobody knew all of its nooks, crannies, and ancient parts. For example, after we released the new login for the first group of users, some of them reported that the internal customer relationship management software was no longer working. Luckily, it was just a matter of adjusting the iframe permissions. There was another, more interesting issue with single sign-on and Microsoft products: our employees often use Excel reports with links to different parts of the system. Those links would always redirect the user to the login page (although the user was already logged in in the same browser), but copying and pasting the link directly into the browser’s address bar would work just fine. The problem was that browser cookies were disregarded for links opened from Microsoft products, so we had to write an additional filter to circumvent this problem.
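The article does not say how the filter works, so the following is only a sketch of one commonly used workaround, under stated assumptions. Office applications tend to request a link themselves, without the browser's cookies, and then hand the final URL to the browser. If the server answers that first request with a plain placeholder page instead of a login redirect, the browser ends up opening the original link with its cookies intact. The User-Agent fragments, class name, and `Action` values below are assumptions, not the team's implementation.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical sketch of the decision logic such a filter could use.
final class OfficeLinkWorkaround {

    enum Action { SERVE_CONTENT, REDIRECT_TO_LOGIN, SERVE_PLACEHOLDER_PAGE }

    // Assumed User-Agent fragments; verify them against real access logs.
    private static final List<String> OFFICE_MARKERS =
            List.of("ms-office", "microsoft office");

    /** True if the request appears to come from an Office application. */
    static boolean isOfficeClient(String userAgent) {
        if (userAgent == null) {
            return false;
        }
        String ua = userAgent.toLowerCase(Locale.ROOT);
        return OFFICE_MARKERS.stream().anyMatch(ua::contains);
    }

    /** Decides how to answer a request for a protected page. */
    static Action decide(String userAgent, boolean hasValidSession) {
        if (hasValidSession) {
            return Action.SERVE_CONTENT;
        }
        if (isOfficeClient(userAgent)) {
            // No redirect: the browser should receive the original URL.
            return Action.SERVE_PLACEHOLDER_PAGE;
        }
        return Action.REDIRECT_TO_LOGIN;
    }

    public static void main(String[] args) {
        String officeUa = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 10.0; ms-office)";
        System.out.println(decide(officeUa, false)); // prints SERVE_PLACEHOLDER_PAGE
    }
}
```

In a servlet-based application, this decision would typically sit in a filter that runs before the authentication redirect.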

The first big challenge was that we have two big user groups: standard users, who use their email address as their username, and professional users, who are identified by a company ID and a manually assigned username. We tried to consolidate them and have everyone log in via email and password, but we had to give up on this after a few discussions because there were too many duplicate accounts. The most extreme user had 46 different accounts in different companies belonging to the same parent company!

The second big challenge was user storage. We integrated the first applications and wanted to do some load testing before releasing the changes to production. We imported a subset of the user dataset, about 1 million users, into the Keycloak internal database, which made the Keycloak-provided admin interface unusable. The default indices were set so that login worked, but any other user operation, such as searching for a user in the Keycloak admin interface, would take about 20 minutes with just 1 million users (we have about 6 million users in total), because internally Keycloak did a wildcard search on all columns of the user table. In the end, we wrote a user management component that just holds the user data, and Red Hat SSO accesses it via the User Federation SPI.
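A small sketch may help show why a dedicated user store can stay fast where a wildcard search over every column does not. The idea is to look users up only by exact keys: by email for standard users, and by company ID plus username for professional users. Everything below (class name, method names, key format) is hypothetical and uses in-memory maps in place of a real database with indices; it does not show the User Federation SPI itself.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: exact-key lookups instead of scanning all columns.
final class UserDirectory {

    private final Map<String, String> userIdByEmail = new HashMap<>();
    private final Map<String, String> userIdByCompanyLogin = new HashMap<>();

    void addStandardUser(String userId, String email) {
        userIdByEmail.put(normalize(email), userId);
    }

    void addProfessionalUser(String userId, String companyId, String username) {
        userIdByCompanyLogin.put(companyKey(companyId, username), userId);
    }

    /** Constant-time lookup on average, independent of the number of users. */
    Optional<String> findByEmail(String email) {
        return Optional.ofNullable(userIdByEmail.get(normalize(email)));
    }

    Optional<String> findByCompanyLogin(String companyId, String username) {
        return Optional.ofNullable(
                userIdByCompanyLogin.get(companyKey(companyId, username)));
    }

    private static String normalize(String value) {
        return value.trim().toLowerCase(Locale.ROOT);
    }

    // Assumes company IDs never contain ':' (for example, numeric IDs).
    private static String companyKey(String companyId, String username) {
        return companyId + ":" + normalize(username);
    }

    public static void main(String[] args) {
        UserDirectory directory = new UserDirectory();
        directory.addStandardUser("u1", "user@example.com");
        directory.addProfessionalUser("u2", "4711", "frontdesk");
        System.out.println(directory.findByEmail("User@Example.com"));
        System.out.println(directory.findByCompanyLogin("4711", "frontdesk"));
    }
}
```

Keeping the two login forms as separate keys also means the same person can hold several professional accounts without any of them colliding.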

Go live

There is a thorough test phase before releasing any feature to our customers, and we had to test even more for this big release. We tested all possible and impossible scenarios and feature toggle combinations.

Using Unleash feature toggles made this much easier, since we could test every feature first with the company’s internal users, and then with an increasing percentage of external users. We switched this on in increments: first for 1% of the user base, after which we checked the logs and asked our support team whether they were getting any complaints, and then in logical increments: 7%, 22%, 42%, 69% -> 100%.
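A percentage rollout of this kind relies on assigning each user to a stable bucket, so that a user who is in the 7% group is still included at 22%. The sketch below illustrates that principle with a CRC32 hash of the user ID. It is not Unleash's implementation, which ships its own rollout strategies; the class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical sketch of a sticky percentage rollout; not Unleash's own code.
final class GradualRollout {

    /** Maps a user ID to a stable bucket in the range 0..99. */
    static int bucketOf(String userId) {
        CRC32 crc = new CRC32();
        crc.update(userId.getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % 100);
    }

    /** Enabled if the user's bucket falls below the rollout percentage. */
    static boolean isEnabled(String userId, int percentage) {
        return bucketOf(userId) < percentage;
    }

    public static void main(String[] args) {
        int[] steps = {1, 7, 22, 42, 69, 100};
        for (int step : steps) {
            System.out.println(step + "% -> " + isEnabled("user-12345", step));
        }
    }
}
```

Because the bucket depends only on the user ID, raising the percentage only ever adds users to the enabled group and never removes them.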

We went live in November 2019 with the business users, and in June 2020 with all users, without any major problems.

Future outlook

We have been running stably for a few months with desktop and mobile web integration. We plan to release the integration with our apps in the first quarter of next year, which will allow us to remove the legacy cookies completely.

It’s almost done, and it just works — but better.


professional software craftsman, part time researcher and PhD student, amateur musician and cyclist. currently working for willhaben.at