How to do a complex infrastructure migration during a pandemic

Published in willhaben Tech Blog, Mar 29, 2021

willhaben is Austria’s largest marketplace app with more than 3.6 million users per month. Our applications are used 7 days a week, 24 hours a day. Service reliability is a critical factor for our business.

A while back we decided to move our whole infrastructure to a new provider, a cloud service provider locally based in Austria. To put it into numbers, we had to migrate about:

250 virtual machines
50 bare metal servers
50 TB of data in many different datastores
20 databases on different clusters/systems
>100 different applications and services running on k8s, VMs or bare metal

from two source data centers to two destination data centers at a new service provider, with a third data center where future backups would be located after the migration.

All without affecting user traffic of:

>70m visits/month with
>30bn https requests/month

On top of that, we could not heavily disrupt other important ongoing development projects, and we had to allow continuous delivery of new features during the migration period. Challenge accepted.

The migration was not just a lift and shift. Basically we

rebuilt infrastructure and systems mainly with existing and adapted automation (infrastructure as code)
got rid of some technical debt from an infrastructure that had been growing for over 10 years while keeping a balance between refactoring and migration work
got better equipped for reliable and secure operations with cloud services for infrastructure (IaaS) in combination with housing of bare metal servers and the ability to get direct local support

The Plan

We had planned to do a step-by-step migration, not a big bang. Inevitably some steps involved “big bang” changes, but overall we can call it step-by-step.

It was the end of 2019 when we entered the planning phase. We structured the migration into high-level phases, which we called waves:

Wave 0 — Core architecture

The first step included necessary preparation work, including the design of network architecture and implementation, infrastructure architecture, capacity planning and service migration planning.

In this phase we created a reliable migration link between the old and the new provider including the necessary networking and routing setup in order to allow hybrid operations as the basis for further migration waves.

Wave 1 — Migration of highly dependent systems

Based on a dependency mapping, we identified the systems that most applications and services depend on. These were mainly databases, data stores, search technologies, message brokers and command & control services.
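
As an illustration of what such a dependency mapping can look like, here is a minimal sketch that ranks systems by how many other systems depend on them, directly or indirectly. It is not our actual tooling: the function name `rank_by_dependents` and all service names in the example are hypothetical.

```python
from collections import defaultdict


def rank_by_dependents(depends_on):
    """Rank systems by the number of systems that depend on them.

    `depends_on` maps a service name to the list of systems it uses.
    Indirect (transitive) dependents are counted as well.
    """
    # Invert the graph: for each system, collect its direct dependents.
    dependents = defaultdict(set)
    for service, deps in depends_on.items():
        for dep in deps:
            dependents[dep].add(service)

    def all_dependents(system):
        # Depth-first walk over the inverted graph.
        seen = set()
        stack = [system]
        while stack:
            current = stack.pop()
            for dependent in dependents.get(current, ()):
                if dependent not in seen:
                    seen.add(dependent)
                    stack.append(dependent)
        return seen

    systems = set(depends_on) | set(dependents)
    counts = {system: len(all_dependents(system)) for system in systems}
    # Most depended-on systems first; ties are broken alphabetically.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical inventory, for illustration only.
EXAMPLE = {
    "web-frontend": ["search-api", "user-db"],
    "search-api": ["search-index", "message-broker"],
    "notification-service": ["message-broker", "user-db"],
}

ranking = rank_by_dependents(EXAMPLE)
```

In this made-up inventory the message broker ends up at the top of the ranking, which would make it a candidate for an early wave.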

Wave 2 — Migration of the majority of application systems

While user facing traffic was still served via load balancers from the old data centers, application systems could be migrated step by step by creating additional instances at the new provider and gracefully shifting over traffic before decommissioning old instances.
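
The gradual shift described above can be sketched as a simple loop: raise the share of traffic sent to the new instances in steps and stop at the last healthy step if the error rate gets too high. This is only a sketch under assumptions. The function `shift_traffic`, the step sizes, the 1% error threshold and the `measure_error_rate` callback are hypothetical and stand in for whatever load balancer and monitoring are actually in place.

```python
def shift_traffic(weights, measure_error_rate, max_error_rate=0.01):
    """Increase the traffic share of the new instances step by step.

    `weights` is an increasing sequence of percentages for the new provider.
    `measure_error_rate(weight)` is a hypothetical callback that applies the
    weight and returns the error rate observed at that weight.
    Returns (last_good_weight, completed).
    """
    last_good = 0
    for weight in weights:
        observed = measure_error_rate(weight)
        if observed > max_error_rate:
            # Stay at the last healthy weight and let people investigate.
            return last_good, False
        last_good = weight
    return last_good, True


# Simulated measurements, for illustration only.
def healthy(weight):
    return 0.001


def degrades_at_half(weight):
    return 0.002 if weight < 50 else 0.05


full_shift = shift_traffic([5, 25, 50, 100], healthy)
partial_shift = shift_traffic([5, 25, 50, 100], degrades_at_half)
```

Only once a shift has completed and stayed healthy would the old instances be decommissioned.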

Wave 3 — Migration of less dependent systems & external traffic switchover

Less dependent systems are systems that are not highly production-critical, as well as non-production environments.

The waves had some interdependencies, so we also planned for overlaps. One major milestone was switching user-facing traffic over to the new data centers somewhere between wave 2 and wave 3, at a time when the majority of systems were running at the new provider.

Wave 1 (of COVID-19)

Just before we could enter “our” wave 1, the emerging COVID-19 situation hit us hard in March 2020.

The overall situation was challenging for everyone personally, and we also had to deal with challenges from a business perspective. The migration project was heavily affected by COVID-19: data center work was severely restricted, and back in March and April 2020 nobody could predict how the situation would develop. Planning this project became difficult, almost impossible.

Our new approach

To put it simply, we took our surfboards and rode the waves, no matter which one was coming next. If you stop riding, you will sink.

Image source: https://pixabay.com/?utm_source=link-attribution&utm_medium=referral&utm_campaign=image&utm_content=1246560

Agile Project Management

Our original plan got mixed up and the migration waves ended up overlapping completely. Following an agile approach, we continuously examined possible next steps, still guided by the principle of avoiding risk and downtime. We had already applied agile planning principles before COVID-19 started, and this really helped us with planning during difficult times.

Cornerstones were

Plan for frequent migration delivery
The initial technical setup allowed us to plan smaller packages for application or service migrations that could be delivered more autonomously than would have been possible with a more sequential, waterfall-oriented approach.
Ownership & Collaboration
We did not change our team setup for this project, nor did we build a special task force. We worked across teams and distributed the work packages of the migration project according to existing component ownership. This allowed us to stay within established team delivery processes and ownership definitions (teams felt responsible for migrating their components) and to use each team's domain knowledge of the components being migrated. A key factor was establishing good cross-team relationships and collaboration formats for the people working mainly on this project.
Build, measure, learn
During the parallel phase between the old and the new provider, we could prepare new systems for test migrations without affecting user traffic. For most systems it was possible to gradually shift traffic to the new systems and get immediate feedback from production traffic, which enabled fact-based decisions on whether to go forward or to adapt things first.
Take decisions as needed and consider changing requirements even late
A project this big requires a lot of decisions. We made them at the point in time when they were needed and were not shy about reconsidering them when something was not working as expected. Not everything worked on the first attempt, and here too good collaboration between teams helped to keep things moving forward.

Collaboration & Relationship

We maintained good and transparent communication with internal stakeholders. This gave us the necessary management support and understanding of the situation (instead of deadlines getting pushed top-down).

We also have a good relationship with our service providers. That’s part of the willhaben culture — not only internally but also in dealing with external companies we work with. Both the new provider and the old provider (who was losing us as a customer with this move) were flexible and supported us where they could. We assume this would have looked different in the case of a purely contract-driven relationship.

Flow

The essential part is that we kept making progress. First smaller steps, later bigger steps. The key point is that we did not stop and freeze the migration entirely. This kept us moving towards the time when we could take bigger steps and also move hardware. Every week we could celebrate the migration of another component without major downtime.

While this was good for morale, it would not have been possible without a great team of motivated people standing behind this migration approach and working together towards the common goal.

In March 2020 we migrated the first non-production systems, followed by the first production system in April 2020.

In June 2020 we finally started moving bare metal systems, and by that point had already migrated about 75% of non-production and 45% of production systems.

By mid-August 2020 numbers were up to 80% of non-production systems and 70% of production systems migrated. All user facing traffic was served from the new data centers from this point in time, again with no major downtime during migration.

A couple of weeks later we could turn off the last machine at the old provider.

Conclusion

It’s not a new finding that applying agile principles to project planning helps with tackling unexpected impediments. It did not prevent the pandemic from having an impact on our project, but it definitely helped us deal with it and follow the flow until we finished this exciting ride.

And the most important thing is having a great team. Without motivated, experienced and focussed people you cannot do such a migration even without a pandemic. Big thanks especially to the willhaben SRE crew!