Shifting Reliability to the Left

Andreas Deuschl
willhaben Tech Blog
Sep 30, 2019

--

How we moved our traditional IT Operations team to a state-of-the-art Site Reliability Engineering model to make reliability a matter for everyone, not just operations staff.

Motivation

We had a central IT Operations team with a focus on running production services, working with developers on new features and components and providing engineering productivity tools to developers. As production systems were exclusively managed by IT Operations staff and developers had no access to production, IT Operations was considered to be wholly responsible for the reliability of production services.

While this operations model worked well for some time, we discovered the need for change as our organisation and our infrastructure & application landscape continued to grow.

In spring 2019, we started working on a new model with the following goals:

  • Reliability should be the concern of all development and operations teams
  • Give the operations team a clear focus on managing the operations platform and infrastructure and on working with developers on new features
  • Better align the interests of development and operational teams
  • Allow development teams to work more autonomously

The new SRE team

Our new Site Reliability team consists of:

  • Product SREs: Reliability engineering of features and components in collaboration with development teams
  • Platform SREs: Managing the infrastructure platform and production operations
  • DB: Database development & reliability

We had already started to follow the principles of Site Reliability Engineering before the reorganisation. Renaming the team from IT Operations to SRE reflects our stronger commitment to those principles.

With Product SREs now working directly as part of development teams (scrum teams), we are also following the methodology of the Whole Team Approach.

Product SREs are fully integrated in their scrum teams and bring in the reliability expertise needed to take new features and applications into production reliably, including capacity management, performance testing, provisioning of new systems, deployment and monitoring of applications, incident handling and so on. They act as advocates for reliability in their teams, promoting it as something that concerns every developer rather than just people in an “Ops silo”.

Platform SREs work as a single team on the infrastructure and operations platform, with the aim of allowing development teams to run their applications more autonomously. They focus on platform reliability and production operations. A large part of this platform consists of shared services such as networking, load balancing, databases, Kubernetes, cache infrastructure, search services and so on. The main runtime platform for applications consists of Kubernetes combined with a range of observability tools including Prometheus, Elasticsearch and Grafana, as well as a number of internal engineering productivity tools.

Database developers work closely with development teams and support them by keeping the core database systems reliable. They assist with feature development, review pull requests related to database usage and take care of maintenance tasks.

Collaboration is a crucial factor when distributing responsibility for site reliability. We strive for close collaboration across our SRE teams through regular production meetings, Slack channels for sharing information about incidents and planned changes, and pull requests for infrastructure code. On-call duty is currently a shared responsibility among all SREs, with one SRE on call at any given time.

But has the change been a success? We have now been working in the new structure for three months and have gathered a lot of early feedback from all those involved. Based on that feedback, we can say that yes, so far it is a success. As part of our continuous improvement culture, we are constantly gathering more feedback and taking steps to refine our processes and working practices.
