Implementing Service Level Objectives

I picked up a new book recently that is a great companion to the other SRE books published by O’Reilly.

This book[1] goes much deeper into defining and refining SLOs than the chapters devoted to the subject in the original SRE books.[2][3]

SLOs are not a panacea, but they can be central to defining what level of reliability is good enough and when to pivot from new features to reliability work. Developers and operations used to be at odds: developers wanted to get “new stuff” into production, while operations was concerned with stability and “the devil you know...”. Stability and reliability used to be primarily an “operations problem” and, as far as developers were concerned, “not my problem”.

SLOs and error budgets are a means by which both sides can work together as a team to agree on and maintain a level of reliability that is acceptable to users and customers. They bridge the gap and make reliability everyone’s concern.
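To make the error budget idea concrete, here is a minimal sketch of the arithmetic, assuming a simple request-based SLO. The function names and the numbers are my own illustration, not taken from the book.

```python
def error_budget(slo_target: float, total_events: int) -> float:
    """Number of bad events the SLO tolerates over the window."""
    return (1.0 - slo_target) * total_events


def budget_remaining(slo_target: float, total_events: int, bad_events: int) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    budget = error_budget(slo_target, total_events)
    if budget <= 0:
        # A 100% target or an empty window leaves no budget to measure against.
        raise ValueError("SLO target and event count leave no error budget")
    return 1.0 - bad_events / budget


# Hypothetical numbers: a 99.9% target over 1,000,000 requests tolerates
# about 1,000 failures; 250 failures so far leaves roughly 75% of the budget.
remaining = budget_remaining(0.999, 1_000_000, 250)
print(f"{remaining:.0%} of the error budget remains")
```

When the remaining fraction approaches zero, that is the signal both sides agreed on in advance for shifting effort from new features to reliability.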

None of this is news to anyone who has read the SRE books, but many companies have done little more than rename their operations team to an “SRE Team” without laying the right groundwork. This book dives into the practical aspects of developing meaningful SLIs and SLOs and getting buy-in from stakeholders.

It is a must-have for anyone working on the reliability of their services.

[1] Implementing Service Level Objectives

[2] “Service Level Objectives,” Chapter 4 in B. Beyer, J. Petoff, N. R. Murphy, and C. Jones, eds., Site Reliability Engineering: How Google Runs Production Systems (O’Reilly Media, 2016).

[3] “Implementing SLOs,” Chapter 2 in B. Beyer, N. R. Murphy, D. K. Rensin, K. Kawahara, and S. Thorne, eds., Site Reliability Workbook: Practical Ways to Implement SRE (O’Reilly Media, 2018).