The current state of confusion around what a “Site Reliability Engineer” (SRE) role is
Look around at job descriptions today and you might be forgiven for being confused about what SRE means and how it relates to “DevOps”, “systems engineer”, “platform engineer” and other titles. While Google should be congratulated for the original books describing how they created SRE, those books were largely about how Google implemented it. Part of the confusion may come from the fact that SRE can be implemented in many different ways depending on the size and structure of the organization, but the principles (should) remain the same.
Ben Treynor is credited with coining the term and describes it this way:
“Fundamentally, it’s what happens when you ask a software engineer to design an operations function.”
As I see it, many of the principles that both the DevOps philosophy and the SRE role adhere to can be derived from:
“Accelerate the delivery of systems into production, with confidence, while protecting the user experience.”
Some of my other favourite principles:
- Automate repetitive grunt work and complicated procedures.
- Observability; we can’t fix what we can’t measure.
- Protecting the user experience by developing SLOs that engineering and customers can live with.
- K.I.S.S – simple gets done faster and done beats best.
- No such thing as “dumb questions”, share the knowledge.
- Self-correcting problems and short feedback loops where possible.
- Doing more with less. “Performance Engineering!!”
- You can’t remove bottlenecks, but you can often “move” them somewhere else to gain performance.
- Small, frequent changes are best.
- Avoid surprises in production.
- Reduce MTTR and reduce the cost of failure. Hold post-incident reviews (PIRs).
- During incidents, mitigate customer experience first.
- Learning from mistakes, incidents and, wherever possible, from others.
- Read, research, read some more. It’s cheaper when done up front.
- Listen and make sure the right problem is being solved.
- Assume nothing, follow the data.
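To make the SLO and MTTR points above a little more concrete, here is a minimal Python sketch of the arithmetic. Everything in it is an assumption for illustration: the incident timestamps, the 30-day window and the 99.9% target are made up, and it treats every incident as a full, non-overlapping outage, which real systems rarely are.

```python
from datetime import datetime, timedelta

# Hypothetical incident records as (detected, resolved) pairs.
# These are made-up values for illustration, not real data.
incidents = [
    (datetime(2024, 1, 3, 10, 0), datetime(2024, 1, 3, 10, 45)),
    (datetime(2024, 1, 17, 2, 30), datetime(2024, 1, 17, 3, 0)),
    (datetime(2024, 1, 25, 14, 0), datetime(2024, 1, 25, 14, 15)),
]


def mean_time_to_restore(incidents):
    """Average time from detection to resolution across incidents."""
    if not incidents:
        return timedelta(0)
    durations = [resolved - detected for detected, resolved in incidents]
    return sum(durations, timedelta()) / len(durations)


def availability(incidents, window):
    """Fraction of the window not spent in an incident.

    Assumes incidents do not overlap and each one is a full outage.
    """
    downtime = sum((resolved - detected for detected, resolved in incidents),
                   timedelta())
    return 1 - downtime / window


window = timedelta(days=30)   # assumed measurement window
slo_target = 0.999            # assumed availability SLO

mttr = mean_time_to_restore(incidents)
avail = availability(incidents, window)

print(f"MTTR: {mttr}")
print(f"Availability: {avail:.5f} (target {slo_target})")
print("SLO met" if avail >= slo_target else "SLO missed")
```

With these made-up numbers the 90 minutes of downtime exceeds what a 99.9% target allows over 30 days, which is the sort of data that should drive the conversation between engineering and customers.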
Without the seemingly endless resources of Google, Facebook or Netflix to blaze trails, adopting new tooling usually means doing research first. It’s rare to be “first” in solving any systems problem. Learning from the triumphs and failures of other organizations, from people online and through meetups is usually cheaper and quicker than first-hand experience.
Chapter 12 in Seeking SRE has dozens of definitions for SRE and DevOps by different people in the industry. The whole book is about how SRE can be implemented by companies other than Google.
How is your organization implementing SRE?
Would you like some help with your SRE team? I am currently looking for my next SRE role and you can contact me via email: billduncan-blog (at) servermetrix dot com.
Read the Google books here: https://landing.google.com/sre/books/