We were heading back from the practice area to the airport. I didn’t have my pilot’s license yet, and my instructor said: “Push the throttle to Rental Speed!”
The Aviation Connection
Normally “full throttle” is reserved for take-offs in small planes; cruising is done at a somewhat lower power setting to take better care of the engine. An airplane “owner” would never push the engine the way my CFI (certified flight instructor) had suggested, as rebuilding an engine is an expensive proposition.
It’s a bit of a stretch for what I really want to talk about, but if you “Own” an airplane, you are far more careful about the care and feeding of the engine. You will be more familiar with the sound of the engine, the quirks for that specific flying machine. If the engine fails in your car, you can pull off to the side of the road. In the air, surprises like that are even less fun.
On-call vs. Site Ownership
I currently work for a company whose application stack runs in well over a dozen environments across several cloud providers. They each have their quirks, not unlike airplanes.* Cloud providers differ in many ways, and each environment serves a different set of customers, so capacity planning, the capacity on hand, what’s on order, and how to deal with certain issues all differ from one environment to the next.
This poses a challenge for on-call: it is hard to have deep knowledge of all the subtle differences between environments when things go south, especially when the problem could have been prevented.
If we go back to airplanes, there are:
- Preventable, foreseeable issues that can be taken care of on the ground: oil changes, regular maintenance, checks for critical instrumentation, etc.
- Unforeseen issues in the air: engine fires, engine failures, electrical fires and possibly anything you may have missed while the plane was on the ground.
Obviously we want to prevent the second kind (problems in the air) by applying the first (preventative maintenance on the ground). To improve the on-call experience (fewer 3am calls and “surprises”), it also makes sense to do as much as possible during business hours and to be proactive with regular capacity checks, monitoring, alert tuning, etc.
What I’m suggesting is that on-call should “only” be about the unforeseen issues: a switch failure, a DC-wide power failure, hardware failures.
Regular preventative maintenance could (should?) be done by site “owners” who can baby their environments and become fully aware of the nuances and “quirks” of each one. For example: Is there enough spare capacity to handle possible spikes in customer activity within the time it takes to provision more? Have there been back-orders in this environment before? How long do they last?
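To make the spare-capacity question concrete, here is a rough sketch of the kind of check an environment owner might run during business hours. Everything in it is an assumption for illustration: the field names, the generic capacity “units”, the linear growth model and the spike factor are all hypothetical, not a description of any real tooling.

```python
from dataclasses import dataclass


@dataclass
class Environment:
    """Hypothetical per-environment numbers an owner might track."""
    name: str
    capacity_units: float       # capacity on hand (any consistent unit)
    peak_usage_units: float     # highest recent usage observed
    daily_growth_units: float   # assumed linear growth per day
    spike_factor: float         # e.g. 1.5 = allow for a 50% spike over peak
    provision_lead_days: int    # time to get more capacity, incl. back-orders


def projected_demand(env: Environment) -> float:
    """Worst-case demand before new capacity could arrive (assumed model)."""
    grown = env.peak_usage_units + env.daily_growth_units * env.provision_lead_days
    return grown * env.spike_factor


def headroom(env: Environment) -> float:
    """Spare capacity left after the projected worst case; negative is bad."""
    return env.capacity_units - projected_demand(env)


def needs_order(env: Environment) -> bool:
    """True if the owner should order capacity now rather than wait."""
    return headroom(env) < 0


if __name__ == "__main__":
    # Same environment, but one provider has a history of long back-orders.
    quick = Environment("env-a", 100, 50, 1, 1.5, provision_lead_days=10)
    slow = Environment("env-b", 100, 50, 1, 1.5, provision_lead_days=30)
    for env in (quick, slow):
        print(env.name, headroom(env), needs_order(env))
```

The point of the sketch is that the lead time is specific to each environment: the same usage numbers can be fine with one provider and a 3am page waiting to happen with another.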
Having an SRE team split responsibilities for “site ownership” would:
- Be a good learning opportunity for everyone.
- Make someone responsible for capacity planning in every environment; we have a lot of environments, and each one needs it.
- Move things like capacity planning and other preventable work to environment owners. It shouldn’t be “interrupt driven”; it should be something that just gets done during business hours.
- Make environment owners feel “responsible” if on-call got hit with a preventable issue at 3am, so that (hopefully) they would improve and do better next time.
- Reduce noise for on-call, who would deal only with unforeseen issues instead of all the preventable ones too.
- Help with environments located in countries where only citizens of certain countries can touch them. Having the SREs who are allowed to take care of those environments own them would reduce the need for exceptions so that on-call can deal with issues there.
- Let everyone “take pride in” keeping their environments (and on-call) happy!
Even if you only have one environment, rotate the SREs through owning it. They’ll come out knowing a lot more about it, on-call will be less stressful and the environment will likely be happier.
* For example, I recently flew an airplane whose fuel gauge would stick. As the joke goes, the only time you have too much fuel on board is when you’re on fire.