Some thoughts about handling critical system issues at scale.
As far as availability goes, we don’t have much control over MTBF. Things break. We have more control over MTTR, which really includes the time to detect and respond to failures as well.
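To put rough numbers on that, here is a minimal sketch using the standard steady-state formula, availability = MTBF / (MTBF + MTTR). Splitting MTTR into detect, respond and repair times is my own illustrative breakdown, and all the figures are hypothetical.

```python
def availability(mtbf_hours, detect_hours, respond_hours, repair_hours):
    """Steady-state availability = MTBF / (MTBF + MTTR).

    MTTR is modeled here as detect + respond + repair time,
    which is an illustrative breakdown, not a standard one.
    """
    mttr_hours = detect_hours + respond_hours + repair_hours
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical service that fails about once every 30 days (720 hours).
MTBF_HOURS = 720.0

# Same failure rate and repair time; only detection and response get faster.
slow = availability(MTBF_HOURS, detect_hours=2.0, respond_hours=1.0, repair_hours=1.0)
fast = availability(MTBF_HOURS, detect_hours=0.25, respond_hours=0.25, repair_hours=1.0)

print(f"slow detection: {slow:.4%}")
print(f"fast detection: {fast:.4%}")
```

With MTBF held constant, the only lever in this model is the denominator, so faster detection and response show up directly as higher availability.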
There are several goals in any incident response:
- Primary goal is the customer experience and trumps all others.
- Communicate to the customers if service is or was impacted.
- What can we learn from the incident? What makes it unique and interesting?
- How can we improve our response to incidents? Hindsight is always humbling. I can’t think of any incident I’ve handled where I didn’t think, “Why didn’t we see this sooner?” (Hopefully the ones I can’t recall were less memorable because we handled them more quickly.)
- ...and of course, fix it!
If you are involved in incident response, here are some pointers that I’ve learned over the years.
- Don’t panic! We think more logically when we remain calm. Encourage focus.
- As I’ve pointed out here, the first priority is to keep the airplane flying (or the service servicing). Find a way to mitigate the problem and save the customer experience as much as possible.
- Draw boundaries around the issue: timing, scope and severity.
- Communicate and call on the expertise required.
- Encourage fact finding to develop the theories and anti-theories with everyone.
- Use whatever methods are available to record what transpires, including things that look OK and theories that were discarded.
- Try not to trample on any evidence that might be used later to dig further.
Some (not exhaustive) techniques I’ve used in drilling down to fix issues:
- What changed? Recent rollout history? Nothing happens randomly except stray gamma rays.
- Bisection: look at what is working well, as well as what’s not. Find the boundaries.
- Any similar issues in the past? Look up old RCAs and tickets.
- In fact, recent past tickets may have clues.
- Artificial ignorance; look for log lines or other events that we don’t normally see. Basically, filter out “normal” and what you should be left with is abnormal.
- Communicate and encourage people to keep looking and reporting what they see, including normal stuff.
- Look for simple explanations.
- Narrow down the timing of the incident start, as this may give clues to cause.
- Develop some theories and counter theories.
- Process of elimination, rule some theories out.
- Test a hypothesis if it is practical and won’t interfere with the customer experience.
- Top-down approaches sometimes work: drill down from the applications.
- Assume nothing, follow the data.
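As an illustration of the artificial ignorance idea from the list above, here is a minimal sketch of a log filter. The "normal" patterns and the sample log lines are made up for the example; in practice you would build the pattern list from what your own systems emit on a quiet day.

```python
import re
from collections import Counter

# Hypothetical patterns for log lines we consider "normal".
NORMAL_PATTERNS = [
    re.compile(r"GET /healthz .* 200"),
    re.compile(r"session (opened|closed) for user \w+"),
    re.compile(r"cron job \S+ finished"),
]

# Used to mask digits so lines that differ only in numbers group together.
DIGITS = re.compile(r"\d+")


def abnormal_lines(lines):
    """Yield lines that match none of the known-normal patterns."""
    for line in lines:
        if not any(p.search(line) for p in NORMAL_PATTERNS):
            yield line


def summarize(lines):
    """Count abnormal line shapes (digits masked as N), rarest first."""
    counts = Counter(DIGITS.sub("N", line.strip()) for line in abnormal_lines(lines))
    return sorted(counts.items(), key=lambda item: item[1])


# Made-up sample log for demonstration.
SAMPLE_LOG = [
    "10:01:02 GET /healthz HTTP/1.1 200",
    "10:01:03 session opened for user alice",
    "10:01:04 upstream timeout after 3000 ms",
    "10:01:09 upstream timeout after 3001 ms",
    "10:01:10 disk /dev/sda1 remapped sector 88",
    "10:01:11 cron job rotate-logs finished",
]

for shape, count in summarize(SAMPLE_LOG):
    print(count, shape)
```

Sorting rarest first is a design choice: a line shape seen only once is often more interesting during an incident than a noisy one seen thousands of times, though both are worth a look.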
The approaches you use may vary depending on the tools, instrumentation and the specific circumstances.
What works for you?