As an SRE, I’m very fortunate to have had training as a pilot. There are many similarities to system operations:
- Many redundancies are built into airplanes, reducing the number of single points of failure (SPOFs).
- A walkaround is done on the ground to catch many potential failures before they can happen in the air.
- Checklists are routinely used, so nothing is forgotten. See Checklists and Runbooks.
- Engine run-ups are done on the ground to verify satisfactory engine performance. (Think canary.)
- Emergency procedures are discussed before leaving the ground (going into production).
- Instructors will often do some “chaos engineering” while in the air. In the middle of an exercise, they will pull the throttle (or other stunts) to see how the student reacts to engine failures, fires, instrument failures, etc.
- The first reaction to an emergency always involves “fly the plane”. Adjust the attitude for “best glide” to mitigate the failure. Only then can you go through the steps to possibly fix the cause or communicate.
Flying in the Cloud
The inner ear and intuition do not register slow changes in attitude at all.
A flight instructor will demonstrate this to students by having them try to “fly straight and level” while they are effectively blind, flying with a hood. The student can see neither outside nor the instruments. After about 20-30 seconds, the instructor will have you remove the hood and recover. Invariably, you are in a nose dive and you didn’t know it.
When operating in the cloud, whether in the air or with a flock of instances in data centers that you can’t see, touch or feel, you must rely on your instrumentation. Your intuition alone will not work. Without real-time situational awareness from instrumentation and alerting that you can trust, gravity would prevail all too quickly in the cloud.
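To make the inner-ear analogy a little more concrete, here is a minimal sketch in Python. It is only an illustration, not a real alerting setup: the function names, the latency numbers, and the thresholds are all made up for the example. One check watches sample-to-sample changes (the "inner ear"), the other compares a recent window against a known-good baseline (the "instrument").

```python
from statistics import mean


def step_alerts(samples, max_step):
    """'Inner ear' check: indices where one sample jumps by more than max_step."""
    return [
        i
        for i in range(1, len(samples))
        if abs(samples[i] - samples[i - 1]) > max_step
    ]


def drift_alerts(samples, baseline_len, window, max_drift):
    """'Instrument' check: indices where the mean of the most recent `window`
    samples differs from the baseline mean by more than max_drift."""
    baseline = mean(samples[:baseline_len])
    alerts = []
    for end in range(baseline_len + window, len(samples) + 1):
        recent = mean(samples[end - window:end])
        if abs(recent - baseline) > max_drift:
            alerts.append(end - 1)  # index of the newest sample in the window
    return alerts


# Hypothetical latency in ms: level for 10 samples, then creeping up 2 ms per sample.
latency_ms = [100.0] * 10 + [100.0 + 2.0 * i for i in range(1, 31)]

# No single step exceeds 5 ms, so the step check stays silent.
print("step alerts: ", step_alerts(latency_ms, max_step=5.0))

# The baseline comparison notices the slow climb.
print("drift alerts:", drift_alerts(latency_ms, baseline_len=10, window=5, max_drift=20.0))
```

With this made-up data, the step check never fires even though latency ends up 60% higher than where it started, while the baseline comparison starts alerting partway through the climb. The thresholds here are arbitrary; in practice they would need tuning against your own signals.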