In both aviation and systems we build in redundancies wherever practical to avoid unpleasantness when components or subsystems fail. Continue reading “SPOFs and Partial Panel”
In previous posts, I’ve emphasized that averages are particularly bad at characterizing most things that you might be looking for. However, storing aggregated data of any type can limit your ability to analyze data later. Continue reading “Don’t Aggregate, Consolidate!”
Do you have insatiable curiosity and are driven by a relentless pursuit of the truth? You might make a great problem solver, but be careful how you deal with your findings! Continue reading “There’s Always a Problem”
If you’ve been around systems long enough, you know that opportunity for performance gains goes up dramatically, the further up the stack you look.. Continue reading “Look Up the Stack!”
For years I’ve done most of my log scraping and analysis with the usual suspects; bash, sed, awk, perl even. The log scraping still uses those tools, but lately I’ve been toying around with “R” for the analysis. Continue reading “(Ab)use of the R Language”
While working at RIM, I had the privilege of working with some brilliant engineers. During that time I developed a few of the techniques that I’ll be describing; the EPD (Event-Pair-Difference) graph described in my previous post and the EPL (Event-Pair-Latency) Dotplot are a few of them. Continue reading “Deep Dive, EPL Dotplots”
System failures are often not black and white, but shades of grey (gray?)..
Detecting and alerting on “performance-challenged” system components are a lot more difficult than detecting black or white (catastrophic failures). The metrics used are usually of the “time vs. latency” or “time vs. event count” variety, often aggregated and, often by using averages. All of these tend to obscure what we are looking for and have a very low “signal to noise ratio“.