Detecting Performance Issues | bill duncan's blog

The Tail at Scale Revisited

My last article discussed some of the missing math related to setting back-end objectives. This article presents a chart that helps explain how those objectives relate to the user experience, and examines ways to dramatically improve overall performance. Continue reading “The Tail at Scale Revisited”
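As a rough illustration of the kind of math involved (a sketch under stated assumptions, not taken from the article itself): suppose one user request fans out to several back-end calls, and each call independently meets its latency objective with the same probability. The chance that the whole request sees no slow call is then that probability raised to the fan-out. The function name and the example numbers are hypothetical.

```python
def fraction_within_objective(p_backend: float, fanout: int) -> float:
    """Probability that all `fanout` back-end calls meet the objective.

    Assumption (for illustration only): calls are independent and each
    meets the objective with the same probability `p_backend`.
    """
    if not 0.0 <= p_backend <= 1.0:
        raise ValueError("p_backend must be between 0 and 1")
    if fanout < 1:
        raise ValueError("fanout must be at least 1")
    return p_backend ** fanout


if __name__ == "__main__":
    # Hypothetical back-end objective: 99% of calls are fast enough.
    for fanout in (1, 10, 100):
        share = fraction_within_objective(0.99, fanout)
        print(f"fan-out {fanout:>3}: {share:.1%} of user requests see no slow call")
```

Under these assumptions, a 99% back-end objective with a fan-out of 100 leaves only about 37% of user requests untouched by a slow call, which is why back-end objectives look very different from the user's side.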

BPF Performance Tools

BPF is one of the Swiss Army Knife tools for Performance Engineering on Linux. Continue reading “BPF Performance Tools”

Event Logs and A.I.

Many companies in the logging/monitoring space will try to sell you on AI and ML (Artificial Intelligence and Machine Learning) to find what is abnormal. Continue reading “Event Logs and A.I.”

Event Logs and K.I.S.S.

I’ve worked with event logs for, well, decades. There are quite a few companies that offer services for managing logs and, afaik, only a few doing it right. Continue reading “Event Logs and K.I.S.S.”

Reading Week #4

Monitoring the SRE Golden Signals, an excellent overview by Steve Mushero.. Continue reading “Reading Week #4”

Reading Week #2

First of all, Merry Christmas if you celebrate it, Happy Holidays if you don’t! This week’s interesting read is about a subject I love.. Continue reading “Reading Week #2”

Realtime Component Request Deficit

Looking for help naming (and finding other uses for) a novel technique for detecting grey failures. Possible use cases are discussed here: load balancing, finding saturation points, alerting.. [ed. Decided on the name “Saturation Factor”.] Continue reading “Realtime Component Request Deficit”

Don’t Aggregate, Consolidate!

In previous posts, I’ve emphasized that averages are particularly bad at characterizing most things that you might be looking for. However, storing aggregated data of any type can limit your ability to analyze data later. Continue reading “Don’t Aggregate, Consolidate!”
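A minimal sketch of why stored aggregates limit later analysis (the data and the helper name are made up for illustration, not taken from the post): two very different sets of latencies can collapse to the same stored average, after which a question asked later can no longer be answered.

```python
from statistics import fmean

# Hypothetical raw latencies (ms) for two one-minute windows.
steady = [50] * 10          # every request takes 50 ms
spiky = [5] * 9 + [455]     # nine fast requests and one very slow one


def count_over(latencies_ms, threshold_ms):
    """Count requests slower than `threshold_ms`; this needs the raw events."""
    return sum(1 for value in latencies_ms if value > threshold_ms)


if __name__ == "__main__":
    # If only the per-window average had been stored, both windows look the same.
    print("stored averages:", fmean(steady), fmean(spiky))
    # A question asked later can only be answered from the raw data.
    print("requests over 100 ms:", count_over(steady, 100), count_over(spiky, 100))
```

Both windows have an average of 50 ms, yet only the second one contains a request slower than 100 ms. Once the raw events are gone, that distinction is gone too.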

There’s Always a Problem

Do you have insatiable curiosity, and are you driven by a relentless pursuit of the truth? You might make a great problem solver, but be careful how you deal with your findings! Continue reading “There’s Always a Problem”

Look Up the Stack!

If you’ve been around systems long enough, you know that the opportunity for performance gains goes up dramatically the further up the stack you look.. Continue reading “Look Up the Stack!”

(Ab)use of the R Language

For years I’ve done most of my log scraping and analysis with the usual suspects: bash, sed, awk, even perl. The log scraping still uses those tools, but lately I’ve been toying around with “R” for the analysis. Continue reading “(Ab)use of the R Language”

Averages Mostly Suck at Almost Everything..

..unless you’re dealing with baseball. When dealing with systems, many of us think “Average” is a measure of “Typical” or “Normal”. Many systems people will also use averages to look for “Abnormal”. However, average (or mean) doesn’t represent either “normal” or “abnormal” very well.. Continue reading “Averages Mostly Suck at Almost Everything..”
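A small sketch of the point, using made-up latencies (the numbers, the function name, and the choice of a nearest-rank percentile are assumptions for illustration): when a few requests are very slow, the mean lands somewhere that describes neither the typical request nor the outliers.

```python
import statistics


def summarize(latencies_ms):
    """Return the mean, median, and nearest-rank 99th percentile of the input."""
    ordered = sorted(latencies_ms)
    # Nearest-rank p99 using integer arithmetic: rank = ceil(0.99 * n), at least 1.
    rank = max(1, (99 * len(ordered) + 99) // 100)
    return {
        "mean": statistics.fmean(ordered),
        "median": statistics.median(ordered),
        "p99": ordered[rank - 1],
    }


if __name__ == "__main__":
    # Hypothetical sample: 98 requests at 10 ms and 2 requests at 2000 ms.
    sample = [10] * 98 + [2000] * 2
    for name, value in summarize(sample).items():
        print(f"{name:>6}: {value} ms")
```

For this sample the median is 10 ms and the 99th percentile is 2000 ms, while the mean is about 49.8 ms: no request in the sample actually took anything close to that long.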

Deep Dive, EPL Dotplots

While working at RIM, I had the privilege of working with some brilliant engineers. During that time I developed a few of the techniques that I’ll be describing: the EPD (Event-Pair-Difference) graph described in my previous post and the EPL (Event-Pair-Latency) Dotplot are two of them. Continue reading “Deep Dive, EPL Dotplots”

Shades of Grey

System failures are often not black and white, but shades of grey (gray?)..

Detecting and alerting on “performance-challenged” system components is a lot more difficult than detecting black-or-white (catastrophic) failures. The metrics used are usually of the “time vs. latency” or “time vs. event count” variety, often aggregated, and often aggregated by using averages. All of these tend to obscure what we are looking for and have a very low “signal to noise ratio”.

Continue reading “Shades of Grey”