System failures are often not black and white, but shades of grey (gray?)..
Detecting and alerting on “performance-challenged” system components are a lot more difficult than detecting black or white (catastrophic failures). The metrics used are usually of the “time vs. latency” or “time vs. event count” variety, often aggregated and, often by using averages. All of these tend to obscure what we are looking for and have a very low “signal to noise ratio“.