Monitoring the SRE Golden Signals, an excellent overview by Steve Mushero.
Steve Mushero gives an awesome overview of which metrics to monitor, and how, for various services. He describes how the “USE” and “RED” methods overlap and combine, explains why they matter and what to do with them, and covers some of the challenges.
Be sure to check out some of the links to other articles, especially those from Baron Schwartz and Brendan Gregg:
- Monitoring and Observability with USE and RED
- Why Percentiles Don’t Work the Way you Think
- Anomaly Detection for Monitoring
- The Essential Guide to Queueing Theory
- Practical Scalability Analysis with the Universal Scalability Law
- The USE Method
Steve, Baron and Brendan all agree that saturation is important for understanding where the bottlenecks are. They suggest using “queue length”, and sometimes “concurrency”, to measure saturation indirectly. However, these metrics are often difficult to obtain and unreliable.
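When concurrency cannot be read directly from a service, one common way to approximate it is Little's Law, which says that average concurrency equals throughput multiplied by average latency. The sketch below is only an illustration of that idea, not something taken from the linked articles: the function names, the sample numbers and the `worker_pool_size` parameter are all assumptions.

```python
def estimate_concurrency(requests_per_second: float, mean_latency_seconds: float) -> float:
    """Estimate average in-flight requests with Little's Law: L = lambda * W."""
    if requests_per_second < 0 or mean_latency_seconds < 0:
        raise ValueError("throughput and latency must be non-negative")
    return requests_per_second * mean_latency_seconds


def concurrency_utilization(concurrency: float, worker_pool_size: int) -> float:
    """Fraction of a (hypothetical) fixed worker pool that is busy on average.

    This is an assumed, simplified proxy for saturation. It is not the
    Saturation Factor mentioned in this post.
    """
    if worker_pool_size <= 0:
        raise ValueError("worker_pool_size must be positive")
    return concurrency / worker_pool_size


if __name__ == "__main__":
    # Assumed sample values: 200 requests/s at 50 ms mean latency, 16 workers.
    concurrency = estimate_concurrency(200.0, 0.05)
    utilization = concurrency_utilization(concurrency, 16)
    print(f"estimated concurrency: {concurrency:.1f}")
    print(f"pool utilization: {utilization:.0%}")
```

Because the estimate is built from averages over a time window, it smooths over short bursts, which is one reason this kind of indirect measure can be unreliable.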
The Saturation Factor I’ve written about may be a more direct method in many cases: it is often easier to calculate, less noisy and more accurate.