I’m looking for help naming (and finding other uses for) a novel technique for detecting grey failures. Possible use cases are discussed here: load balancing, finding saturation points, and alerting. [ed. Decided on the name “Saturation Factor”.]
I originally named it “Event Pair Differences” a decade ago and have written about it before. That name describes how the metric is derived, but it needs a better name that describes what it does or what it is capable of. Potentially, it can do much more than just graph performance-challenged components and services!
This technique is a measure of whether a component/service is keeping up with the requests coming in. It is not a measure of latency or concurrency. Whether the operations normally take milliseconds or minutes, the metric is available immediately (not only once the operations have finished and latency measurements are available).
1. Load Balancing
The fact that the metric is available immediately (well, within your bucket size, typically seconds) with a very high signal-to-noise ratio suggests that it might be an ideal candidate for identifying components that are at capacity and load balancing around them.
Load balancing methods can generally be characterized as either static (predetermined) or dynamic. (Note that methods can also be combined.)
Current methods include:
- Round robin: often used, static, does not take loading into consideration.
- Randomized: similar to round robin, static.
- Least number of connections or concurrency: does not use actual loading.
- Other resource loading factors (e.g. CPU, memory).
- Latency of requests: does not respond immediately, usually noisy.
These are broad categories and I’m sure there are others. None of the usual methods responds directly to a component’s ability to deliver responses in a timely manner. Dynamic methods are normally better at balancing and responding to changes, but don’t do as well as they might. Dropbox had some interesting comments on load balancing with Bandaid, and Netflix shared Eureka!, its cloud load balancing tool.
1.1 Resource Usage
Resource usage (e.g. connections, CPU, memory) is at best an indirect guess as to how the component will respond. There are other dynamic variables generally outside the scope of measuring resource usage for a component. For example, virtual machines (or for that matter, containers) are often impacted by other activity on the same hosts. Therefore, the resource usage of a VM or container alone is not enough to accurately predict response latencies for requests directed at a specific virtual instance.
A hypothetical example might be two of the same components, with one component in a happy place at 50 concurrent connections while another component on a busy host is getting bogged down at 20 concurrent requests. Resource usage is an imperfect, indirect predictor of performance on a given component without considering the other contributing factors.
1.2 Request Latencies
Request latencies require measurements from requests that have finished (and are therefore not available until they do). There is a built-in delay in using this information, and it’s usually quite noisy (due to host factors, variability of request complexity, caching, etc.). Some form of smoothing is usually required to reduce the noise, which means state must be maintained across measurements and calculations performed over them.
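To make the point about state concrete, here is a minimal sketch of one common smoothing approach, an exponentially weighted moving average. The class name `SmoothedLatency` and the `alpha` value are my own illustrative choices, not taken from any particular load balancer. Note that the smoother has to remember its previous value between samples, and a single slow request still drags the estimate around for several samples afterwards.

```python
class SmoothedLatency:
    """Exponentially weighted moving average of request latencies (hypothetical helper)."""

    def __init__(self, alpha=0.2):
        self.alpha = alpha   # weight given to the newest sample (assumed value)
        self.value = None    # state that must be carried between samples

    def update(self, latency_ms):
        # A sample only exists once its request has finished.
        if self.value is None:
            self.value = latency_ms
        else:
            self.value = self.alpha * latency_ms + (1 - self.alpha) * self.value
        return self.value


smoother = SmoothedLatency(alpha=0.5)
for sample in [10.0, 10.0, 50.0, 10.0]:
    smoothed = smoother.update(sample)

print(smoothed)  # still above 10 ms one sample after the 50 ms outlier
```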
1.3 Realtime Component Request Deficit
Combining this new technique with others would enable cheap, accurate and timely routing around temporarily performance-challenged components. A component’s ability to “keep up” is measured directly and available immediately. If all components were becoming performance-challenged, mitigating responses such as load shedding or backoff mechanisms could be triggered. Quickly.
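As a rough sketch of how such routing might look, assume each backend reports its deficit for the latest bucket (requests started minus requests finished). The function name `pick_backend` and the `shed_threshold` value are hypothetical. A real balancer would likely combine this with one of the methods listed earlier rather than use it alone.

```python
def pick_backend(deficits, shed_threshold=5):
    """Choose the backend that is keeping up best.

    deficits: dict of backend name -> latest-bucket deficit
              (requests started minus requests finished).
    Returns (backend_name, shed), where shed is True when every
    backend has drifted past shed_threshold (an assumed tuning knob).
    """
    if not deficits:
        raise ValueError("no backends to choose from")
    # Closest to zero means the backend is finishing work as fast as it arrives.
    best = min(deficits, key=lambda name: abs(deficits[name]))
    # If everything is falling behind, routing alone will not help.
    shed = all(abs(d) > shed_threshold for d in deficits.values())
    return best, shed


choice = pick_backend({"a": 0, "b": 7, "c": 2})
overloaded = pick_backend({"a": 9, "b": 7})
```

In the first call only one backend is struggling, so traffic can simply be steered away from it. In the second call every backend is past the threshold, which is the signal to trigger load shedding or backoff.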
2. True Capacity Testing
Usual capacity testing involves ramping up the request rate and measuring the point at which a given ratio of requests exceeds a given latency. This is how we normally measure SLO.
It is a valid technique unless we are searching for why we can’t push components faster. To find that answer, you need to push a component into saturation, where it can no longer keep up with the rate of requests. At that point, you should be able to find out what the bottleneck is and answer questions such as “why can’t we double the performance?” The usual methods of capacity testing won’t find the saturation point accurately, making searches for bottlenecks harder. Pushing components until they break complicates finding the bottlenecks that prevent them from going faster.
The realtime component request deficit metric will give that saturation point accurately, usually before things actually “break”, enabling a more accurate search for the bottlenecks in real time.
This will also enable finding where along the hockey-stick the SLO is, giving you more information regarding what redundancy is needed to maintain SLO, given a number of component failures.
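A minimal sketch of locating the saturation point during a ramp test, assuming you record one (offered rate, deficit) pair per bucket as the load increases. The names `find_saturation_rate`, `tolerance` and `sustained` are illustrative, and the sample numbers are made up for the example rather than measured.

```python
def find_saturation_rate(ramp, tolerance=2, sustained=3):
    """Return the offered rate at which the component stops keeping up.

    ramp: list of (offered_rate, deficit) pairs, one per bucket, in ramp order.
    A single noisy bucket is ignored; the deficit must stay beyond
    `tolerance` for `sustained` consecutive buckets (both assumed knobs).
    Returns the rate at the start of that run, or None if never saturated.
    """
    run_start = None
    run_length = 0
    for rate, deficit in ramp:
        if abs(deficit) > tolerance:
            if run_length == 0:
                run_start = rate
            run_length += 1
            if run_length >= sustained:
                return run_start
        else:
            run_length = 0
    return None


# Made-up ramp: one blip at 400 req/s, then a sustained deficit from 600 req/s.
ramp = [(100, 0), (200, 1), (300, 0), (400, 4),
        (500, 1), (600, 6), (700, 15), (800, 31)]
saturation_rate = find_saturation_rate(ramp)
```

Because the deficit is available per bucket, the ramp can be stopped as soon as the run is detected, before anything actually breaks.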
3. Realtime Grey Failure Alerting
Several qualities of this metric are valuable as part of an alerting system. Components that kick over and fail catastrophically are easy to detect; grey failures, not as much. The usual methods are prone to false negatives/positives and also have issues in determining where to draw the thresholds.
The realtime component request deficit metric has the following qualities, making it ideal for detecting grey failures:
- High signal to noise ratio — can be tuned for delivering very few false positives/negatives.
- Realtime — news as it happens, not when it’s too late.
- Easy to calculate — no state or heavy math required, simply subtract simultaneous counts of two normally consecutive events that sandwich an interesting operation or service.
- Sortable — deviation from zero across all components gives the worst performing.
- Actionable — the high signal to noise gives confidence in automating mitigation, such as load shedding or backoffs.
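The calculation and the sort described in the list above can be sketched in a few lines. This assumes the two events are a "request started" and a "request finished" log line with timestamps in seconds. The function names and the one-second bucket are my own choices for illustration.

```python
from collections import Counter


def bucket_counts(timestamps, bucket_seconds=1.0):
    """Count events per time bucket."""
    return Counter(int(t // bucket_seconds) for t in timestamps)


def deficits_per_bucket(start_times, finish_times, bucket_seconds=1.0):
    """Per bucket: number of start events minus number of finish events."""
    starts = bucket_counts(start_times, bucket_seconds)
    finishes = bucket_counts(finish_times, bucket_seconds)
    buckets = sorted(set(starts) | set(finishes))
    # Counter returns 0 for buckets where one of the events never occurred.
    return {b: starts[b] - finishes[b] for b in buckets}


def rank_worst_first(latest_deficit):
    """Sort component names by deviation from zero, worst first."""
    return sorted(latest_deficit,
                  key=lambda name: abs(latest_deficit[name]),
                  reverse=True)


deficits = deficits_per_bucket(
    start_times=[0.1, 0.4, 0.9, 1.2, 1.5, 1.7, 1.9],
    finish_times=[0.3, 0.6, 1.0, 1.4],
)
ranking = rank_worst_first({"api": 0, "db": 12, "cache": -3})
```

Each bucket is computed independently of the others, so nothing has to be remembered from one bucket to the next. A healthy component hovers around zero, which is what makes a simple threshold on the deviation workable for alerting.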
While the technique makes compelling graphs when grey failures are taking place, my belief is that it could be used for so much more.
- Shades of Grey
- Deep Dive, EPL Dotplots
- Meet Bandaid, the Dropbox service proxy: discusses some of their observations about load balancing.
- Active Benchmarking, where Brendan Gregg asks the question: what is preventing a component from going faster?
- The Essential Guide to Queueing Theory by VividCortex, where Baron Schwartz gives us an excellent refresher and discusses the “hockey-stick” utilization graph and why this happens with Markovian processes.
- Netflix Shares Cloud Load Balancing And Failover Tool: Eureka!