Reading Week #2

First of all, Merry Christmas if you celebrate it, Happy Holidays if you don’t! This week’s interesting read is about a subject I love.. Investigating problems using log files!

I found this week’s interesting read on Adrian Colyer’s blog, the morning paper.  It’s a summary of the ACM paper, Identifying impactful service system problems via log analysis.

My first impression is that it’s a solution looking for a problem. The technology is interesting, but perhaps more complicated than it needs to be.

As SREs, we don’t have much control over MTBF (mean time between failures).  Shit happens.  Part of our job is to help accelerate change in the form of new releases, which are often the cause of issues in production.

What we have more control over is the MTTA, MTTM and MTTR: mean time to alert, mitigate and repair, respectively.  Observability, metrics and log files are all instrumental in shortening these times to improve the user experience.
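As a rough illustration of these metrics (my own sketch, not from the paper; the incident fields and timestamps are hypothetical), each is just a mean of elapsed times measured from the moment of failure:

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records; the field names are invented for this sketch.
incidents = [
    {"failed": "2019-12-20T10:00:00", "alerted": "2019-12-20T10:03:00",
     "mitigated": "2019-12-20T10:20:00", "repaired": "2019-12-20T11:05:00"},
    {"failed": "2019-12-22T02:10:00", "alerted": "2019-12-22T02:11:00",
     "mitigated": "2019-12-22T02:25:00", "repaired": "2019-12-22T03:40:00"},
]

def minutes_between(start, end):
    """Elapsed minutes between two ISO-8601 timestamps."""
    fmt = "%Y-%m-%dT%H:%M:%S"
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds() / 60

mtta = mean(minutes_between(i["failed"], i["alerted"]) for i in incidents)
mttm = mean(minutes_between(i["failed"], i["mitigated"]) for i in incidents)
mttr = mean(minutes_between(i["failed"], i["repaired"]) for i in incidents)

print(f"MTTA={mtta:.1f} min, MTTM={mttm:.1f} min, MTTR={mttr:.1f} min")
```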

Sometimes problems are instantaneous: component or system failures.  Often, however, there are early warning signs: components that are still alive, but falling behind in processing.  We obviously want to detect potential issues as early as possible, before they impact users.  How we usually find out about problems can be summarized as:

  1. Users tell us..  We’d like to avoid finding out this way!
  2. Components or systems completely fail.  This should be easy to detect: they stop responding.  We’d also like to prevent or mitigate this if possible, before it leads to #1.
  3. Errors in log files should also be easy to detect, quantify and alert on.
  4. AI, not Artificial Intelligence but Artificial Ignorance, can be used to find unusual issues: log output from code that isn’t often executed and may spell trouble.  These are more difficult to alert on, but may help in diagnosing issues (see the sketch after this list).
  5. Unusually long (or even short) latencies or event counts which fall outside of a “normal” range.  Both of these “usual” metrics have problems with setting thresholds that minimize false positives and negatives, which is traditionally very difficult to “get right”.  I’ve seen latency values spanning at least 5 orders of magnitude which were all “normal”, and wrote about it here.
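A minimal sketch of the Artificial Ignorance idea from item 4 (my own illustration; the patterns and log lines are invented): throw away the log lines that match patterns we already know to be boring, collapse whatever is left into rough templates, and count them.

```python
import re
from collections import Counter

# Patterns for log lines we already know and expect; everything else is "interesting".
# These regexes are illustrative only.
KNOWN_BORING = [
    re.compile(r"GET /healthz 200"),
    re.compile(r"connection pool size: \d+"),
    re.compile(r"request completed in \d+ ms"),
]

def artificial_ignorance(lines):
    """Return counts of log templates that match none of the known patterns."""
    unusual = Counter()
    for line in lines:
        if any(p.search(line) for p in KNOWN_BORING):
            continue
        # Collapse numbers so similar lines group into one rough "template".
        template = re.sub(r"\d+", "<N>", line.strip())
        unusual[template] += 1
    return unusual

log = [
    "GET /healthz 200",
    "request completed in 12 ms",
    "retrying shard 7 after timeout",
    "retrying shard 9 after timeout",
]
for template, count in artificial_ignorance(log).most_common():
    print(count, template)
```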

The Tail at Scale

The authors of this paper acknowledge the tail at scale and mention that it is why they developed their cascading clustering technique.

“The distribution is highly imbalanced and exhibits strong long-tail property, which poses challenges for log-based problem identification.”

..and

“However, the conventional clustering methods are incredibly time-consuming when the data size is large because distances between any pair of samples are required.  As mentioned in Section 2, log sequences follow the long tail distribution and are highly imbalanced. Based on the observation, we propose a novel clustering algorithm, cascading clustering, to group sequence vectors into clusters (different log sequence types) promptly and precisely, where each cluster represents one kind of log sequence (system behavior).”
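Reading between the lines of that quote, here is how I understand the cascading idea, expressed as a minimal Python sketch (my own paraphrase, not the authors’ implementation; sample_size and threshold are invented parameters): sample the remaining sequence vectors, cluster the sample, match the whole population against the resulting representatives, discard the matches, and repeat on the residual.

```python
import numpy as np

def cascading_clustering(vectors, sample_size=100, threshold=0.5, rng=None):
    """Rough sketch of cascading clustering: sample, cluster the sample,
    match everything against the cluster representatives, remove the
    matched vectors, and repeat on whatever is left."""
    if rng is None:
        rng = np.random.default_rng(0)
    remaining = np.asarray(vectors, dtype=float)
    representatives = []
    while len(remaining) > 0:
        # 1. Sample a small subset of the remaining sequence vectors.
        idx = rng.choice(len(remaining), size=min(sample_size, len(remaining)), replace=False)
        sample = remaining[idx]
        # 2. Greedily cluster the sample: each vector joins the first
        #    representative within `threshold`, otherwise it starts a new cluster.
        reps = []
        for v in sample:
            if not any(np.linalg.norm(v - r) <= threshold for r in reps):
                reps.append(v)
        representatives.extend(reps)
        # 3. Match all remaining vectors against the new representatives and
        #    drop the matched ones; the loop then repeats on the residual.
        dists = np.linalg.norm(remaining[:, None, :] - np.array(reps)[None, :, :], axis=2)
        matched = (dists <= threshold).any(axis=1)
        remaining = remaining[~matched]
    return np.array(representatives)

# Example usage with synthetic data.
vectors = np.random.default_rng(1).normal(size=(1000, 8))
print(len(cascading_clustering(vectors, sample_size=50, threshold=3.0)))
```

Because most log sequences fall into a handful of very large clusters, the bulk of the data is matched and removed in the first few passes, which is what avoids computing distances between every pair of samples.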

The fact that “normal” covers such a wide range for typical KPIs, and that longer-than-expected latencies can have such a huge impact on SLOs and SLAs (possibly even predicting complete failure), is why this is such an interesting problem to solve.

Saturation Factor

The technique which I’ve successfully used for early warnings of grey failures is simple, easy to do in real time, and has a very high signal-to-noise ratio.  I’ve described the saturation factor in the articles Shades of Grey and Realtime Component Request Deficit.

The technique is often accurate and fast enough to mitigate issues before they kick over components.  It can also be used after an incident, analyzing log files to draw boundaries around what happened, which components were affected and when.  EPL Dotplots can then sometimes be used to suggest what happened.
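The precise formulation lives in those articles; as a hedged approximation (my own simplification, with an invented capacity parameter), the idea can be sketched as a rolling request deficit, requests received minus responses sent, compared against a component’s normal concurrency:

```python
from collections import deque

class RequestDeficitMonitor:
    """Rough sketch, not the exact formulation from the linked articles:
    track requests in vs. responses out per interval and flag a component
    whose outstanding work keeps growing (a possible grey failure)."""

    def __init__(self, capacity, window=60):
        self.capacity = capacity          # assumed normal concurrency for the component
        self.deficits = deque(maxlen=window)

    def record_interval(self, requests_in, responses_out):
        # Positive values mean the component fell behind during this interval.
        self.deficits.append(requests_in - responses_out)

    def saturation_factor(self):
        # Cumulative deficit over the window, relative to capacity.
        return sum(self.deficits) / self.capacity

    def is_saturating(self, limit=1.0):
        return self.saturation_factor() > limit

monitor = RequestDeficitMonitor(capacity=50)
for requests_in, responses_out in [(100, 100), (120, 110), (130, 105), (140, 100)]:
    monitor.record_interval(requests_in, responses_out)
print(monitor.saturation_factor(), monitor.is_saturating())
```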

Summary

Technically, it was an interesting paper to read and I am eager to try the technique out at some point.  I may be wrong, but I believe that the saturation factor is a more useful technique for identifying grey failures (suboptimal performance): simpler, faster and probably more accurate, judging from their own success rates.