Don’t Aggregate, Consolidate!

In previous posts, I’ve emphasized that averages are particularly bad at characterizing the behaviour you’re usually looking for. But the problem goes further than averages: storing aggregated data of any kind limits your ability to analyze that data later.

If you’re involved in photography, you know that storing original negatives or raw files gives you much more latitude when processing a photo later. Anything else loses “bits” of information that you can never recover.

The same thing applies to storing aggregated data rather than the raw event data. Most of the techniques I’ve developed over the years for performance engineering require that raw event data. Some of my favourite techniques, described in the following articles, would not be possible otherwise (see the sketch after this list):

  1. Event-Pair-Difference Graphs
  2. Event-Pair-Latency Dotplots
  3. Various Methods for looking at Latency Distributions
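To make the point concrete, here is a minimal sketch of the kind of analysis the second item depends on: pairing start and end events to get one latency point per request. It assumes a consolidated CSV of raw events with `request_id`, `event`, and `timestamp` columns; the column names and event names are illustrative, not taken from the articles above.

```python
import csv

def pair_latencies(path, start_event="request_start", end_event="request_end"):
    """Pair start/end events from a consolidated event file into latencies.

    Assumes a CSV of raw events with columns request_id, event, timestamp
    (seconds since the epoch). All names here are illustrative.
    """
    starts = {}
    latencies = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            rid, event, ts = row["request_id"], row["event"], float(row["timestamp"])
            if event == start_event:
                starts[rid] = ts
            elif event == end_event and rid in starts:
                start_ts = starts.pop(rid)
                # One latency point per event pair; a stored average could
                # never be unpacked back into these individual points.
                latencies.append((start_ts, ts - start_ts))
    return latencies

# Each (start_time, latency) pair is exactly one dot on an
# event-pair-latency dotplot, and the full list feeds a latency distribution.
```

Once the raw events have been rolled up into an average or a histogram, nothing like this is possible: the individual pairs are gone.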

Some people would have you think that storing all the raw data would be too expensive in terms of storage. Not so. First of all, storage is cheap these days. Second, what you want to do is filter the raw log data into a form that’s easy to process: consolidate the raw event data that you really want and throw away the rest. (Well, keep the raw log data for a little while, but you can probably store the raw event data forever.) I’ll have more on this, along with some scripts, in a future article.
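As a rough idea of what I mean by consolidating, here is a hypothetical filtering script. The log-line format, field names, and event names are assumptions made up for the example, and this is a sketch rather than one of the scripts I’ll cover later.

```python
import csv
import re
import sys

# Hypothetical log line: "1714564800.123 INFO req=42 event=request_start"
# The format and field names are assumptions for illustration only.
LINE_RE = re.compile(r"(?P<ts>\d+\.\d+)\s+\S+\s+req=(?P<rid>\d+)\s+event=(?P<event>\w+)")

# Keep only the event types you actually care about; everything else is discarded.
KEEP = {"request_start", "request_end"}

def consolidate(log_path, out_path):
    """Filter raw log lines down to a compact CSV of raw events."""
    with open(log_path) as src, open(out_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(["request_id", "event", "timestamp"])
        for line in src:
            m = LINE_RE.search(line)
            if m and m.group("event") in KEEP:
                writer.writerow([m.group("rid"), m.group("event"), m.group("ts")])

if __name__ == "__main__":
    consolidate(sys.argv[1], sys.argv[2])
```

The output is small enough to keep around indefinitely, yet it still contains every individual event, so any of the techniques listed above can be applied to it later.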