I’ve worked with event logs for, well, decades. There are quite a few companies offering log-management services and, as far as I can tell, only a few doing it right.
Low tech beats high tech in almost all use cases for storing time-series event log data, yet most companies in this space tack on indexing technologies like Elasticsearch and Lucene to do data lookups.
Consider this: when you are trying to diagnose or analyze system issues, you are almost certainly looking at specific applications and time frames. Storing the data strategically would let you locate the data you’re interested in deterministically, avoiding indexes altogether!
Reading Data Sequentially
Once you’ve found the time-series data you want, almost every use case, such as graphing or analysis, entails reading the data sequentially. With indexes, this operation is actually slower and less efficient, requiring many more disk accesses. (Try exporting, say, a million events of your data from your logging provider. If they use indexes, this may bring their system to its knees, if it’s possible at all!)
Indexes also dramatically increase the amount of data stored. Storing log data in Elasticsearch can multiply the space required several times over. Without indexes, the data can actually be compressed to a fraction of its original size.
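As a rough illustration of why raw log text compresses so well (repeated fields, shared prefixes), here’s a toy measurement using Python’s zlib. The log lines are synthetic and the exact ratio depends entirely on your data:

```python
import zlib

# Synthetic, repetitive log lines; real-world ratios will vary.
lines = "".join(
    f"2023-11-14T22:13:{s:02d} host42 GET /api/items 200 {s} ms\n"
    for s in range(60)
)
raw = lines.encode()
packed = zlib.compress(raw, level=9)
print(len(raw), len(packed))  # compressed size is a small fraction of raw
```

Dedicated log stores can do much better than generic zlib, but even this naive pass shrinks the data severalfold.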
Things That Break
There are fewer things that can break when you go low-tech. Indexing data requires a lot of resources, and having worked with Elasticsearch for logging recently, I can tell you that what breaks most is indexing. It might not always be the root cause, but it’s invariably involved in some way. During a major incident, indexing (and therefore access to your data) can fall behind by hours.
One of my biggest frustrations with Elasticsearch-based logging, however, was being unable to do much with the data because of cardinality issues. Can I create some graphs using the latency field in the log lines? No, because it’s a “high-cardinality field.” You’d essentially have to export the data and do your own graphing, and see above for how difficult exporting is.
So: K.I.S.S. is cheaper, faster, and breaks less often; you can store the data for longer periods and use fields of any cardinality. What’s not to like?!
I can only describe how I’ve done it in the past. At a high level, it’s simply dropping the data onto NFS storage in a directory structure that facilitates fast lookup. For example, the directory structure I used was similar to this:
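The original layout isn’t shown here, so as a hypothetical sketch, imagine bucketing by application and hour, so that any (app, time range) query maps directly to a handful of files. The `log_path` helper, the `/mnt/logs` root, and the hourly granularity are all my assumptions, not the author’s actual scheme:

```python
from datetime import datetime, timezone

def log_path(root: str, app: str, ts: float) -> str:
    """Deterministically map an app name and event timestamp to a file,
    assuming a hypothetical app/year/month/day/hour bucket layout."""
    t = datetime.fromtimestamp(ts, tz=timezone.utc)
    return f"{root}/{app}/{t:%Y/%m/%d/%H}.log"

print(log_path("/mnt/logs", "checkout", 1700000000))
# → /mnt/logs/checkout/2023/11/14/22.log
```

The point is that lookup is pure arithmetic on the path: no index needs to be consulted, and listing a time range is just enumerating the buckets between two timestamps.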
The NFS storage was accessible by workers that would be spawned concurrently in a brute-force scatter/gather/sort pipeline, which did any remaining filtering.
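A minimal sketch of that scatter/gather/sort idea, assuming each time bucket is already in timestamp order (the shard data, predicate, and function names below are illustrative, not the author’s code):

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def scan_shard(lines, predicate):
    # Scatter: each worker brute-force filters one time bucket.
    return [line for line in lines if predicate(line)]

def query(shards, predicate):
    # Gather the per-shard results, then merge; since each bucket is
    # already time-ordered, heapq.merge yields one sorted stream.
    with ThreadPoolExecutor() as pool:
        results = pool.map(scan_shard, shards, [predicate] * len(shards))
        return list(heapq.merge(*results))

shards = [
    ["2023-11-14T22:00 a err", "2023-11-14T22:05 a ok"],
    ["2023-11-14T22:01 b err"],
]
print(query(shards, lambda line: "err" in line))
# → ['2023-11-14T22:00 a err', '2023-11-14T22:01 b err']
```

Because the buckets are small and pre-sorted, the “sort” step is just a cheap k-way merge rather than a full sort of the result set.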
The devil is in some of the details, compression for example. I had also factored out invariant data, leaving only the timestamps and the variable fields, stored positionally, to compress. Log lines could be recreated on the fly by restoring the invariant data associated with each log line. But the magic was that analyzing the data became far easier and compression much better. We were able to store terabytes of compressed data spanning many years with no speed penalty, which was instrumental in reviewing past issues and resolving current ones.
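To make the invariant-factoring idea concrete, here is a toy round-trip sketch. The heuristic (treating any token containing a digit as a variable) and the function names are mine, purely for illustration; real systems use far more careful template extraction:

```python
def factor(line, templates):
    """Split a line into a template id plus its variable fields.
    Toy assumption: any token containing a digit is a variable."""
    tokens = line.split()
    tmpl = tuple("<*>" if any(c.isdigit() for c in t) else t for t in tokens)
    values = [t for t in tokens if any(c.isdigit() for c in t)]
    tid = templates.setdefault(tmpl, len(templates))  # dedupe invariant part
    return tid, values

def restore(tid, values, templates):
    """Recreate the original line from its template id and variables."""
    tmpl = next(t for t, i in templates.items() if i == tid)
    it = iter(values)
    return " ".join(next(it) if tok == "<*>" else tok for tok in tmpl)

templates = {}
line = "GET /api/user latency 42 ms status 200"
tid, values = factor(line, templates)
print(tid, values)                            # → 0 ['42', '200']
print(restore(tid, values, templates) == line)  # → True
```

The invariant template is stored once per pattern, so only the timestamps and variable columns need to be compressed, and analysis (say, graphing that latency column) becomes a scan over one positional field.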
The other option, which I alluded to above, is using a service that does this for you. Humio has done a much better job of providing a service than any of my old home-grown techniques. I’m guessing some of the underlying technology is similar: deterministically selecting time-series data fast, without using indexes. If you’re on the lookout for a great logging solution, I highly recommend going through their tutorial and documentation, which I’m in the process of doing now. Note that I’m not affiliated with Humio; I’m currently researching the market and like what I see there.