We want to learn things from any idea, test, change, upgrade or (heaven forbid) outage in production..
Author: bduncan
Seeking SRE
This book just arrived this morning and I’m just through the chapter on building SRE teams. Continue reading “Seeking SRE”
Hiring Questions, Problem 2
While most technical hiring questions aren’t all that relevant, this one might be more generally useful. Find duplicate files; the trick was the speedup.. Continue reading “Hiring Questions, Problem 2”
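The excerpt doesn't say what the speedup was, so the following is only a minimal sketch under one common assumption: files of different sizes can't be duplicates, so only files that share a size need to be hashed. The names `file_digest` and `find_duplicates` are made up for illustration and are not from the original post.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Return sorted groups of paths under root whose contents are identical."""
    # Pass 1 (cheap): bucket regular files by size using metadata only.
    by_size = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    # Pass 2 (expensive): hash only the files that share a size.
    duplicates = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[file_digest(path)].append(path)
        duplicates.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return sorted(duplicates)
```

On a tree where most files have unique sizes, the second pass touches only a small fraction of the data, which is where the time is usually saved.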
From the Get Go!
Learning a new computer language can be fun! Stretching ourselves to think about problems in new ways.. Continue reading “From the Get Go!”
Realtime Component Request Deficit
Looking for help naming (and finding other uses for) a novel technique for detecting grey failures. Possible use cases are discussed here: load balancing, finding saturation points, alerting.. [ed. Decided on the name “Saturation Factor”.] Continue reading “Realtime Component Request Deficit”
Solving the Right Problems
Ask the right questions, listen carefully and make sure that you’re not hearing just what you want to hear.. Continue reading “Solving the Right Problems”
Don’t Panic!
Some thoughts about handling critical system issues at scale.. Continue reading “Don’t Panic!”
Snowflakes
We called our albino squirrel in the backyard, “Snowflake”..
Operations in the Cloud
As an SRE, I’m very fortunate to have had training as a pilot. There are many similarities to system operations.. Continue reading “Operations in the Cloud”
Interactive bash Scripts
Building interactive commands that use editing history and tab completion can be easy in bash, and they can serve as wrappers for automating tasks. Continue reading “Interactive bash Scripts”
Must Have Books.. Another One!
Not just “Must Have”, but “Must Read!”. A new book has been released and is available, free to download for a short time. Continue reading “Must Have Books.. Another One!”
vi or emacs? Really?!?
Most of the operations/engineering folks I’ve come into contact with will proclaim themselves “vi” people and yet, when I watch them edit a file, I cringe.. Continue reading “vi or emacs? Really?!?”
CI/CD and Optimization
When we talk of CI/CD we’re often referring to Continuous Integration and Delivery while Optimization refers to Services/Systems. What I’d like to discuss is Constant Improvement/Continuous Development and Self-Optimization.. Continue reading “CI/CD and Optimization”
The DevOps Alternative
In a previous article, “There’s Always a Problem”, I described situations that can arise with the “Engineering vs. Operations” old way. The new way is a DevOps culture.. Continue reading “The DevOps Alternative”
Preventing Trainwrecks
I keep a large framed photo of this on the wall in my office to remind me what can happen when things go “off the rails”.. Continue reading “Preventing Trainwrecks”
Hiring Questions, Problem 1
A colleague of mine once posted a hiring question to ask prospective developers: “What is the least significant 10 digits of the series: .. ?”
Don’t Aggregate, Consolidate!
In previous posts, I’ve emphasized that averages are particularly bad at characterizing most things that you might be looking for. However, storing aggregated data of any type can limit your ability to analyze data later. Continue reading “Don’t Aggregate, Consolidate!”
awk, the Often Ignored Little Language
Many people use awk for one-liners: picking out fields from logs, doing pattern matching. It’s capable of so much more, however. IMO, the “littleness” of the language is one of its strengths. Continue reading “awk, the Often Ignored Little Language”
Bitrot, Part 2
This article has a link to a simple script I’ve used for over a decade to detect corrupted files. It will detect and report on files that have changed, been added, deleted or possibly moved within the same directory structure. Continue reading “Bitrot, Part 2”
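The script itself lives behind the link, so what follows is not that script, just a minimal sketch of the general idea: record a checksum per file, then compare a later run against the earlier one. The function names and the "same digest under a new path means moved" rule are assumptions made for illustration. A real bitrot check would also look at modification times to tell corruption apart from a legitimate edit.

```python
import hashlib
import os


def snapshot(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    manifest = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                # Reads the whole file at once; acceptable for a sketch.
                digest = hashlib.sha256(f.read()).hexdigest()
            manifest[os.path.relpath(path, root)] = digest
    return manifest


def compare(old, new):
    """Classify the differences between two manifests from snapshot()."""
    added = set(new) - set(old)
    deleted = set(old) - set(new)
    changed = {p for p in set(old) & set(new) if old[p] != new[p]}
    # Assumption: a deleted path whose digest reappears under an added
    # path was probably moved rather than deleted and re-created.
    added_by_digest = {new[p]: p for p in added}
    moved = {p: added_by_digest[old[p]] for p in deleted if old[p] in added_by_digest}
    return {
        "changed": sorted(changed),
        "added": sorted(added - set(moved.values())),
        "deleted": sorted(deleted - set(moved)),
        "moved": sorted(moved.items()),
    }
```

Saving the output of `snapshot()` between runs (as JSON, for example) is left out to keep the sketch short.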
Bitrot, Part 1
Your systems have drives set up in RAID configurations and besides, you have data copied to redundant systems and backups, right? Safe? Maybe not. I recently found corruption in a quarter of a million files that had not previously been detected, for years! Continue reading “Bitrot, Part 1”
There’s Always a Problem
Do you have insatiable curiosity, and are you driven by a relentless pursuit of the truth? You might make a great problem solver, but be careful how you deal with your findings! Continue reading “There’s Always a Problem”
Look Up the Stack!
If you’ve been around systems long enough, you know that the opportunity for performance gains goes up dramatically the further up the stack you look.. Continue reading “Look Up the Stack!”
(Ab)use of the R Language
For years I’ve done most of my log scraping and analysis with the usual suspects: bash, sed, awk, even perl. The log scraping still uses those tools, but lately I’ve been toying around with “R” for the analysis. Continue reading “(Ab)use of the R Language”
Averages Mostly Suck at Almost Everything..
..unless you’re dealing with baseball. When dealing with systems, many of us think “Average” is a measure of “Typical” or “Normal”. Many systems people will also use averages to look for “Abnormal”. However, average (or mean) doesn’t represent either “normal” or “abnormal” very well.. Continue reading “Averages Mostly Suck at Almost Everything..”
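A small made-up example may help show the point. The latency numbers below are invented for illustration (98 fast requests and 2 very slow ones), and `percentile` is a simple nearest-rank helper written for this sketch, not something from the post.

```python
import math
import statistics


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Invented data: 98 requests at 10 ms, plus two slow ones.
latencies_ms = [10] * 98 + [2000, 3000]

mean_ms = statistics.mean(latencies_ms)
median_ms = statistics.median(latencies_ms)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms} median={median_ms} p99={p99_ms}")
```

Here the mean lands near 60 ms, a latency that no request actually experienced: it is about six times the typical 10 ms request and far below the slow ones. The median describes "normal" and the high percentile exposes "abnormal".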
WSMeter: Performance Evaluation for Warehouse-Scale Computers
Many of us have dealt with making changes in production environments, possibly against hundreds or thousands of systems, and we’d like to know how the change affected performance. It was with this in mind that I eagerly read through the paper describing WSMeter. Continue reading “WSMeter: Performance Evaluation for Warehouse-Scale Computers”
Friday the 13th One Liner
Just for fun, how many combinations of months are there where Friday falls on the 13th? This one-liner will print out a table of month combinations along with the years for a given range. Continue reading “Friday the 13th One Liner”
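The one-liner itself is in the full post. As a rough stand-in, here is a short Python sketch of the same idea (not the original one-liner): for each year, collect the months whose 13th is a Friday, then group the years by that combination. The function names are made up for this example.

```python
import datetime
from collections import defaultdict


def friday_13th_months(year):
    """Return the months (1-12) of the given year in which the 13th is a Friday."""
    # date.weekday() numbers Monday as 0, so Friday is 4.
    return tuple(m for m in range(1, 13) if datetime.date(year, m, 13).weekday() == 4)


def combinations_by_year(first, last):
    """Group the years first..last (inclusive) by their Friday-the-13th month combination."""
    table = defaultdict(list)
    for year in range(first, last + 1):
        table[friday_13th_months(year)].append(year)
    return dict(table)


if __name__ == "__main__":
    for months, years in sorted(combinations_by_year(2000, 2030).items()):
        print(months, years)
```

The number of keys in the returned table is the number of distinct month combinations seen in the chosen range.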
Deep Dive, EPL Dotplots
While working at RIM, I had the privilege of working with some brilliant engineers. During that time I developed a few of the techniques that I’ll be describing, among them the EPD (Event-Pair-Difference) graph described in my previous post and the EPL (Event-Pair-Latency) Dotplot. Continue reading “Deep Dive, EPL Dotplots”
Shades of Grey
System failures are often not black and white, but shades of grey (gray?)..
Detecting and alerting on “performance-challenged” system components is a lot more difficult than detecting black-or-white (catastrophic) failures. The metrics used are usually of the “time vs. latency” or “time vs. event count” variety, often aggregated, and frequently as averages. All of these tend to obscure what we are looking for and have a very low “signal to noise ratio”.