Your systems have drives set up in RAID configurations, and besides, you have data copied to redundant systems and backups, right? Safe? Maybe not. I recently found corruption in a quarter of a million files that had gone undetected for years!
RAID in redundant configurations only protects against drive failures; ’nuff said.
Backups will only save you if they actually work (you do trial restores, right?) and if you can detect data corruption.
Redundant copies? Again, you have to have a mechanism to detect corruption and, more importantly, actually use it.
In the case where I found a quarter of a million corrupt files, md5 hashes were stored in the metadata for the files, but they were being checked so slowly that the checking was ineffective. (There were probably a trillion or so files and petabytes of data in total. We’ll examine the kinds of corruption found in a future article.)
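To make the idea concrete, here is a minimal Python sketch of that kind of check. It assumes the stored hash lives in a Linux extended attribute named user.md5; that attribute name is my own placeholder, since the exact way the hashes were kept in metadata isn't important here.

```python
import hashlib
import os

def stored_md5(path):
    """Read a previously stored md5 hex digest from an extended attribute.

    The attribute name 'user.md5' is a placeholder for illustration;
    os.getxattr is Linux-only.
    """
    try:
        return os.getxattr(path, "user.md5").decode().strip()
    except OSError:
        return None  # no stored hash for this file

def actual_md5(path, chunk_size=1 << 20):
    """Recompute the md5 of the file contents, reading in 1 MiB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def is_corrupt(path):
    """A file is flagged only if it has a stored hash and it no longer matches."""
    expected = stored_md5(path)
    return expected is not None and expected != actual_md5(path)
```

The catch, as I found, is not writing this check but running it often enough to cover all of your files before the next round of corruption arrives.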
If you do not have corruption detection built into the data and applications themselves, then you need some other method. Some filesystems, such as ZFS and btrfs, can detect corruption on their own.
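On those filesystems a scheduled "scrub" reads every block and verifies its checksum. Here is a rough sketch of kicking one off from Python, assuming a ZFS pool named tank and a btrfs filesystem mounted at /mnt/data; both names are placeholders, and scrubs normally require root.

```python
import subprocess

ZFS_POOL = "tank"          # placeholder pool name
BTRFS_MOUNT = "/mnt/data"  # placeholder mount point

# ZFS: start a scrub, which reads all data and verifies checksums.
subprocess.run(["zpool", "scrub", ZFS_POOL], check=True)

# btrfs equivalent: scrub the filesystem mounted at BTRFS_MOUNT.
subprocess.run(["btrfs", "scrub", "start", BTRFS_MOUNT], check=True)

# Check on progress and results later:
subprocess.run(["zpool", "status", ZFS_POOL], check=True)
subprocess.run(["btrfs", "scrub", "status", BTRFS_MOUNT], check=True)
```

In practice you would run only the commands for the filesystem you actually have, on a regular schedule, and alert on any reported checksum errors.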
In a future article I’ll present a simple script/solution that you can implement to at least detect corruption when it happens, so that your redundant copies can then be used to fix it. Otherwise, corrupt files may just sit there silently.
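In the meantime, here is a rough sketch of what such a detection pass could look like. This is not the script I’ll be presenting, just one possible shape for it: md5 hashes kept in a JSON manifest, where the manifest name and command-line interface are placeholders.

```python
import hashlib
import json
import os
import sys

MANIFEST = "manifest.json"  # placeholder: maps relative path -> md5 hex digest

def md5_of(path, chunk_size=1 << 20):
    """Hash file contents in 1 MiB chunks to avoid loading large files whole."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(root):
    """First run: record a hash for every file under root."""
    manifest = {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            manifest[os.path.relpath(path, root)] = md5_of(path)
    with open(MANIFEST, "w") as f:
        json.dump(manifest, f, indent=2)

def verify(root):
    """Later runs: flag files whose current hash no longer matches."""
    with open(MANIFEST) as f:
        manifest = json.load(f)
    for rel, expected in manifest.items():
        path = os.path.join(root, rel)
        if not os.path.exists(path):
            print(f"MISSING: {rel}")
        elif md5_of(path) != expected:
            print(f"CORRUPT: {rel}")

if __name__ == "__main__":
    # usage: python verify.py build|verify /path/to/data
    cmd, root = sys.argv[1], sys.argv[2]
    build_manifest(root) if cmd == "build" else verify(root)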