Hiring Questions, Problem 2

While most technical hiring questions aren’t all that relevant, this one might be more generally useful: find duplicate files, where the real trick was the speedup.

The problem statement was something like, “…using your favourite language, find all the duplicate files in a given directory and output the list of sets of identical files.”

Easy enough in bash: use find to list the filenames, pipe them through xargs to a hashing program (e.g. md5sum, shasum), and print the groups of files that hash to the same value. My first stab looked something like this:

find $1 -type f | xargs md5sum |
awk '
  # md5sum output: "<hash>  <filename>"
  $1 in sums {
    # a hash we have already seen: remember it as a duplicate
    dups[$1]++
  }
  {
    # append this filename to the list of files with this hash
    sums[$1] = sums[$1] " " $2
  }
  END {
    # print only the file lists whose hash occurred more than once
    for (i in dups) {
      print sums[i]
    }
  }
'
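
Saved as, say, dupes.sh (any name will do), it takes the directory to scan as its argument:

bash dupes.sh ~/some/directory

Each output line is one set of paths whose contents produced the same checksum; files with a unique checksum never appear.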

The trick was, “How do we speed this up?”  Reading and calculating checksums for all the files might take a long time.  Do we need to check them all?

Of course not!  The file size is stored in the inode, so it can be checked without reading the file at all, and two files can only be identical if they are the same size.  Any file with a unique size can be eliminated immediately!
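
To get a feel for the difference, compare a size lookup with a checksum on any large file (GNU stat shown here; big.img is just a stand-in name):

stat -c %s big.img     # size comes straight from the inode: effectively instant
md5sum big.img         # has to read every byte of the file first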

So, with an almost identical block of awk, we eliminate those and run md5sum only on the remaining files.

find $1 -type f | xargs ls -l |
awk '
  # ls -l output: the size is field 5, the filename is the last field
  $5 in size {
    dups[$5]++
  }
  {
    size[$5] = size[$5] " " $NF
  }
  END {
    # print only the groups of names that share a size
    for (i in dups) {
      print size[i]
    }
  }
' | tr ' ' '\n' | xargs md5sum |
awk '
  # same grouping as before, now keyed on the md5sum hash
  $1 in sums {
    dups[$1]++
  }
  {
    sums[$1] = sums[$1] " " $2
  }
  END {
    for (i in dups) {
      print sums[i]
    }
  }
'

The optimization was dramatic when I tested it on some of my own directories.  A finished script is a little more complicated, since many files have spaces and other nasty characters in their names, but I’ll send it along to anyone interested.
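
I won’t reproduce that finished script here, but here’s a rough sketch of one way to cope with awkward filenames (GNU find and xargs assumed, for -printf and -d; it still breaks if a filename contains a newline):

find "$1" -type f -printf '%s\t%p\n' |
awk -F'\t' '
  {
    # size is the first field; take everything after the first tab as the path
    path[NR] = substr($0, index($0, "\t") + 1)
    size[NR] = $1
    count[$1]++
  }
  END {
    # pass along only paths whose size occurred more than once
    for (i = 1; i <= NR; i++)
      if (count[size[i]] > 1) print path[i]
  }
' |
xargs -r -d '\n' md5sum |
awk '
  {
    hash = $1
    name = substr($0, 35)        # md5 hash is 32 hex chars plus 2 separator chars
    files[hash] = files[hash] name "\n"
    count[hash]++
  }
  END {
    # each group of identical files, one path per line, blank line between groups
    for (h in count)
      if (count[h] > 1) print files[h]
  }
'

It’s the same two-stage idea as above, keyed first on size and then on checksum; the only real change is carrying whole lines and newline-delimited arguments instead of splitting on spaces.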