A bit of recreational code
I was reading Matt Might’s “What every computer science major should know” when I saw this fun-sounding Unix exercise:
Report duplicate MP3s (by file contents, not file name) on a computer.
I’m not too experienced with shell programming, but I thought it might be fun to see if I could come up with a one-liner to identify all dupes and print them out for me. Not super exciting for shell pros, but I enjoyed writing it. My strategy was to compute the SHA-1 of every file and look for digests that show up more than once. I ended up with this slightly obfuscated line:
find . -type f -exec openssl dgst -sha1 {} \; |
awk '{ print $NF, $0 }' | sort -k 1 | uniq -D -w 40
Breaking it down a bit:
find . -type f -exec openssl dgst -sha1 {} \;
runs ‘openssl dgst -sha1’ on every regular file under the current directory, subdirectories included (so it hashes everything, not just MP3s).
awk '{ print $NF, $0 }'
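openssl prints the digest at the end of each line, so this bootleg bit of code copies the last field ($NF, the SHA-1) to the front of the line, where sort can use it as the key. Sorting puts identical digests on adjacent lines, which uniq needs because it only compares neighbours. And finally,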
uniq -D -w 40
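prints every line whose first 40 characters match those of a neighbouring line; 40 is the length of a SHA-1 digest written in hex. (-D and -w are not POSIX uniq options; GNU uniq has both.)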
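To watch the awk, sort and uniq stages work without hashing anything, here is a small sketch that feeds them three made-up lines. It assumes openssl's output looks like SHA1(path)= digest; the paths are invented and the variable names are only for illustration.

```shell
# Made-up openssl-style lines; the paths and digests are for illustration only.
sample='SHA1(./album/empty.mp3)= da39a3ee5e6b4b0d3255bfef95601890afd80709
SHA1(./live/my track.mp3)= a9993e364706816aba3e25717850c26c9cd0d89d
SHA1(./backup/empty.mp3)= da39a3ee5e6b4b0d3255bfef95601890afd80709'

# Step 1: copy the last field (the digest) to the front of each line.
# $NF is the last whitespace-separated field, so a space in a path is harmless.
keyed=$(printf '%s\n' "$sample" | awk '{ print $NF, $0 }')

# Step 2: sort so equal digests are adjacent, then keep only the lines whose
# first 40 characters (the digest) repeat. -D and -w assume GNU uniq.
dupes=$(printf '%s\n' "$keyed" | sort -k 1 | uniq -D -w 40)

printf '%s\n' "$dupes"
```

Only the two empty.mp3 lines should survive, since the third line's digest appears once.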
Took about 40 minutes to run on my >100GB library and correctly reported the duplicates. Not bad.
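For comparison, here is a sketch of a variant I did not use above. It assumes GNU coreutils: sha1sum prints the digest first, so the awk step goes away, and ending -exec with + hands find's results to sha1sum in batches instead of starting one process per file. The function name find_dupes and the '*.mp3' filter are my own additions for illustration.

```shell
# Hypothetical helper: list MP3s under directory $1 (default: current directory)
# whose SHA-1 matches that of at least one other MP3.
# Assumes GNU coreutils: sha1sum, and uniq with -D and -w.
find_dupes() {
    find "${1:-.}" -type f -name '*.mp3' -exec sha1sum {} + |
        sort |
        uniq -D -w 40
}
```

Calling find_dupes ~/Music (or any other directory) prints one line per duplicate, grouped by digest.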