Finding Duplicate Files
by Julius C. Duque
=======================
Using "find"
------------
find is a small but powerful utility that is available on all
UNIX/Linux systems. The following command, for example, tells find to
descend into /tmp (and recursively descend into all subdirectories it
encounters), and print to the standard output the names of all files
and subdirectories it finds.
find /tmp -name "*"
find's output is similar to this:
/tmp
/tmp/tex2pdf-root
/tmp/Gladman
/tmp/Gladman/sha2.c
/tmp/Gladman/uitypes.h
/tmp/Gladman/test.c
/tmp/Gladman/sha2.h
/tmp/Gladman/a.out
/tmp/guile-1.6.4
The best feature of find, just like any good UNIX tool, is that its
output can be redirected as input to another program. So, instead of
displaying find's output to the screen, you can use a pipe to give its
output to the next program for processing, as in
find /tmp -name "*" | ./sha
The "sha" Perl Script
---------------------
The output of the Perl script, sha, consists of a 40-character SHA-1
digest, followed by two single spaces, and lastly, followed by a
filename.
The output of the command
find /tmp -name "*" | ./sha
gives us something like:
912e2e1bea5c3d19393169c58009cd67b816c8eb /tmp/Gladman/sha2.c
4107f5678cb667ad2756d5dd3f4a27035301aa49 /tmp/Gladman/uitypes.h
1b11d21492d14f49be8d462607e307012570fa6c /tmp/Gladman/test.c
1b11d21492d14f49be8d462607e307012570fa6c /tmp/Gladman/sha2.h
b1b48c7339e998571e754383b0f50ab827b326c3 /tmp/Gladman/a.out
The "finddups" Perl Script
--------------------------
"finddups" produces an associative array, %dups, using the digests
computed by sha (the first 64 characters) as keys.
If finddups is called with the --verbose (or its short form, -v)
options switched on, the digests for duplicate files are printed as
well. For example,
find ./testdir -name "*" | ./sha | ./finddups --verbose
or
find ./testdir -name "*" | ./sha | ./finddups -v
produces something like:
330acbc4480be85ffc3b89a3e89dae74d2dd322eee9ca38a88cebac1f60a133a
./testdir/sha
./testdir/testfile2
71775835eb1b9a75a7065da28ef689e39a696da870dc95d939674c5ae6ce7a70
./testdir/finddups
./testdir/testfile1
Here, the files "sha" and "testfile2" are identical. Meanwhile,
"finddups" and "testfile1" are identical as well.
An extended discussion of this technique is featured in the December
2003 issue of The Perl Journal, written by Julius C. Duque.