NAME

File::Searcher::Similars - Fast similar-files finder

SYNOPSIS

  use File::Searcher::Similars;

  File::Searcher::Similars->init(0, \@ARGV);
  similarity_check_name();

Files with similar sizes and similar names are reported as likely duplicates.

Please note that this version is deprecated. Future versions are released as File::Find::Similars.

DESCRIPTION

An extremely fast file similarity checker. It uses an advanced soundex vector algorithm to determine the similarity between files. In general, if there are n files, each having approximately m words, the cost of the computation is merely

  O(n^2 * m)

which is hundreds of times faster than existing file fingerprinting technology.
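
The module's actual implementation is not reproduced in this document, so the following is only a minimal sketch of the general idea, under stated assumptions: each file name is split into words, each word is reduced to a simplified Soundex code, and two names are scored by the overlap of their code sets. The names soundex_code, name_vector and name_similarity are hypothetical, and the overlap score (shared codes divided by all distinct codes) is an assumption rather than the module's own measure. Comparing every pair of n names, at roughly m hash lookups per pair, is what gives the O(n^2 * m) cost.

```perl
use strict;
use warnings;

# Letter-to-digit table for a simplified Soundex.
my %DIGIT;
@DIGIT{qw(B F P V)}         = (1) x 4;
@DIGIT{qw(C G J K Q S X Z)} = (2) x 8;
@DIGIT{qw(D T)}             = (3) x 2;
@DIGIT{qw(L)}               = (4);
@DIGIT{qw(M N)}             = (5) x 2;
@DIGIT{qw(R)}               = (6);

# Hypothetical helper: first letter plus up to three digits, zero-padded.
sub soundex_code {
    my ($word) = @_;
    (my $w = uc $word) =~ tr/A-Z//cd;          # keep letters only
    return '' unless length $w;
    my @chars = split //, $w;
    my $code  = shift @chars;
    my $prev  = $DIGIT{$code} || 0;
    for my $c (@chars) {
        my $d = $DIGIT{$c} || 0;
        $code .= $d if $d && $d != $prev;      # skip vowels and repeated codes
        $prev = $d unless $c eq 'H' || $c eq 'W';
    }
    return substr($code . '000', 0, 4);
}

# Hypothetical helper: the set of codes for one file name.
sub name_vector {
    my ($name) = @_;
    $name =~ s/([a-z])([A-Z])/$1 $2/g;         # split camelCase words
    my %vec;
    for my $token (split /[^A-Za-z0-9]+/, $name) {
        next unless length $token;
        # Pure numbers (such as a year) are kept as they are.
        my $key = $token =~ /^[0-9]+$/ ? $token : soundex_code($token);
        $vec{$key} = 1 if length $key;
    }
    return \%vec;
}

# Assumed score: shared codes divided by all distinct codes (0 to 1).
sub name_similarity {
    my ($name_a, $name_b) = @_;
    my ($va, $vb) = (name_vector($name_a), name_vector($name_b));
    my $shared = grep { $vb->{$_} } keys %$va;
    my %union  = (%$va, %$vb);
    my $total  = keys %union;
    return $total ? $shared / $total : 0;
}

# Pairwise comparison: n*(n-1)/2 pairs, about m lookups each.
my @files = ('CardLayoutTest.java', 'TestLayout.java', 'PopupTest.java');
for my $i (0 .. $#files - 1) {
    for my $j ($i + 1 .. $#files) {
        printf "%.2f  %s  <->  %s\n",
            name_similarity($files[$i], $files[$j]), $files[$i], $files[$j];
    }
}
```

Because the comparison works on sets of sound codes rather than on file contents, no file has to be opened or read, which is where the speed advantage over content fingerprinting would come from.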

The self-test output will help you understand what the module does and what to expect from its results.

  $ make test
  PERL_DL_NONLAZY=1 /usr/bin/perl "-Iblib/lib" "-Iblib/arch" test.pl
  1..4
  # Running under perl version 5.010000 for linux
  # Current time local: Wed Oct 29 11:35:06 2008
  # Current time GMT:   Wed Oct 29 15:35:06 2008
  # Using Test.pm version 1.25
  # Testing File::Searcher::Similars version 1.23
  
  == Testing 1, files under test/ subdir:
  
    9 test/(eBook) GNU - Python Standard Library 2001.pdf
    3 test/CardLayoutTest.java
    5 test/GNU - 2001 - Python Standard Library.pdf
    4 test/GNU - Python Standard Library (2001).rar
    9 test/LayoutTest.java
    3 test/PopupTest.java
    2 test/Python Standard Library.zip
    5 test/TestLayout.java
  ok 1
  
  Note:
  
  - The fileSimilars.pl script will pick out similar files from them in next test.
  - Let's assume that the number represent the file size in KB.
  
  == Testing 2 result should be:
  
  ## =========
             3 'CardLayoutTest.java' 'test/'
             5 'TestLayout.java' 'test/'
  
  ## =========
             4 'GNU - Python Standard Library (2001).rar' 'test/'
             5 'GNU - 2001 - Python Standard Library.pdf' 'test/'
  ok 2
  
  Note:
  
  - There are 2 groups of similar files picked out by the script.
    The second group makes more sense.
  - The similar files are picked because their file names looks similar.
  - However, the file size plays an important role as well.
  - There are 2 files in the second similar files group.
  - The file 'Python Standard Library.zip' is not considered to be similar to
    the group because its size is not similar to the group.
  
  == Testing 3, if Python.zip is bigger, result should be:
  
  ## =========
             3 'CardLayoutTest.java' 'test/'
             5 'TestLayout.java' 'test/'
  
  ## =========
             4 'Python Standard Library.zip' 'test/'
             4 'GNU - Python Standard Library (2001).rar' 'test/'
             5 'GNU - 2001 - Python Standard Library.pdf' 'test/'
  ok 3
  
  Note:
  
  - There are 3 files in the second similar files group.
  - The file 'Python Standard Library.zip' is now in the 2nd similar files
    group because its size is now similar to the group.
  
  == Testing 4, if Python.zip is even bigger, result should be:
  
  ## =========
             3 'CardLayoutTest.java' 'test/'
             5 'TestLayout.java' 'test/'
  
  ## =========
             4 'GNU - Python Standard Library (2001).rar'       'test/'
             5 'GNU - 2001 - Python Standard Library.pdf'       'test/'
             6 'Python Standard Library.zip'                    'test/'
             9 '(eBook) GNU - Python Standard Library 2001.pdf' 'test/'
  ok 4
  
  Note:
  
  - There are 4 files in the second similar files group.
  - The file 'Python Standard Library.zip' is still in the group.
  - But this time, because it is also considered to be similar to the .pdf
    file (since their size are now similar, 6 vs 9), a 4th file the .pdf
    is now included in the 2nd group.
  - If the size of file 'Python Standard Library.zip' is 12(KB), then the
    second similar files group will be split into two. Do you know why and
    which files each group will contain?
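
The role that file size plays in tests 2 to 4 can be illustrated with a small sketch. It is not the module's code: it assumes the four Python-related files have already been judged similar by name, it uses a hypothetical rule that two sizes are similar when the smaller is at least 60% of the larger (a threshold chosen only because it reproduces the sample groups above), and it chains neighbours after sorting by size. The names sizes_similar and group_by_size are made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical rule: sizes are "similar" when the smaller one is at least
# 60% of the larger one. Integer arithmetic avoids floating-point edge cases.
sub sizes_similar {
    my ($x, $y) = @_;
    my ($small, $large) = $x <= $y ? ($x, $y) : ($y, $x);
    return $small * 10 >= $large * 6;
}

# Sort [size, name] pairs by size and start a new group whenever two
# neighbours are not size-similar. Groups with a single member are dropped.
sub group_by_size {
    my (@files) = @_;
    my (@groups, @current);
    for my $file (sort { $a->[0] <=> $b->[0] } @files) {
        if (@current && !sizes_similar($current[-1][0], $file->[0])) {
            push @groups, [@current] if @current > 1;
            @current = ();
        }
        push @current, $file;
    }
    push @groups, [@current] if @current > 1;
    return @groups;
}

my @python_files = (
    [4, 'GNU - Python Standard Library (2001).rar'],
    [5, 'GNU - 2001 - Python Standard Library.pdf'],
    [9, '(eBook) GNU - Python Standard Library 2001.pdf'],
    [6, 'Python Standard Library.zip'],    # the size used in test 4
);
for my $group (group_by_size(@python_files)) {
    print "## =========\n";
    printf "%12d '%s'\n", @$_ for @$group;
}
```

Changing the size of the .zip entry to 2 or 4 corresponds to tests 2 and 3. Because membership is decided by neighbouring sizes, one file of intermediate size can pull an otherwise separate file into the group.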

The File::Searcher::Similars package comes with a fully functional demo script fileSimilars.pl. Please refer to its help file for further explanations.

This package is highly customizable. Refer to the %config hash variable and/or the three arrwash_ functions for customization hints.

AUTHOR

 Author:  SUN, Tong <suntong at cpan.org>
 HomeURL: http://xpt.sourceforge.net/

SEE ALSO

File::Compare(3), perl(1) and the following scripts.

## File::Find::Duplicates - Find duplicate files

http://belfast.pm.org/Modules/Duplicates.html

  my %dupes = find_duplicate_files('/basedir1', '/basedir2');

When passed a base directory (or list of such directories) it returns a hash, keyed on filesize, of lists of the identical files of that size.

## ch::claudio::finddups - Find duplicate files in given directory

http://www.claudio.ch/Perl/finddups.html

ch::claudio::finddups is a script as well as a package. When called as a script, it searches the directory and its subdirectories for files with (possibly) identical content.

To find identical files quickly, the program records the Digest::SHA1 hash of each file and flags two files as equal if their hashes match. It outputs lines that can be fed to a Bourne shell to compare the two files and remove one of them if the comparison shows that they are indeed identical.
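
The digest-based approach can be sketched in a few lines. This is an illustration of the technique, not the code of finddups: it assumes the core Digest::SHA module (in SHA-1 mode) as a stand-in for Digest::SHA1, and the function name find_digest_duplicates is hypothetical. Matching digests only make files candidates; a byte-by-byte comparison is still needed before removing anything.

```perl
use strict;
use warnings;
use File::Find ();
use Digest::SHA ();

# Hypothetical helper: return a list of array refs, each holding the paths
# of files under @dirs whose SHA-1 digests match.
sub find_digest_duplicates {
    my (@dirs) = @_;
    return unless @dirs;
    my %by_digest;
    File::Find::find({
        no_chdir => 1,
        wanted   => sub {
            my $path = $File::Find::name;
            return unless -f $path;
            my $sha = Digest::SHA->new(1);    # 1 selects SHA-1
            $sha->addfile($path, 'b');        # binary mode; croaks if unreadable
            push @{ $by_digest{ $sha->hexdigest } }, $path;
        },
    }, @dirs);
    return grep { @$_ > 1 } values %by_digest;
}

# Usage: pass one or more directories on the command line.
for my $group (find_digest_duplicates(@ARGV)) {
    print join("\n  ", 'Same SHA-1 digest:', sort @$group), "\n";
}
```

Unlike the name-and-size heuristic of File::Searcher::Similars, this reads every file in full, so it trades speed for certainty about content.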

It can also be used as a package, which gives access to its variables, routines and methods.

## dupper.pl - finds duplicate files, optionally removes them

http://sial.org/code/perl/scripts/dupper.pl.html

Script to find (and optionally remove) duplicate files in one or more directories. Duplicates are spotted through the use of MD5 checksums.

COPYRIGHT

Copyright (c) 2001-2008 Tong SUN. All rights reserved.

TODO