NAME

minhash_cmp - uses MinHash & SpeedyFx to compare large text data

VERSION

version 0.010

SYNOPSIS

    minhash_cmp [options] FILE1 FILE2

DESCRIPTION

MinHash (or the min-wise independent permutations locality sensitive hashing scheme) is a technique for quickly estimating how similar two sets are.
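As a rough illustration of the technique (not the code minhash_cmp itself runs), the Python sketch below estimates the Jaccard similarity of two strings. It makes several assumptions: the names shingles, minhash_signature and estimate_similarity are hypothetical, the texts are split into character 3-grams, and salted BLAKE2b stands in for SpeedyFx as the hash family.

```python
import hashlib

def shingles(text, n=3):
    """Return the set of character n-grams of text (assumed tokenization)."""
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

def minhash_signature(items, k=400, seed=0):
    """For each of k salted hash functions, keep the minimum hash over the set."""
    signature = []
    for i in range(k):
        # A distinct 8-byte salt per function; BLAKE2b is a stand-in, not SpeedyFx.
        salt = (seed + i).to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return signature

def estimate_similarity(sig_a, sig_b):
    """Fraction of positions where the two signatures share the same minimum."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Two signatures agree at a given position with probability equal to the Jaccard similarity of the underlying sets, so the fraction of agreeing positions serves as the estimate; only the fixed-size signatures need to be compared, not the full sets.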

OPTIONS

--help

Show this help text.

--epsilon

Expected error value used to compute the number of different hash functions (default: 0.05).

--k

Number of different hash functions to use (default: 400; overrides --epsilon).
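The two defaults above are consistent with the usual MinHash error bound, where the expected error with k hash functions is on the order of 1/sqrt(k), so k is about 1/epsilon^2. The sketch below shows that arithmetic; the function name hashes_for_error is hypothetical, and whether minhash_cmp rounds the same way is an assumption.

```python
import math
from decimal import Decimal

def hashes_for_error(epsilon):
    """Number of hash functions for a target expected error (k = 1 / epsilon^2)."""
    # Decimal avoids binary floating-point noise when squaring values such as 0.05.
    eps = Decimal(str(epsilon))
    return math.ceil(Decimal(1) / (eps * eps))

print(hashes_for_error(0.05))  # 400, matching the documented default for --k
print(hashes_for_error(0.1))   # 100
```

Halving the error therefore costs roughly four times as many hash functions, which matters for memory use when each function carries its own lookup table.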

--seed

Custom seed (integer).

--bits

How many bits to use to represent one character. The default value, 8, sacrifices Unicode handling but is fast and has a low memory footprint. A value of 18 covers the Basic Multilingual, Supplementary Multilingual and Supplementary Ideographic planes (code points up to U+2FFFF).
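To see why 18 is the width that reaches the Supplementary Ideographic Plane, the following sketch checks which code points fit in a given number of bits. The helper names are hypothetical, and the sketch only tests the range; how minhash_cmp treats characters that fall outside the configured width is not stated here.

```python
def max_code_point(bits):
    """Largest code point representable with the given number of bits."""
    return (1 << bits) - 1

def fits(text, bits):
    """True if every character of text has a code point within the bit width."""
    limit = max_code_point(bits)
    return all(ord(ch) <= limit for ch in text)

print(hex(max_code_point(8)))   # 0xff: Latin-1 range only
print(hex(max_code_point(17)))  # 0x1ffff: planes 0 and 1
print(hex(max_code_point(18)))  # 0x3ffff: includes plane 2, which ends at U+2FFFF
```

Each extra bit doubles the number of distinct characters that can be told apart, which is the trade-off between Unicode coverage and memory that the option exposes.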

CAVEATS

With --bits=18, each initialized hash function consumes about 500 KB.

SEE ALSO

AUTHOR

Stanislaw Pusep <stas@sysd.org>

COPYRIGHT AND LICENSE

This software is copyright (c) 2013 by Stanislaw Pusep.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.