minhash_cmp - uses MinHash & SpeedyFx to compare large text data
minhash_cmp [options] FILE1 FILE2
MinHash (or the min-wise independent permutations locality sensitive hashing scheme) is a technique for quickly estimating how similar two sets are.
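As a rough illustration of the technique described above, here is a minimal Python sketch, not the code of minhash_cmp itself. It assumes whitespace-separated words as set elements and substitutes salted BLAKE2b from the standard library for SpeedyFx; the function names (tokens, signature, estimate, jaccard) are hypothetical.

```python
import hashlib

def tokens(text):
    """Split text into a set of lowercase, whitespace-separated words."""
    return set(text.lower().split())

def signature(items, k=400, seed=0):
    """Return the k-element MinHash signature of a non-empty set of strings.

    Each of the k hash functions is simulated by salting BLAKE2b with the
    seed and the function index (an assumption; the real tool uses SpeedyFx).
    """
    sig = []
    for i in range(k):
        salt = f"{seed}:{i}:".encode()
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(salt + s.encode(), digest_size=8).digest(),
                "big")
            for s in items))
    return sig

def estimate(sig_a, sig_b):
    """Fraction of positions where the two signatures hold the same minimum."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def jaccard(a, b):
    """Exact Jaccard similarity, for comparison with the estimate."""
    return len(a & b) / len(a | b)

if __name__ == "__main__":
    a = tokens("the quick brown fox jumps over the lazy dog")
    b = tokens("the quick brown cat jumps over the lazy dog")
    print("exact   :", jaccard(a, b))
    print("estimate:", estimate(signature(a), signature(b)))
```

The probability that two sets share the same minimum under a given hash function equals their Jaccard similarity, so the fraction of matching signature positions approximates it. Both signatures must be built with the same seed and the same k to be comparable.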
Expected error value used to compute the number of different hash functions (default: 0.05).
Number of different hash functions to use (default: 400; overrides
Custom seed (integer).
How many bits to use to represent one character. The default value, 8, sacrifices Unicode handling but is fast and has a low memory footprint. A value of 18 encompasses the Basic Multilingual, Supplementary Multilingual and Supplementary Ideographic planes.
bits=18 setting, each initialized hash function consumes ~500KB.
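
The two defaults listed above (an expected error of 0.05 and 400 hash functions) agree with the usual MinHash rule of thumb that the estimation error shrinks as 1/sqrt(k) for k hash functions. The sketch below shows that arithmetic only. Whether minhash_cmp uses exactly this formula is an assumption, and the function names are hypothetical.

```python
import math

def hashes_for_error(error):
    """Smallest k with 1/sqrt(k) <= error (rule of thumb, assumed here)."""
    # The tiny epsilon guards against floating-point noise pushing an
    # exact result such as 400 up to 401.
    return math.ceil(1.0 / error ** 2 - 1e-9)

def error_for_hashes(k):
    """Expected error for k hash functions under the same rule of thumb."""
    return 1.0 / math.sqrt(k)

if __name__ == "__main__":
    print(hashes_for_error(0.05))
    print(error_for_hashes(400))
```

Under this rule, halving the expected error quadruples the number of hash functions and therefore the memory and time spent on them.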
Stanislaw Pusep <email@example.com>
This software is copyright (c) 2013 by Stanislaw Pusep.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.