NAME

minhash_cmp - uses MinHash & SpeedyFx to compare large text data

VERSION

version 0.009

SYNOPSIS

    minhash_cmp [options] FILE1 FILE2

DESCRIPTION

MinHash (or the min-wise independent permutations locality sensitive hashing scheme) is a technique for quickly estimating how similar two sets are.
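The idea can be illustrated with a minimal Python sketch. It is not the implementation used by minhash_cmp: the word tokenizer, the seeded BLAKE2b hash (standing in for a family of hash functions) and all function names are assumptions made for illustration. Each set is reduced to a signature of k minimum hash values, and the fraction of positions where two signatures agree estimates the Jaccard similarity of the sets.

```python
import hashlib
import re


def tokens(text):
    """Split text into a set of lowercase word tokens (assumed tokenizer)."""
    return set(re.findall(r"\w+", text.lower()))


def hash_token(token, seed):
    """64-bit hash of a token; each seed acts as a different hash function."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def signature(token_set, k):
    """MinHash signature: the minimum hash value under each of k hashes.

    The token set must be non-empty.
    """
    return [min(hash_token(t, seed) for t in token_set) for seed in range(k)]


def estimate_similarity(sig_a, sig_b):
    """Fraction of matching positions, an estimate of Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def jaccard(a, b):
    """Exact Jaccard similarity, for comparison with the estimate."""
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    a = tokens("the quick brown fox jumps over the lazy dog")
    b = tokens("the quick brown fox leaps over the lazy cat")
    print("exact:   ", jaccard(a, b))
    print("estimate:", estimate_similarity(signature(a, 400), signature(b, 400)))
```

Signatures have a fixed length of k values regardless of input size, so comparing two large texts reduces to comparing two short lists.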

OPTIONS

--help

Show this help message.

--epsilon

Expected error value used to compute the number of different hash functions (default: 0.05).

--k

Number of different hash functions to use (default: 400; overrides --epsilon).

--seed

Custom seed (integer).

--bits

Number of bits used to represent one character. The default value, 8, sacrifices Unicode handling but is fast and has a low memory footprint. A value of 18 covers the Basic Multilingual, Supplementary Multilingual and Supplementary Ideographic planes.
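The defaults for --epsilon (0.05) and --k (400) are consistent with the usual MinHash error bound, where the expected error with k hash functions is on the order of 1/sqrt(k). The following sketch assumes that relationship; the exact formula used by minhash_cmp is not stated here, and the function name is hypothetical.

```python
import math


def hash_count_for_error(epsilon):
    """Smallest k with 1/sqrt(k) <= epsilon (assumed relationship)."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return math.ceil(1.0 / epsilon ** 2)


if __name__ == "__main__":
    for eps in (0.1, 0.05, 0.25):
        print(f"epsilon={eps}: k={hash_count_for_error(eps)}")
```

Under this assumption, halving the expected error quadruples the number of hash functions, and with it the time and memory needed.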

CAVEATS

With --bits=18, each initialized hash function consumes about 500 KB.
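To put the --bits values in perspective, the short Python sketch below computes how many Unicode code points fit into a given number of bits. It assumes that a setting of n bits distinguishes code points below 2^n; the function name is hypothetical, and the sketch says nothing about how the tool lays out its tables.

```python
PLANE_SIZE = 0x10000  # code points per Unicode plane


def code_point_coverage(bits):
    """Return (code point count, highest code point, number of whole planes)."""
    count = 1 << bits
    return count, count - 1, count // PLANE_SIZE


if __name__ == "__main__":
    for bits in (8, 18):
        count, highest, planes = code_point_coverage(bits)
        print(f"bits={bits}: {count} code points, "
              f"up to U+{highest:04X}, {planes} whole plane(s)")
```

With 18 bits the range spans four planes of 65,536 code points each (planes 0 to 3), while 8 bits covers only the first 256 code points.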

SEE ALSO

AUTHOR

Stanislaw Pusep <stas@sysd.org>

COPYRIGHT AND LICENSE

This software is copyright (c) 2013 by Stanislaw Pusep.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.