NAME

Text::SpeedyFx - tokenize/hash large numbers of strings efficiently

VERSION

version 0.011

SYNOPSIS

    use Data::Dumper;
    use Text::SpeedyFx;

    my $sfx = Text::SpeedyFx->new;

    my $words_bag = $sfx->hash('To be or not to be?');
    print Dumper $words_bag;
    #$VAR1 = {
    #          '1422534433' => '1',
    #          '4120516737' => '2',
    #          '1439817409' => '2',
    #          '3087870273' => '1'
    #        };

    my $feature_vector = $sfx->hash_fv("thats the question", 8);
    print unpack('b*', $feature_vector);
    # 01001000

DESCRIPTION

An XS implementation of a very fast combined parser/hasher which works well on a variety of bag-of-words problems.

The original implementation is in Java; this version was adapted for better Unicode compliance.

METHODS

new([$seed, $bits])

Initializes the parser/hasher. It can be customized with the following options:

$seed

Hash seed (default: 1).

$bits

How many bits represent one character. The default value, 8, sacrifices Unicode handling but is fast and has a low memory footprint. A value of 18 covers the Basic Multilingual, Supplementary Multilingual and Supplementary Ideographic planes. See also "UNICODE SUPPORT".

hash($octets)

Parses $octets and returns a hash reference (not exactly; see "CAVEAT") whose keys are the hashed tokens and whose values are their respective counts. $octets is assumed to hold a UTF-8 string unless Text::SpeedyFx is instantiated with "$bits" == 8 (which forces Latin-1 mode; see "UNICODE SUPPORT"). Note that this is the slowest form due to the (computational) complexity of the associative array data structure itself: the hash_fv()/hash_min() variants are up to 260% faster!
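
The result can be read like an ordinary hash reference. As a minimal sketch, the snippet below uses a plain hash copied from the SYNOPSIS output as a stand-in for the value returned by hash(), so it runs without the module installed; the variable names are illustrative only.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Stand-in for: my $words_bag = Text::SpeedyFx->new->hash('To be or not to be?');
# Keys and counts are copied from the SYNOPSIS output.
my $words_bag = {
    '1422534433' => 1,
    '4120516737' => 2,
    '1439817409' => 2,
    '3087870273' => 1,
};

my $distinct = keys %$words_bag;           # number of distinct hashed tokens
my $total    = sum(values %$words_bag);    # number of tokens in the input

printf "%d distinct tokens, %d tokens in total\n", $distinct, $total;
```

The six words of the input collapse into four distinct hashed tokens because "to" and "be" each occur twice.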

hash_fv($octets, $n)

Parses $octets and returns a feature vector (string of bits) with length $n. $n is supposed to be a multiple of 8, as the length of the resulting feature vector in bytes is ceil($n / 8). See the included utilities cosine_cmp and uniq_wc.
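
Two such bit vectors can be compared with a cosine similarity. The following is a sketch under stated assumptions, not necessarily how the bundled cosine_cmp utility works: the helper names popcount and cosine_bits are hypothetical, and the vectors are built with pack() as stand-ins for real hash_fv() results.

```perl
use strict;
use warnings;

# Count the bits set in a packed bit string (unpack's checksum idiom).
sub popcount { return unpack('%32b*', $_[0]) }

# Cosine similarity between two equal-length binary feature vectors:
# shared set bits divided by the geometric mean of the set-bit counts.
sub cosine_bits {
    my ($x, $y) = @_;
    my $norm = sqrt(popcount($x) * popcount($y));
    return $norm ? popcount($x & $y) / $norm : 0;
}

# Stand-ins for: my $fv = $sfx->hash_fv($text, 8);
my $fv_a = pack('b*', '01001000');    # the vector shown in the SYNOPSIS
my $fv_b = pack('b*', '01000001');    # hypothetical second vector

printf "%.3f\n", cosine_bits($fv_a, $fv_b);
```

Both operands of `&` are strings here, so Perl performs a bytewise AND over the packed vectors rather than a numeric one.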

hash_min($octets)

Parses $octets and returns the hash with the lowest value. Useful in MinHash implementations. See also the included minhash_cmp utility.
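
A MinHash comparison typically collects one minimum per hash function and counts how often two documents agree. The sketch below illustrates only that comparison step. It assumes that instances created with different $seed values act as independent hash functions, which this document does not state, and it is not taken from the minhash_cmp utility. The function name minhash_similarity and the sample numbers are hypothetical.

```perl
use strict;
use warnings;

# Estimate the Jaccard similarity as the share of positions at which
# two MinHash signatures hold the same minimum.
sub minhash_similarity {
    my ($sig_a, $sig_b) = @_;
    die "signatures differ in length\n" unless @$sig_a == @$sig_b;
    return 0 unless @$sig_a;
    my $same = grep { $sig_a->[$_] == $sig_b->[$_] } 0 .. $#$sig_a;
    return $same / @$sig_a;
}

# With the module installed, a signature might be built like this
# (one hasher per seed; the seed range 1..4 is an arbitrary choice):
#   for my $seed (1 .. 4) {
#       my ($min) = Text::SpeedyFx->new($seed)->hash_min($text);
#       push @sig, $min;
#   }
# The numbers below are made-up placeholders, not real hash values.
my @sig_a = (17, 42, 8, 23);
my @sig_b = (17, 99, 8, 5);

printf "%.2f\n", minhash_similarity(\@sig_a, \@sig_b);
```

Longer signatures give a steadier estimate at the cost of one hash_min() pass per seed.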

UNICODE SUPPORT

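Due to the nature of Perl, Unicode support is handled differently from the original implementation. With "$bits" set to 18, Text::SpeedyFx recognizes UTF-8 encoded code points in the range 00000-2FFFF, that is, the Basic Multilingual Plane (Plane 0), the Supplementary Multilingual Plane (Plane 1) and the Supplementary Ideographic Plane (Plane 2).

There is a major drawback, however: in this mode, each instance allocates up to 1 MB of memory.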

If the application doesn't need to support code points beyond Plane 0 (like the original SpeedyFx implementation), it is possible to constrain the address space to 16 bits, which lowers the memory allocation to at most 256 KB. In fact, the Text::SpeedyFx constructor accepts any bit width between 8 and 18 to address code points.

LATIN-1 SUPPORT

The 8-bit address space has one special meaning: it completely disables multibyte support. In 8-bit mode, each instance will only allocate 256 bytes and hashing will run up to 340% faster! Tokenization will fall back to the ISO 8859-1 (Latin-1) character definitions for Western European languages.

BENCHMARK

The test platform configuration:

                       Rate murmur_utf8 hash_utf8 hash_min_utf8  hash hash_fv hash_min
    murmur_utf8      6 MB/s          --      -79%          -86%  -89%    -97%     -97%
    hash_utf8       30 MB/s        376%        --          -35%  -47%    -84%     -85%
    hash_min_utf8   47 MB/s        637%       55%            --  -18%    -76%     -77%
    hash            58 MB/s        803%       90%           23%    --    -70%     -72%
    hash_fv        194 MB/s       2946%      541%          313%  237%      --      -6%
    hash_min       206 MB/s       3143%      582%          340%  259%      6%       --

All the tests except the ones with the _utf8 suffix were run in Latin-1 mode. For comparison, murmur_utf8 was implemented using the Digest::MurmurHash hasher and a native regular expression tokenizer:

    ++$fv->{murmur_hash(lc $1)}
        while $data =~ /(\w+)/gx;

See also the eg/benchmark.pl script.

CAVEAT

For performance reasons, the hash() method returns a tied hash which is an interface to nedtries. An interesting property of the trie data structure is that the keys are "nearly sorted" (and the first key is guaranteed to be the lowest), so:

    # This:
    $fv = $sfx->hash($data);
    ($min) = each %$fv;
    # Is the same as this:
    ($min) = $sfx->hash_min($data);
    # (albeit the latter being 2x faster)

The downsides are the magic involved, delete breaking the key order, and the memory usage. The hardcoded limit is 524288 unique keys per result, which consumes ~25MB of RAM on a 64-bit architecture. Exceeding it will croak with the message "too many unique tokens in a single data chunk". The only way to raise this limit is to recompile the XS module:

    perl Makefile.PL DEFINE=-DMAX_TRIE_SIZE=2097152
    make
    make test
    make install
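
Before raising the limit, it may help to estimate the memory cost. The sketch below is a back-of-the-envelope calculation that assumes memory grows linearly with the number of keys; that assumption is not confirmed by this document, and the real footprint may differ.

```perl
use strict;
use warnings;

my $default_keys = 524288;     # hardcoded limit stated above
my $default_mb   = 25;         # approximate usage stated above (64-bit)
my $new_keys     = 2097152;    # MAX_TRIE_SIZE from the build command above

# Assumption: memory grows linearly with the number of keys.
my $bytes_per_key = $default_mb * 1024 * 1024 / $default_keys;
my $new_mb        = $default_mb * $new_keys / $default_keys;

printf "~%d bytes per key, ~%d MB at the raised limit\n",
    $bytes_per_key, $new_mb;
```

Under that assumption, quadrupling the key limit quadruples the per-result memory as well.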

AUTHOR

Stanislaw Pusep <stas@sysd.org>

COPYRIGHT AND LICENSE

This software is copyright (c) 2014 by Stanislaw Pusep.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.