Text::Bloom - Evaluate Bloom signature of a set of terms
my $b = Text::Bloom->new(); $b->Compute( qw( foo bar baz ) ); my $sig = $b->WriteToString(); $b->WriteToFile( 'afile.sig' ); my $b2 = Text::Bloom::NewFromFile( 'afile.sig' ); my $b3 = Text::Bloom->new(); $b3->Compute( qw( foo bar barbaz ) ); my $sim = $b->Similarity( $b2 ); my $b4 = Text::Bloom::NewFromString( $sig );
Text::Bloom applies the Bloom filtering technique to the statistical analysis of documents.
Text::Bloom
The terms in the document are quantized using a base-36 radix representation; each term thus corresponds to an integer in the range 0..p-1, where p is a prime, currently set to the greatest prime less than 2^32.
Each quantized value is mapped to d integers in the range 0..size-1, where size is an integer less than p, currently 2^17, using a family of hash functions, computed by the HashV function.
HashV
Each hashed value is used as the index in a large bit vector. Bits corresponding to terms present in the document are set to 1; all other bits are set to 0.
Of course, collisions may cause the same bit to be set twice, by different terms. It follows that, if the document contains n distinct terms, in the resulting bit vector at most n * d bits are set to 1.
The resulting bit string is a very compact representation of the presence/absence of terms in the document, and is therefore characterised as a signature. Moreover, it does not depend on a pre-set dictionary of terms.
The signature may be used for:
testing whether a given set of terms is present in the document,
computing which fraction of terms are common to two documents.
The bit representation may be written to and read from a file. Text::Bloom prepends a header to the bit stream proper; moreover, whenever the package Compress::Zlib is available, the bit vector is compressed, so that disk space requirements are drastically reduced, especially for small documents.
Compress::Zlib
The hash function is obviously a crucial component of the filter; the reference implementation uses a radix representation of strings. Each term must therefore match the regular expression /[0-9a-z]+/.
/[0-9a-z]+/
There are quite a few viable alternatives, which can be pursued by subclassing and redefining the method QuantizeV.
QuantizeV
The package may be {re}used either by simple instantiation, or by subclassing (defining a descendant package). In the latter case the methods which are foreseen to be redefined are those ending with a V suffix. Redefining other methods will require greater attention.
V
The constructor. No arguments are required.
$b = Text::Bloom->new();
Take a string written by WriteToString (see below) and create a new Text::Bloom with the same contents; call die whenever the restore is impossible or ill-advised, for instance when the current version of the package is different from the original one, or the compression library in unavailable.
WriteToString
die
my $b = Text::Bloom::NewFromString( $str );
The return value is a blessed reference; put in another way, this is an alternative contructor.
The string should have been written by WriteToString; you may of course tweak the string contents, but at this point you're entirely on you own.
Utility function that reads a binary file and performs a NewFromString on its content; see its counterpart, WriteToFile.
NewFromString
WriteToFile
my $b2 = Text::Document::NewFromFile( 'foo.sig' );
Set and get the size of the filter, in bits. The default size is currently 128K.
print 'size is ' . $b->Size() . "\n"; $b->Size( 65536 );
The Size method must be called before the Compute method in order to have effect.
Size
Compute
Compute the Bloom signature from the given set of words and store it internally.
$b->Compute( qw( foo bar baz foobar bazbaz ) );
Makes use of the QuantizeV method.
Convert a term into an integer; must return an integer in the range 0 .. $Text::Bloom::p-1.
$Text::Bloom::p-1
It is called as
my $hash = $b->QuantizeV( $term );
The current version is designed for strings matching /[a-z0-9]+/. Other characters do not cause errors, but degrade the hash function performance.
/[a-z0-9]+/
This function is a likely candidate for redefinition.
Convert an integer to a (smaller) integer, according to one of a class of similar functions.
It is internally called as:
my $index = $b->HashV( $order, $value );
The $value must belong to the interval 0..$Text::Bloom::p-1, while the index must lie in 0..size-1. $order is a small integer from 0 to d-1.
$value
$order
The default implementation is
index = m[order] * value + q[order] (mod size)
the values of m and q are taken from the array @Text::Bloom::hashParam; the form of the function is taken from [2].
@Text::Bloom::hashParam
Convert the Bloom signature into a string which can be saved and later restored with NewFromString. Compute must have been called previously.
my $str = $b->WriteToString();
The string begins with a header which encodes the originating package, its version, the parameters of the current instance.
Whenever possible, Compress::Zlib is used in order to compress the bit vector in the most efficient way. On systems without Compress::Zlib, the bit string is saved uncompressed.
These convenience functions just call their String counterparts and read/write the file specified in the argument.
$b->WriteToFile( 'foo.sig' );
spinellia@acm.org (Andrea Spinelli) walter@humans.net (Walter Vannini)
Burton H. Bloom, "Space/time trade-offs in hash coding with allowable errors", Communications of the ACM, 13, 7, July 1970, pages 422-426. (available electronically from ACM Digital Library).
M. V. Ramakrishna, "Practical Performance of Bloom FIlters and Parallel Free-Text Searching", Communications of the ACM, 32, 10, October 1989, pages 1237-1239. (available electronically from ACM Digital Library).
On Win32 we have experienced some instabilities when dealing with a large number of signatures; in this case Perl crashes without apparent explanation. The main suspect is Bit::Vector, but without any evidence.
2001-11-02 - initial revision
To install Text::Document, copy and paste the appropriate command in to your terminal.
cpanm
cpanm Text::Document
CPAN shell
perl -MCPAN -e shell install Text::Document
For more information on module installation, please visit the detailed CPAN module installation guide.