NAME

  Text::Bloom - Evaluate Bloom signature of a set of terms

SYNOPSIS

  my $b = Text::Bloom->new();
  $b->Compute( qw( foo bar baz ) );
  my $sig = $b->WriteToString();
  $b->WriteToFile( 'afile.sig' );
  my $b2 = Text::Bloom::NewFromFile( 'afile.sig' );
  my $b3 = Text::Bloom->new();
  $b3->Compute( qw( foo bar barbaz ) );
  my $sim = $b->Similarity( $b2 );
  my $b4 = Text::Bloom::NewFromString( $sig );

DESCRIPTION

Text::Bloom applies the Bloom filtering technique to the statistical analysis of documents.

The terms in the document are quantized using a base-36 radix representation; each term thus corresponds to an integer in the range 0..p-1, where p is a prime, currently set to the greatest prime less than 2^32.

Each quantized value is mapped to d integers in the range 0..size-1, where size is an integer less than p, currently 2^17, using a family of hash functions, computed by the HashV function.

Each hashed value is used as the index in a large bit vector. Bits corresponding to terms present in the document are set to 1; all other bits are set to 0.

Of course, collisions may cause the same bit to be set twice, by different terms. It follows that, if the document contains n distinct terms, in the resulting bit vector at most n * d bits are set to 1.

The resulting bit string is a very compact representation of the presence/absence of terms in the document, and is therefore characterised as a signature. Moreover, it does not depend on a pre-set dictionary of terms.

The signature may be used for:

  • testing whether a given set of terms is present in the document,

  • computing which fraction of terms are common to two documents.

The bit representation may be written to and read from a file. Text::Bloom prepends a header to the bit stream proper; moreover, whenever the package Compress::Zlib is available, the bit vector is compressed, so that disk space requirements are drastically reduced, especially for small documents.

The hash function is obviously a crucial component of the filter; the reference implementation uses a radix representation of strings. Each term must therefore match the regular expression /[0-9a-z]+/.

There are quite a few viable alternatives, which can be pursued by subclassing and redefining the method QuantizeV.

FORESEEN REUSE

The package may be {re}used either by simple instantiation, or by subclassing (defining a descendant package). In the latter case the methods which are foreseen to be redefined are those ending with a V suffix. Redefining other methods will require greater attention.

CLASS METHODS

new

The constructor. No arguments are required.

  $b = Text::Bloom->new();

NewFromString

Take a string written by WriteToString (see below) and create a new Text::Bloom with the same contents; call die whenever the restore is impossible or ill-advised, for instance when the current version of the package is different from the original one, or the compression library in unavailable.

  my $b = Text::Bloom::NewFromString( $str );

The return value is a blessed reference; put in another way, this is an alternative contructor.

The string should have been written by WriteToString; you may of course tweak the string contents, but at this point you're entirely on you own.

NewFromFile

Utility function that reads a binary file and performs a NewFromString on its content; see its counterpart, WriteToFile.

  my $b2 = Text::Document::NewFromFile( 'foo.sig' );

INSTANCE METHODS

Size

Set and get the size of the filter, in bits. The default size is currently 128K.

  print 'size is ' . $b->Size() . "\n";
  $b->Size( 65536 );

The Size method must be called before the Compute method in order to have effect.

Compute

Compute the Bloom signature from the given set of words and store it internally.

  $b->Compute( qw( foo bar baz foobar bazbaz ) );

Makes use of the QuantizeV method.

QuantizeV

Convert a term into an integer; must return an integer in the range 0 .. $Text::Bloom::p-1.

It is called as

  my $hash = $b->QuantizeV( $term );

The current version is designed for strings matching /[a-z0-9]+/. Other characters do not cause errors, but degrade the hash function performance.

This function is a likely candidate for redefinition.

HashV

Convert an integer to a (smaller) integer, according to one of a class of similar functions.

It is internally called as:

  my $index = $b->HashV( $order, $value );

The $value must belong to the interval 0..$Text::Bloom::p-1, while the index must lie in 0..size-1. $order is a small integer from 0 to d-1.

The default implementation is

  index = m[order] * value + q[order]   (mod size) 

the values of m and q are taken from the array @Text::Bloom::hashParam; the form of the function is taken from [2].

WriteToString

Convert the Bloom signature into a string which can be saved and later restored with NewFromString. Compute must have been called previously.

  my $str = $b->WriteToString();

The string begins with a header which encodes the originating package, its version, the parameters of the current instance.

Whenever possible, Compress::Zlib is used in order to compress the bit vector in the most efficient way. On systems without Compress::Zlib, the bit string is saved uncompressed.

WriteToFile

These convenience functions just call their String counterparts and read/write the file specified in the argument.

  $b->WriteToFile( 'foo.sig' );

AUTHORS

  spinellia@acm.org (Andrea Spinelli)
  walter@humans.net (Walter Vannini)

BIBLIOGRAPHY

[1]

Burton H. Bloom, "Space/time trade-offs in hash coding with allowable errors", Communications of the ACM, 13, 7, July 1970, pages 422-426. (available electronically from ACM Digital Library).

[2]

M. V. Ramakrishna, "Practical Performance of Bloom FIlters and Parallel Free-Text Searching", Communications of the ACM, 32, 10, October 1989, pages 1237-1239. (available electronically from ACM Digital Library).

BUGS

On Win32 we have experienced some instabilities when dealing with a large number of signatures; in this case Perl crashes without apparent explanation. The main suspect is Bit::Vector, but without any evidence.

HISTORY

  2001-11-02 - initial revision