Text::Document

Distribution: Text-Document-1.07
Module Version: 1.05
Text::Document - a text document subject to statistical analysis

    my $t = Text::Document->new();
    $t->AddContent( 'foo bar baz' );
    $t->AddContent( 'foo barbaz; ' );
    my $freqList = $t->KeywordFrequency();   # reference to a list of [term,frequency] pairs

    my $u = Text::Document->new();
    ...
    my $sj = $t->JaccardSimilarity( $u );
    my $sc = $t->CosineSimilarity( $u );
    my $wsc = $t->WeightedCosineSimilarity( $u, \&MyWeight, $rock );

`Text::Document` lets you compute simple Information-Retrieval-oriented statistics on plain-text documents.

Text can be added in chunks, so that the document may be built incrementally, for instance by a class like `HTML::Parser`.

A simple algorithm splits the text into terms; the algorithm may be changed by subclassing and redefining `ScanV`.

The `KeywordFrequency` function computes term frequency over the whole document.

The package may be (re)used either by simple instantiation or by subclassing (defining a descendant package). In the latter case, the methods intended to be redefined are those ending with a `V` suffix. Redefining other methods requires greater care.

`new` is the creator method. The optional arguments are given in *(key,value)* form and specify whether all keywords are transformed to lowercase (default) and whether the string representation (`WriteToString`) is compressed (default).

    my $d = Text::Document->new();
    my $dNotCompressed = Text::Document->new( compressed => 0 );
    my $dPreserveCase = Text::Document->new( lowercase => 0 );

`NewFromString` takes a string written by `WriteToString` (see below) and creates a new `Text::Document` with the same contents. It calls `die` whenever the restore is impossible or ill-advised, for instance when the current version of the package differs from the original one, or the compression library is unavailable.

my $b = Text::Document::NewFromString( $str );

The return value is a blessed reference; put another way, this is an alternative constructor.

The string should have been written by `WriteToString`; you may of course tweak the string contents, but at that point you are entirely on your own.

`AddContent` adds a chunk of text to the document. It is used as:

    $d->AddContent( 'foo bar baz foo9' );
    $d->AddContent( 'mary had a little lamb' );

Successive calls accumulate content; there is currently no way of resetting the content to zero.

Returns a list of all distinct terms in the document, in no particular order.

`Occurrences` returns the number of occurrences of a given term.

    $d->AddContent( 'foo baz bar foo foo' );
    my $n = $d->Occurrences( 'foo' );
    # now $n is 3

`ScanV` scans a string and returns a list of terms.

Called internally as:

my @terms = $self->ScanV( $text );
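As an illustration of what a redefined `ScanV` might look like, the sketch below treats every run of ASCII letters or digits as a term. The package name `My::WordDocument` and the splitting rule are assumptions made for this example; the module's own rule is not spelled out here. The package is kept standalone so that the snippet runs as-is; in real use it would inherit from `Text::Document`.

```perl
use strict;
use warnings;

# Hypothetical package. In real use it would declare
#   use parent 'Text::Document';
# so that the inherited methods pick up this ScanV.
package My::WordDocument;

# Return the terms found in $text: maximal runs of ASCII letters or digits.
# This rule is an assumption of the sketch; it does not change case.
sub ScanV {
    my ( $self, $text ) = @_;
    return $text =~ /([A-Za-z0-9]+)/g;
}

package main;

my @terms = My::WordDocument->ScanV('mary had a little lamb; foo9!');
print join( ',', @terms ), "\n";    # prints mary,had,a,little,lamb,foo9
```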

`KeywordFrequency` returns a reference to a list of pairs *[term,frequency]*, sorted by ascending frequency.

    my $listRef = $d->KeywordFrequency();
    foreach my $pair (@{$listRef}){
        my ($term,$frequency) = @{$pair};
        ...
    }

The terms in the document are collected and sorted by ascending frequency of occurrence; the resulting list is then returned to the caller.
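The counting and sorting described here can be sketched in plain Perl as follows. The function name `keyword_frequency` is hypothetical, raw occurrence counts stand in for the frequency, and ties are broken alphabetically; all three are assumptions of this sketch rather than documented behaviour of the module.

```perl
use strict;
use warnings;

# Count occurrences of each term and return a reference to a list of
# [term, count] pairs sorted by ascending count.
# Ties are broken alphabetically; that is a choice of this sketch.
sub keyword_frequency {
    my (@terms) = @_;
    my %count;
    $count{$_}++ for @terms;
    return [
        map  { [ $_, $count{$_} ] }
        sort { $count{$a} <=> $count{$b} or $a cmp $b }
        keys %count
    ];
}

my $pairs = keyword_frequency(qw(foo baz bar foo foo baz));
printf "%s: %d\n", @{$_} for @{$pairs};
# prints:
#   bar: 1
#   baz: 2
#   foo: 3
```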

`WriteToString` converts the document (actually, some parameters and the term counters) into a string which can be saved and later restored with `NewFromString`.

my $str = $d->WriteToString();

The string begins with a header which encodes the originating package, its version, and the parameters of the current instance.

Whenever possible, `Compress::Zlib` is used to compress the bit vector in the most efficient way. On systems without `Compress::Zlib`, the bit string is saved uncompressed.
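The general idea of a versioned header followed by an optionally compressed payload can be sketched as below. Everything about the layout is an assumption of this example: the header fields, the tab-separated text payload (used here instead of a bit vector), and the names `write_to_string`, `new_from_string` and `My::Sketch`. It is not the module's actual format. Only `compress` and `uncompress` from `Compress::Zlib` are real library calls.

```perl
use strict;
use warnings;

# Load Compress::Zlib if it is installed; fall back to plain text otherwise.
my $HAVE_ZLIB = eval { require Compress::Zlib; 1 } ? 1 : 0;

# Hypothetical format: one header line "package version compressed-flag",
# then the term counts as "term<TAB>count" lines, optionally compressed.
# Terms are assumed to contain no whitespace.
sub write_to_string {
    my ( $counts, $want_compress ) = @_;
    my $body = join '', map { "$_\t$counts->{$_}\n" } sort keys %{$counts};
    my $compressed = ( $want_compress && $HAVE_ZLIB ) ? 1 : 0;
    $body = Compress::Zlib::compress($body) if $compressed;
    return "My::Sketch 0.01 $compressed\n" . $body;
}

sub new_from_string {
    my ($str) = @_;
    # Split on the first newline only: the payload may contain newlines.
    my ( $header, $body ) = split /\n/, $str, 2;
    my ( $package, $version, $compressed ) = split / /, $header;
    die "written by a different package or version\n"
        unless $package eq 'My::Sketch' && $version eq '0.01';
    if ($compressed) {
        die "Compress::Zlib is not available\n" unless $HAVE_ZLIB;
        $body = Compress::Zlib::uncompress($body);
    }
    return { map { my ( $term, $n ) = split /\t/; ( $term => $n ) } split /\n/, $body };
}

my $copy = new_from_string( write_to_string( { foo => 3, bar => 1 }, 1 ) );
print "$_ => $copy->{$_}\n" for sort keys %{$copy};
```

Refusing to restore when the header does not match mirrors the `die` behaviour described for `NewFromString`.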

`JaccardSimilarity` computes the Jaccard measure of document similarity, which is defined as follows: given two documents *D* and *E*, let *Ds* and *Es* be the sets of terms occurring in *D* and *E*, respectively. Define *S* as the intersection of *Ds* and *Es*, and *T* as their union. Then the Jaccard similarity is the number of elements of *S* divided by the number of elements of *T*.

It is called as follows:

my $sim = $d->JaccardSimilarity( $e );

If neither document has any terms, the result is `undef` (a rare occurrence). Otherwise the similarity is a real number between 0.0 (no terms in common) and 1.0 (all terms in common).
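The definition translates almost directly into plain Perl. In the sketch below each document is represented as a hash reference mapping terms to occurrence counts; that representation and the function name `jaccard_similarity` are assumptions of the example, not the module's internals.

```perl
use strict;
use warnings;

# Jaccard similarity of two documents given as hash references that map
# each term to its number of occurrences. Only the key sets matter.
sub jaccard_similarity {
    my ( $d, $e ) = @_;
    my %union = map { ( $_ => 1 ) } keys %{$d}, keys %{$e};
    my $t = keys %union;                            # |T|, size of the union
    return undef if $t == 0;                        # neither document has terms
    my $s = grep { exists $e->{$_} } keys %{$d};    # |S|, size of the intersection
    return $s / $t;
}

my %d = ( foo => 3, bar => 1, baz => 1 );
my %e = ( foo => 1, qux => 2 );
print jaccard_similarity( \%d, \%e ), "\n";    # 1 shared term out of 4: prints 0.25
```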

`CosineSimilarity` computes the cosine similarity between two documents *D* and *E*.

Let *Ds* and *Es* be the sets of terms occurring in *D* and *E*, respectively. Define *T* as the union of *Ds* and *Es*, and let *ti* be the *i*-th element of *T*.

Then the term vectors of *D* and *E* are

    Dv = (nD(t1), nD(t2), ..., nD(tN))
    Ev = (nE(t1), nE(t2), ..., nE(tN))

where nD(ti) is the number of occurrences of term ti in *D*, and nE(ti) the same for *E*.

Now we are at last ready to define the cosine similarity *CS*:

CS = (Dv,Ev) / (Norm(Dv)*Norm(Ev))

Here (... , ...) is the scalar product and Norm is the Euclidean norm (square root of the sum of squares).

`CosineSimilarity` is called as

$sim = $d->CosineSimilarity( $e );

The result is `undef` if either *D* or *E* has no occurrence of any term. Otherwise, it is a number between 0.0 and 1.0: since term occurrences are never negative, the cosine is never negative either.
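A minimal plain-Perl sketch of the formula follows, again representing each document as a hash reference of term counts. The representation and the name `cosine_similarity` are assumptions of this example.

```perl
use strict;
use warnings;

# Cosine similarity CS = (Dv,Ev) / (Norm(Dv)*Norm(Ev)) for two documents
# given as hash references mapping each term to its occurrence count.
sub cosine_similarity {
    my ( $d, $e ) = @_;
    my ( $dot, $norm_d, $norm_e ) = ( 0, 0, 0 );

    # Terms absent from one document contribute 0 to the scalar product,
    # so it is enough to walk the terms of $d.
    for my $term ( keys %{$d} ) {
        $dot += $d->{$term} * ( $e->{$term} // 0 );
    }
    $norm_d += $_**2 for values %{$d};    # squared Euclidean norm of Dv
    $norm_e += $_**2 for values %{$e};    # squared Euclidean norm of Ev

    return undef if $norm_d == 0 || $norm_e == 0;    # a document without terms
    return $dot / sqrt( $norm_d * $norm_e );
}

my %d = ( foo => 3, bar => 1 );
my %e = ( foo => 1, baz => 2 );
printf "%.4f\n", cosine_similarity( \%d, \%e );    # 3 / sqrt(10 * 5): prints 0.4243
```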

`WeightedCosineSimilarity` computes the weighted cosine similarity between two documents *D* and *E*.

In the setting of `CosineSimilarity`, the term vectors of *D* and *E* are

    Dv = (nD(t1)*w1, nD(t2)*w2, ..., nD(tN)*wN)
    Ev = (nE(t1)*w1, nE(t2)*w2, ..., nE(tN)*wN)

The weights are nonnegative real values; each term has an associated weight. For generality, weights are supplied through a function, as in:

my $wcs = $d->WeightedCosineSimilarity( $e, \&function, $rock );

The `function` will be called as follows:

my $weight = function( $rock, 'foo' );

`$rock` is a 'constant' object used for passing a *context* to the function.

For instance, a common way of defining weights is the IDF (inverse document frequency), which is defined in Text::DocumentCollection. In this context, you can weigh terms with their IDF as follows:

$sim = $c->WeightedCosineSimilarity( $d, \&Text::DocumentCollection::IDF, $collection );

`WeightedCosineSimilarity` will call

$collection->IDF( 'foo' );

which is what we expect.

Actually, we should return the square root of IDF, but this detail is not necessary here.
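To make the weighting mechanism concrete, here is a plain-Perl sketch that applies a weight callback using the `function( $rock, $term )` calling convention described above. The names `weighted_cosine_similarity` and `table_weight`, the hash-reference document representation, and returning `undef` for an all-zero weighted vector are assumptions of this example. A simple lookup table stands in for the IDF.

```perl
use strict;
use warnings;

# Weighted cosine similarity: each term count is multiplied by the weight
# returned by $weight->( $rock, $term ) before the cosine is taken.
sub weighted_cosine_similarity {
    my ( $d, $e, $weight, $rock ) = @_;
    my ( $dot, $norm_d, $norm_e ) = ( 0, 0, 0 );
    my %seen;
    for my $term ( keys %{$d}, keys %{$e} ) {
        next if $seen{$term}++;    # visit each term of the union once
        my $w  = $weight->( $rock, $term );
        my $dv = ( $d->{$term} // 0 ) * $w;    # component of the weighted Dv
        my $ev = ( $e->{$term} // 0 ) * $w;    # component of the weighted Ev
        $dot    += $dv * $ev;
        $norm_d += $dv**2;
        $norm_e += $ev**2;
    }
    # Assumption: mirror the unweighted case and give undef for a zero vector.
    return undef if $norm_d == 0 || $norm_e == 0;
    return $dot / sqrt( $norm_d * $norm_e );
}

# Hypothetical weight function: look the term up in a table, default 1.
sub table_weight {
    my ( $table, $term ) = @_;
    return $table->{$term} // 1;
}

my %d = ( foo => 3, bar => 1 );
my %e = ( foo => 1, bar => 4 );
my %weights = ( foo => 0 );    # a weight of 0 ignores the term 'foo' entirely
print weighted_cosine_similarity( \%d, \%e, \&table_weight, \%weights ), "\n";    # prints 1
```

With `foo` weighted to zero, only `bar` remains in both vectors, so the two documents point in the same direction and the similarity is 1.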

    spinellia@acm.org (Andrea Spinelli)
    walter@humans.net (Walter Vannini)

    2001-11-02 - initial revision
    2001-11-20 - added WeightedCosineSimilarity, suggested by JP Mc Gowan <jp.mcgowan@ucd.ie>

We did not use `Storable`, because we wanted to fine-tune compression and version compatibility. However, this choice may easily be reversed by redefining `WriteToString` and `NewFromString`.
