
TM::Corpus::Document - Topic Maps, Document

use TM::Corpus::Document;
my $d = new TM::Corpus::Document ({ mime => 'text/plain',
val => 'this is some text' });
# accessors
$val = $d->val ('new text');
$mime = $d->mime ('new/mime');
$url = $d->ref ('http://somewhere/some.txt');
my @tokens = $d->tokenize; # leaving defaults
# using some predefined tokenizing steps, in this order
my @tokens = $d->tokenize (tokenizers => 'NUMBER QUOTER COM&BO');
# using negative ones (i.e. throw things away)
my @tokens = $d->tokenize (tokenizers => 'COM&BO COM-BO -INTERPUNCT');
# using filters (detect numbers and throw them away)
my @tokens = $d->tokenize (tokenizers => 'NUMBER !NUMBER');
# get also debugging output
my @tokens = $d->tokenize (tokenizers => 'NUMBER TAP !NUMBER TAP');
# define your own filters
$TM::Corpus::Document::FILTERS{'!4LETTER'} =
sub { $_ = shift; return length($_) == 4 ? '' : $_; };
my @tokens = $d->tokenize (tokenizers => 'WORDER !4LETTER');
# collect features, here single tokens and two subsequent tokens
my %features = $d->features (tokenizers => '...',
featurizers => 'TOKEN1 TOKEN2');

This package implements documents, that is, the information pertinent to a document: its content, the corresponding MIME type and, if the document has one, a reference to it.
Its most notable functionality is finding the tokens (i.e. word substrings) of a document and deriving a feature vector from them.


The constructor expects a hash reference with one or more of the following fields:
ref: A URI string referring to the network address of the document. In Topic Maps parlance this will be the subject locator for the document topic.
val: The character stream associated with the document.
mime: The MIME type of the content.
Accessor for the ref component of the document. The other components are not affected.
Accessor for the val component of the document. The other components are not affected.
Accessor for the mime component of the document. The other components are not affected.
The tokenize method returns a list reference to the recognized tokens.
To generate this list, the method first looks up an extractor according to the document's MIME type. The extractor extracts the text, but also relevant metadata, such as title, length, etc. Some extractors are predefined; you can get a list with
perl -MTM::Corpus::Document -e 'warn join ",", keys %TM::Corpus::Document::EXTRACTORS;'
The extractor can also be overridden:
$d->tokenize (extractor => sub { ... });
It gets the value (content) as first parameter.
In a second step the content stream of the document is analyzed for patterns, such as numbers, dates or words. Which patterns are relevant, and in which order they are processed, is controlled from the outside with a simple language: a string listing the steps in order.
Example:
$d->tokenize (tokenizers => 'COM&BO COM-BO');
Positive tokenizers detect patterns and bless them as valid tokens which will not be further analyzed or questioned:
WORDER: detects words in the current locale
QUOTER: detects substrings wrapped in ""
NUMBER: detects decimal numbers
DATE: detects date specifications in the current locale (NOT IMPLEMENTED!)
COM&BO: detects patterns like AT&T
COM-BO: detects patterns like T-Mobile
Capitalize: detects capitalized words
Negative tokenizers detect patterns and immediately throw them away:
-WORDER: everything which is left as text fragment is suppressed
-QUOTER: quoted text is suppressed
-NUMBER: decimal numbers are suppressed
-INTERPUNCT: punctuation characters are suppressed
Filters take existing tokens and either modify them, suppress them or pass them through (and suppress everything else).
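The way positive tokenizers run in order, with accepted tokens no longer touched by later steps, can be illustrated with a small standalone sketch. It does not use the module and is not its implementation: the %PATTERNS regexes and the run_steps helper are hypothetical and only show why the order of the steps matters.

```perl
use strict;
use warnings;

# Hypothetical patterns; the module's own regexes may differ.
my %PATTERNS = (
    'COM&BO' => qr/\b\w+&\w+\b/,
    'COM-BO' => qr/\b\w+-\w+\b/,
    'NUMBER' => qr/\b\d+(?:\.\d+)?\b/,
);

# Hypothetical helper: run the named steps in order over a text.
sub run_steps {
    my ($text, @steps) = @_;
    # each piece is [ string, is_token ]; start with one raw fragment
    my @pieces = ([ $text, 0 ]);
    for my $step (@steps) {
        my $re = $PATTERNS{$step} or die "unknown step $step";
        my @next;
        for my $p (@pieces) {
            # accepted tokens are final and skip all later steps
            if ($p->[1]) { push @next, $p; next; }
            # split with a capturing group puts the matches at odd positions
            my @parts = split /($re)/, $p->[0];
            for my $i (0 .. $#parts) {
                next if $parts[$i] eq '';
                push @next, [ $parts[$i], $i % 2 ];
            }
        }
        @pieces = @next;
    }
    # only accepted tokens are returned; leftover fragments are dropped
    return map { $_->[0] } grep { $_->[1] } @pieces;
}

print join('|', run_steps('AT&T and T-Mobile paid 42 dollars',
                          'COM&BO', 'COM-BO', 'NUMBER')), "\n";  # AT&T|T-Mobile|42
print join('|', run_steps('T-1000', 'COM-BO', 'NUMBER')), "\n";  # T-1000
print join('|', run_steps('T-1000', 'NUMBER', 'COM-BO')), "\n";  # 1000
```

In the last two calls the same input yields different tokens: once NUMBER has claimed 1000, the COM-BO step no longer sees a complete T-1000 pattern.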
You can override and extend tokenizers and filters by modifying the hashes %TOKENIZERS and %FILTERS. You can hook in, for instance, a stopword list like this:
my %stops = map { $_ => 1 } qw(Terror CIA HLS);
$TM::Corpus::Document::FILTERS{'!STOPS'} =
sub { $_ = shift; return $stops{$_} ? '' : $_; };
$d->tokenize (tokenizers => ' .... !STOPS ....');
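The contract of such a filter sub (return the empty string to suppress a token, return the token to keep it) can be tried out in isolation. The following is a minimal sketch that does not use the module; apply_filter is a hypothetical driver, and how the module itself calls its filters is an assumption here.

```perl
use strict;
use warnings;

my %stops = map { $_ => 1 } qw(Terror CIA HLS);

# Same shape as the filters above: '' drops the token, anything else keeps it.
my $stop_filter = sub { my $t = shift; return $stops{$t} ? '' : $t; };

# Hypothetical driver: apply one filter to a token list and drop the empties.
sub apply_filter {
    my ($filter, @tokens) = @_;
    return grep { $_ ne '' } map { $filter->($_) } @tokens;
}

my @kept = apply_filter($stop_filter, qw(The CIA denied the report));
print join(' ', @kept), "\n";   # The denied the report
```

The sketch copies the argument into a lexical variable instead of assigning to $_, so the filter stays safe when it is called from inside map or grep.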
The features method computes the feature vector of a document. It accepts all parameters of the method tokenize, as it invokes that method first. Additionally you can specify which featurizers to apply:
my %fv = $d->features (tokenizers => 'QUOTER NUMBER WORDER',
featurizers => 'TOKEN1 TOKEN2');
The following featurizers are defined:
TOKEN1: occurrences of single tokens are counted in the document
TOKEN2: occurrences of two subsequent tokens in the document are counted
TOKEN3: occurrences of three subsequent tokens in the document are counted
MIME: the MIME type is converted into some numeric value
You can extend or modify the %FEATURIZERS hash to add your own featurizers.
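What counting single tokens and subsequent token pairs amounts to can be shown with a standalone sketch. It does not use the module: count_ngrams is a hypothetical helper, and the key format (tokens joined by a space) is an assumption, not necessarily the format the module uses in its feature vector.

```perl
use strict;
use warnings;

# Hypothetical helper: count every run of $n subsequent tokens.
sub count_ngrams {
    my ($n, @tokens) = @_;
    my %count;
    for my $i (0 .. $#tokens - $n + 1) {
        $count{ join ' ', @tokens[$i .. $i + $n - 1] }++;
    }
    return %count;
}

my @tokens = qw(to be or not to be);

# single tokens (like TOKEN1) and token pairs (like TOKEN2) in one hash
my %fv = (count_ngrams(1, @tokens), count_ngrams(2, @tokens));

print "$_ => $fv{$_}\n" for sort keys %fv;
```

For this input the single token "to" and the pair "to be" are both counted twice, while "or not" is counted once.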

No. Plucene tokenizing was NOT helpful.


Copyright 200[8] by Robert Barta, <drrho@cpan.org>
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.