NAME

AI::Classifier::Text::Analyzer - computing feature vectors from documents

VERSION

version 0.03

SYNOPSIS

    use AI::Classifier::Text::Analyzer;

    my $analyzer = AI::Classifier::Text::Analyzer->new();
    
    my $features = $analyzer->analyze( 'aaaa http://www.example.com/bbb?xx=yy&bb=cc;dd=ff' );

DESCRIPTION

Computes feature vectors from text using a few heuristics and adds word counts (using Text::WordCounter by default).

The object is immutable, but some methods take a second parameter that acts as an accumulator for the features found in the given text.

The heuristics rely on specific values and methods that work for the authors' use case but are not guaranteed to give good results universally; see the source for details.

ATTRIBUTES

word_counter

An object with a word_count method that calculates the frequency of words in a text document. Defaults to Text::WordCounter.
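
Because only a word_count method is required, any duck-typed object should be usable here. The following is a minimal sketch, not code from the distribution: the class name My::SimpleCounter is hypothetical, and the word_count( $text, $features ) signature with a hash accumulator is an assumption modelled on the accumulator convention described above.

```perl
use strict;
use warnings;

package My::SimpleCounter;    # hypothetical class name, for illustration only

sub new { return bless {}, shift }

# Assumed signature: word_count( $text, $features ). Adds one count per
# lower-cased word to the accumulator hash and returns that hash.
sub word_count {
    my ( $self, $text, $features ) = @_;
    $features //= {};
    $features->{ lc $_ }++ for $text =~ /(\w+)/g;
    return $features;
}

package main;

my $counts = My::SimpleCounter->new->word_count('Spam spam eggs');
print "$_ => $counts->{$_}\n" for sort keys %$counts;

# With the module installed, the counter would be passed in like this:
#   my $analyzer = AI::Classifier::Text::Analyzer->new(
#       word_counter => My::SimpleCounter->new,
#   );
```

A custom counter of this kind could apply its own tokenisation or stop-word handling; whether the analyzer expects anything beyond the word_count method should be checked against the source.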

global_feature_weight

The weight assigned to the computed features of the text document. Defaults to 2.

METHODS

new(word_counter => $foo, global_feature_weight => 3)

Creates a new AI::Classifier::Text::Analyzer object. Both arguments are optional.

analyze($document, $features)

Computes the feature vector of the given document, adding it to the initial vector passed in $features.

analyze_urls($document, $features)

Computes a vector of special URL-related features of the given text. Currently the NO_URLS, MANY_URLS and REPEATED_URLS features are used.
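
To illustrate what such features might look like, here is a rough, self-contained sketch. It is not the module's implementation: the helper name url_features, the naive URL regex, the "more than 3" threshold for MANY_URLS and the use of a single weight for all three features are assumptions made for this example. Only the feature names and the default weight of 2 come from this documentation.

```perl
use strict;
use warnings;

# Hypothetical helper: derives the three URL features from a text and
# records them in the $features accumulator.
sub url_features {
    my ( $text, $features, $weight ) = @_;
    $features //= {};
    $weight   //= 2;    # mirrors the documented global_feature_weight default

    # Naive URL matcher, assumed for illustration only.
    my @urls = $text =~ m{(https?://[^\s<>"']+)}g;

    # Count how often each distinct URL occurs.
    my %seen;
    $seen{$_}++ for @urls;

    $features->{NO_URLS}       = $weight if !@urls;
    $features->{MANY_URLS}     = $weight if @urls > 3;    # assumed threshold
    $features->{REPEATED_URLS} = $weight if grep { $_ > 1 } values %seen;

    return $features;
}

my $f = url_features('see http://a.example/x and http://a.example/x again');
print join( ', ', sort keys %$f ), "\n";
```

The real thresholds and weights live in the module source, which remains the authoritative reference.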

filter($document)

Removes HTML-related parts from the text.

SEE ALSO

AI::NaiveBayes(3), AI::Classifier::Text(3)

AUTHOR

Zbigniew Lukasiak <zlukasiak@opera.com>, Tadeusz Sośnierz <tsosnierz@opera.com>

COPYRIGHT AND LICENSE

This software is copyright (c) 2012 by Opera Software ASA.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.
