Stuart Watt > Search-FreeText-0.05 > Search::FreeText::LexicalAnalysis

Download:
Search-FreeText-0.05.tar.gz

Dependencies

Annotate this POD

CPAN RT

Open  0
Report a bug
Source  

NAME ^

Search::FreeText::LexicalAnalysis - basic lexical analyser for the open search system

DESCRIPTION ^

An open lexical analysis processor, which you can either override by subclassing, or which you can add your own filters to. Each filter is called with a reference to an array of words, and returns a reference to a new array of words. This is the process method, and the base class Search::FreeText::LexicalAnalysisProcess defines the protocol for each step in the pipeline.

SYNOPSIS ^

 # Selects default filters
 my $lexicalizer = new Search::FreeText::LexicalAnalysis ();
 # Selects named filters only
 my $lexicalizer = new Search::FreeText::LexicalAnalysis 
     (-filters => [ qw(MyLexicalAnalysis::Heuristics
                       Search::FreeText::LexicalAnalysis::Tokenize
                       Search::FreeText::LexicalAnalysis::Stop 
                       Search::FreeText::LexicalAnalysis::Stem) ]);

 my $words = $lexicalizer->process($text);

METHODS ^

new Search::FreeText::LexicalAnalysis( -search => searchmod [, -filters => FilterList] );

This is the main constructor for a lexicon. The -search parameter passes the search object instance, and is passed in turn to each of the filters, allowing them to look inside the search instance for any additional data if they need to.

You can use the -filters initialisation key to pass a list of classes for filters. By default the set of filters implements stemming, a reasonably complete stop list, and a few heuristics that tighten up the searching. the order of the filters is fairly important, and looks a bit like this:

Heuristics

Pattern-level heuristics that work on whole strings, implemented by default by Search::FreeText::LexicalAnalysis::Heuristics.

Tokenize

Splits a set of strings into an array of words. Implemented by default by Search::FreeText::LexicalAnalysis::Tokenize. Before this, strings represent documents; after this, they represent words, which is why its position in the list of filters is important.

Stop

Pass the array of words through a stop list filter, removing words that are likely to be irrelevant. Implemented by default by Search::FreeText::LexicalAnalysis::Stop.

Stem

Pass the array of words through a stemmer. Implemented by default by Search::FreeText::LexicalAnalysis::Stem, which in turn uses Lingua::Stem.

$self->initialize();

Initializes the lexical analyser, loading any modules that are needed for the list of filters.

$self->process(words...);

Passes the list of words to the filters as a pipeline. The array of words usually starts as a single string containing all the words, and one of the filters (Tokenize) turns this into an array of individual words. This allows some processing before words are split, as well as the usual stemming and stoplisting afterwards.

AUTHOR ^

Stuart Watt <S.N.K.Watt@rgu.ac.uk>

Copyright (c) 2003 The Robert Gordon University. All rights reserved.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.