Module Version: 1.9

NAME

WAIT::Filter - Perl extension providing the basic freeWAIS-sf reduction functions

SYNOPSIS

  use WAIT::Filter qw(Stem Soundex Phonix isolc disolc isouc disouc
                      isotr disotr stop grundform utf8iso);

  $stem   = Stem($word);
  $scode  = Soundex($word);
  $pcode  = Phonix($word);
  $lword  = isolc($word);
  disolc($word);
  $uword  = isouc($word);
  disouc($word);
  $trword = isotr($word);
  disotr($word);
  $word   = stop($word);
  $word   = grundform($word);

  @words = WAIT::Filter::split($word);
  @words = WAIT::Filter::split2($word);
  @words = WAIT::Filter::split3($word);
  @words = WAIT::Filter::split4($word); # arbitrary numbers allowed

DESCRIPTION

This tiny module gives access to the basic reduction functions built into freeWAIS-sf.

Stem(word)

Reduces word to its stem using the well-known Porter algorithm.

  AU: Porter, M.F.
  TI: An Algorithm for Suffix Stripping
  JT: Program
  VO: 14
  PP: 130-137
  PY: 1980
  PM: JUL

Soundex(word)

Computes the 4-byte Soundex code for word.

  AU: Gadd, T.N.
  TI: 'Fisching for Werds'. Phonetic Retrieval of written text in
      Information Retrieval Systems
  JT: Program
  VO: 22
  NO: 3
  PP: 222-237
  PY: 1988

Phonix(word)

Computes the 8-byte Phonix code for word.

  AU: Gadd, T.N.
  TI: PHONIX: The Algorithm
  JT: Program
  VO: 24
  NO: 4
  PP: 363-366
  PY: 1990
  PM: OCT
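
To give a feel for what a 4-byte phonetic code looks like, here is a minimal pure-Perl sketch of the widely used American Soundex rules. It is an illustration only: the name my_soundex is hypothetical, and the code produced by Soundex(word) in freeWAIS-sf may differ in detail.

```perl
use strict;
use warnings;

# Hypothetical illustration of classic American Soundex.
# NOT the WAIT::Filter / freeWAIS-sf implementation.
my %code;
@code{ split //, 'BFPVCGJKQSXZDTLMNR' } = split //, '111122222222334556';

sub my_soundex {
    my ($word) = @_;
    (my $w = uc $word) =~ tr/A-Z//cd;      # keep ASCII letters only
    return '' if $w eq '';

    my @letters = split //, $w;
    my $out  = shift @letters;             # the first letter is kept as-is
    my $prev = $code{$out} // '';          # its code still suppresses repeats

    for my $c (@letters) {
        next if $c eq 'H' || $c eq 'W';    # H and W do not separate equal codes
        my $d = $code{$c} // '';           # vowels (and Y) carry no code
        $out .= $d if $d ne '' && $d ne $prev;
        $prev = $d;
    }
    return substr($out . '000', 0, 4);     # pad with zeros, cut to 4 bytes
}

my $scode = my_soundex('Robert');          # R163
```

Words that sound alike, such as "Robert" and "Rupert", map to the same code under these rules, which is what makes such codes useful as an index reduction.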

ISO character case functions

There are some additional functions which convert some/most ISO Latin-1 characters to upper and lower case. For maximum speed there are also destructive versions which modify the argument in place instead of allocating and returning a copy. For convenience, the destructive versions also return the argument. So all of the following are valid, and $word will contain the lowercased string.

  $word = isolc($word);
  $word = disolc($word);
  disolc($word);
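
The difference between the copying and the destructive variants can be sketched in plain Perl. The functions my_isolc and my_disolc below are hypothetical stand-ins, not the module's implementation; the byte ranges are an assumption derived from the table of recognized characters in this section, and the strings are treated as Latin-1 bytes.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for isolc/disolc; NOT the WAIT::Filter code.

# Copying variant: works on a private copy and returns it.
sub my_isolc {
    my ($word) = @_;
    $word =~ tr/A-Z\xC0-\xCF\xD1-\xD6\xD8-\xDD/a-z\xE0-\xEF\xF1-\xF6\xF8-\xFD/;
    return $word;
}

# Destructive variant: $_[0] aliases the caller's variable, so the
# change is visible to the caller; the value is returned as well.
sub my_disolc {
    $_[0] =~ tr/A-Z\xC0-\xCF\xD1-\xD6\xD8-\xDD/a-z\xE0-\xEF\xF1-\xF6\xF8-\xFD/;
    return $_[0];
}

my $word = "\xC4RGER";          # "ÄRGER" as Latin-1 bytes
my $copy = my_isolc($word);     # $word is still upper case here
my_disolc($word);               # $word itself is now lower case
```

The destructive form avoids one string copy per call, which matters when a filter runs over every word of a large collection.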

Here are the hardcoded characters which are recognized:

  abcdefghijklmnopqrstuvwxyzàáâãäåæçèéêëìíîïñòóôõöøùúûüýß
  ABCDEFGHIJKLMNOPQRSTUVWXYZÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝß

$new = isolc($word)
disolc($word)

Converts to lower case.

$new = isouc($word)
disouc($word)

Converts to upper case.

$new = isotr($word)
disotr($word)

Removes non-letters according to the above table.

$new = stop($word)

Returns an empty string if $word is a stopword.
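
The behaviour described for stop can be mimicked with a hash lookup. This is a sketch under assumptions: the name my_stop, the three-word stop list, and the case-insensitive match are all hypothetical, and the stop list actually used by the module is not shown in this document.

```perl
use strict;
use warnings;

# Hypothetical stop list for illustration only.
my %is_stopword = map { $_ => 1 } qw(the a an);

# Returns '' for a stopword, otherwise the word unchanged.
sub my_stop {
    my ($word) = @_;
    return $is_stopword{ lc $word } ? '' : $word;
}

# Dropping the empty results leaves only the content words.
my @kept = grep { $_ ne '' } map { my_stop($_) } qw(the quick fox);
```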

$new = grundform($word)

Calls Text::German::reduce

$new = utf8iso($word)

Converts UTF-8 encoded strings to ISO-8859-1. WAIT is currently based internally on the Latin-1 character set, so if you process text in a different encoding, you should convert it to Latin-1 as the first filter.
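
For comparison, the same kind of conversion can be written with the core Encode module. The helper my_utf8iso is a hypothetical name and not how utf8iso is implemented; the sketch only shows the direction of the conversion, from UTF-8 octets to Latin-1 octets.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper, NOT the WAIT::Filter implementation:
# decode UTF-8 octets to Perl characters, then encode them as Latin-1 octets.
sub my_utf8iso {
    my ($octets) = @_;
    my $chars = decode('UTF-8', $octets);
    return encode('ISO-8859-1', $chars);
}

# "Straße" as UTF-8 bytes; the two-byte sequence C3 9F becomes the single byte DF.
my $latin1 = my_utf8iso("Stra\xC3\x9Fe");
```

Characters that have no Latin-1 equivalent cannot survive this step, which is one reason to do the conversion first and let the remaining filters see only Latin-1 input.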

split, split2, split3, ...

The splitN functions all take a scalar as input and return a list of words. split acts just like Perl's split(' '). split2 eliminates all words shorter than 2 characters (bytes) from the list, split3 those shorter than 3 characters (bytes), and so on.
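
The rule described above can be expressed as a split followed by a length filter. The generic helper my_split_n is hypothetical (the module exposes the minimum length through the function name instead), and length counts bytes here only because the input is a plain byte string.

```perl
use strict;
use warnings;

# Hypothetical generic stand-in for split, split2, split3, ...
sub my_split_n {
    my ($min_len, $text) = @_;
    # split(' ', ...) splits on runs of whitespace and ignores leading whitespace.
    return grep { length($_) >= $min_len } split(' ', $text);
}

my @words = my_split_n(3, "  a an the quick fox ");
# @words now holds: the, quick, fox
```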

AUTHOR

Ulrich Pfeifer <pfeifer@ls6.informatik.uni-dortmund.de>

SEE ALSO

perl(1).
