NAME

WAIT::Filter - Perl extension providing the basic freeWAIS-sf reduction functions

SYNOPSIS

  use WAIT::Filter qw(Stem Soundex Phonix isolc disolc isouc disouc
                      isotr disotr stop grundform utf8iso);

  $stem   = Stem($word);
  $scode  = Soundex($word);
  $pcode  = Phonix($word);
  $lword  = isolc($word);
  disolc($word);
  $uword  = isouc($word);
  disouc($word);
  $trword = isotr($word);
  disotr($word);
  $word   = stop($word);
  $word   = grundform($word);

  @words = WAIT::Filter::split($word);
  @words = WAIT::Filter::split2($word);
  @words = WAIT::Filter::split3($word);
  @words = WAIT::Filter::split4($word); # arbitrary numbers allowed

DESCRIPTION

This tiny module gives access to the basic reduction functions built into freeWAIS-sf.

Stem(word)

reduces word using the well-known Porter algorithm.

  AU: Porter, M.F.
  TI: An Algorithm for Suffix Stripping
  JT: Program
  VO: 14
  PP: 130-137
  PY: 1980
  PM: JUL

Soundex(word)

computes the 4-byte Soundex code for word.

  AU: Gadd, T.N.
  TI: 'Fisching for Werds'. Phonetic Retrieval of written text in
      Information Retrieval Systems
  JT: Program
  VO: 22
  NO: 3
  PP: 222-237
  PY: 1988

Phonix(word)

computes the 8-byte Phonix code for word.

  AU: Gadd, T.N.
  TI: PHONIX: The Algorithm
  JT: Program
  VO: 24
  NO: 4
  PP: 363-366
  PY: 1990
  PM: OCT

ISO character case functions

There are some additional functions which convert some/most ISO Latin-1 characters to upper and lower case. For maximum speed there are also destructive versions, which change the argument in place instead of allocating a copy to return. For convenience, the destructive versions also return the argument. So all of the following are valid, and $word will contain the lowercased string.

  $word = isolc($word);
  $word = disolc($word);
  disolc($word);
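The difference between the copying and the destructive variants can be sketched in pure Perl. This is only an illustration: my_isolc and my_disolc are hypothetical stand-ins rather than the module's actual implementation, and the tr ranges are assumed to be the Latin-1 code points of the recognized letters.

```perl
use strict;
use warnings;

# Hypothetical stand-in for isolc: works on a copy and returns it.
sub my_isolc {
    my ($word) = @_;    # copies the argument
    $word =~ tr/A-Z\xC0-\xCF\xD1-\xD6\xD8-\xDD/a-z\xE0-\xEF\xF1-\xF6\xF8-\xFD/;
    return $word;
}

# Hypothetical stand-in for disolc: $_[0] is an alias for the caller's
# variable, so the change happens in place and no copy is allocated.
sub my_disolc {
    $_[0] =~ tr/A-Z\xC0-\xCF\xD1-\xD6\xD8-\xDD/a-z\xE0-\xEF\xF1-\xF6\xF8-\xFD/;
    return $_[0];       # returned for convenience
}

my $word = "\xC4RGER";          # Latin-1 bytes, A-umlaut followed by "RGER"
my $copy = my_isolc($word);     # $word itself is left untouched
my $same = my_disolc($word);    # $word is now lowercased in place
```

Because the destructive sketch modifies its argument through the alias, passing it a string literal instead of a variable would fail with a read-only error.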

Here are the hardcoded characters which are recognized:

  abcdefghijklmnopqrstuvwxyzàáâãäåæçèéêëìíîïñòóôõöøùúûüýß
  ABCDEFGHIJKLMNOPQRSTUVWXYZÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝß

$new = isolc($word)
disolc($word)

transposes to lower case.

$new = isouc($word)
disouc($word)

transposes to upper case.

$new = isotr($word)
disotr($word)

Removes non-letters according to the table above.

$new = stop($word)

Returns an empty string if $word is a stopword.

$new = grundform($word)

Calls Text::German::reduce

$new = utf8iso($word)

Converts UTF-8 encoded strings to ISO-8859-1. WAIT is currently based internally on the Latin-1 character set, so if you process anything in a different encoding, you should convert to Latin-1 as the first filter.
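As a rough illustration of this conversion step, the following sketch uses the core Encode module. my_utf8iso is a hypothetical stand-in, not the module's own code, and how the real utf8iso treats characters that have no Latin-1 equivalent is not covered here.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical stand-in for utf8iso: decode UTF-8 bytes into characters,
# then re-encode those characters as ISO-8859-1 bytes.
sub my_utf8iso {
    my ($bytes) = @_;
    my $chars = decode('UTF-8', $bytes);
    return encode('ISO-8859-1', $chars);
}

my $utf8   = "Gr\xC3\xBC\xC3\x9Fe";    # seven UTF-8 bytes
my $latin1 = my_utf8iso($utf8);        # five Latin-1 bytes
```

Running such a conversion first means every later filter sees one byte per character, which is what the byte-oriented functions in this module assume.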

split, split2, split3, ...

The splitN functions all take a scalar as input and return a list of words. split acts just like Perl's split(' '). split2 eliminates all words from the list that are shorter than 2 characters (bytes), split3 eliminates those shorter than 3 characters (bytes), and so on.
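The described behaviour can be approximated in pure Perl as follows. my_splitn is a hypothetical helper that takes the minimum length as an explicit argument; it assumes byte strings, where length() counts bytes.

```perl
use strict;
use warnings;

# Hypothetical helper: split on whitespace as split(' ') does, then
# keep only the words that are at least $min bytes long.
sub my_splitn {
    my ($min, $text) = @_;
    return grep { length($_) >= $min } split ' ', $text;
}

my @all   = my_splitn(1, "  a bc def ghij ");   # like split
my @three = my_splitn(3, "  a bc def ghij ");   # like split3
```

With this input, @all holds all four words, while @three keeps only "def" and "ghij".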

AUTHOR

Ulrich Pfeifer <pfeifer@ls6.informatik.uni-dortmund.de>

SEE ALSO

perl(1).
