WAIT::Filter - Perl extension providing the basic freeWAIS-sf reduction functions
    use WAIT::Filter qw(Stem Soundex Phonix isolc disolc isouc disouc
                        isotr disotr stop grundform utf8iso);

    $stem   = Stem($word);
    $scode  = Soundex($word);
    $pcode  = Phonix($word);
    $lword  = isolc($word);
    disolc($word);
    $uword  = isouc($word);
    disouc($word);
    $trword = isotr($word);
    disotr($word);
    $word   = stop($word);
    $word   = grundform($word);

    @words = WAIT::Filter::split($word);
    @words = WAIT::Filter::split2($word);
    @words = WAIT::Filter::split3($word);
    @words = WAIT::Filter::split4($word); # arbitrary numbers allowed
This tiny module gives access to the basic reduction functions built into freeWAIS-sf.
Stem(word): reduces word using the well-known Porter algorithm.
AU: Porter, M.F. TI: An Algorithm for Suffix Stripping JT: Program VO: 14 PP: 130-137 PY: 1980 PM: JUL
Soundex(word): computes the 4-byte Soundex code for word.
AU: Gadd, T.N. TI: 'Fisching for Werds'. Phonetic Retrieval of written text in Information Retrieval Systems JT: Program VO: 22 NO: 3 PP: 222-237 PY: 1988
Phonix(word): computes the 8-byte Phonix code for word.
AU: Gadd, T.N. TI: PHONIX: The Algorithm JT: Program VO: 24 NO: 4 PP: 363-366 PY: 1990 PM: OCT
There are some additional functions which convert some/most ISO Latin-1 characters to upper or lower case. For maximum speed there are also destructive versions, which change the argument in place instead of allocating and returning a copy. For convenience, the destructive versions also return the argument. So all of the following are valid, and $word will contain the lowercased string:
    $word = isolc($word);
    $word = disolc($word);
    disolc($word);
Here are the hardcoded characters which are recognized:
    abcdefghijklmnopqrstuvwxyzàáâãäåæçèéêëìíîïñòóôõöøùúûüýß
    ABCDEFGHIJKLMNOPQRSTUVWXYZÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝß
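The difference between the copying and the destructive variants can be sketched in pure Perl. This is only an illustration, not the module's implementation: `my_isolc` and `my_disolc` are hypothetical stand-ins, and the sketch assumes the strings hold Latin-1 code points. The mapping follows the table above; ß has no separate upper-case form there and is left alone.

```perl
use strict;
use warnings;
use utf8;    # this source file contains literal non-ASCII characters

# Hypothetical stand-in for the destructive variant.
# $_[0] is an alias for the caller's variable, so tr/// changes it in place.
sub my_disolc {
    $_[0] =~ tr/ABCDEFGHIJKLMNOPQRSTUVWXYZÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝ/abcdefghijklmnopqrstuvwxyzàáâãäåæçèéêëìíîïñòóôõöøùúûüý/;
    return $_[0];    # returned for convenience, as the text describes
}

# Hypothetical stand-in for the copying variant.
sub my_isolc {
    my ($word) = @_;      # copy the argument first
    my_disolc($word);     # lowercase the copy only
    return $word;
}
```

Because the destructive sketch modifies its argument, it has to be called with a variable; passing a string literal would fail with a read-only modification error.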
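$new = isolc($word), disolc($word): transposes to lower case.
$new = isouc($word), disouc($word): transposes to upper case.
$new = isotr($word), disotr($word): removes non-letters according to the above table.
$new = stop($word): returns an empty string if $word is a stopword.
$new = grundform($word): calls Text::German::reduce.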
utf8iso: converts UTF-8 encoded strings to ISO-8859-1. WAIT is currently based on the Latin-1 character set internally, so if you process anything in a different encoding, you should convert it to Latin-1 as the first filter.
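For readers who want to see what such a conversion involves, here is a minimal sketch using the core Encode module. It is not how utf8iso itself is implemented; `utf8_to_latin1` is a hypothetical helper introduced only for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper, not part of WAIT::Filter.
sub utf8_to_latin1 {
    my ($octets) = @_;
    my $chars = decode('UTF-8', $octets);    # UTF-8 bytes -> Perl characters
    return encode('ISO-8859-1', $chars);     # characters  -> Latin-1 bytes
}

# "Müller" as UTF-8 bytes is converted to its Latin-1 byte form
my $latin1 = utf8_to_latin1("M\xC3\xBCller");
```

Characters that have no Latin-1 equivalent cannot survive this step, which is one reason to run the conversion before any other filter.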
The splitN functions all take a scalar as input and return a list of words. split acts just like Perl's split(' '). split2 eliminates all words from the list that are shorter than 2 characters (bytes), split3 eliminates those shorter than 3 characters (bytes), and so on.
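The documented behaviour of the splitN family can be approximated in a few lines of plain Perl. This is a sketch under the description above, not the module's code; `split_min` is a hypothetical name, and length() counts bytes only as long as the input is a byte string.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the documented splitN behaviour.
sub split_min {
    my ($min, $text) = @_;
    # split ' ' skips leading whitespace and splits on runs of whitespace
    return grep { length($_) >= $min } split ' ', $text;
}

my @all   = split_min(1, "  a bb ccc dddd ");   # like split:  a, bb, ccc, dddd
my @three = split_min(3, "  a bb ccc dddd ");   # like split3: ccc, dddd
```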
Ulrich Pfeifer <pfeifer@ls6.informatik.uni-dortmund.de>
perl(1).