Lingua::Stem::Snowball - Perl interface to Snowball stemmers.
    use Lingua::Stem::Snowball;
    my @lang = stemmers();
OO interface:
    my $lang = 'en';
    my $dict = Lingua::Stem::Snowball->new(lang => $lang);
    # Test if $lang is correct
    die $@ if ($@);

    my $locale = 'C';
    my $dict = Lingua::Stem::Snowball->new(lang => $lang, locale => $locale);

    my $lemm = $dict->stem($word);
    my $lemm = $dict->stem($word, \$is_stemmed);

    my $dict = Lingua::Stem::Snowball->new();
    $dict->lang($lang);
    $dict->locale($locale);
    my $lemm = $dict->stem($word);

    my @lemm = $dict->stem(\@words);
Plain interface:
    my $lemm = stem($lang, $word);
    my $lemm = stem($lang, $word, $locale);
    my $lemm = stem($lang, $word, $locale, \$is_stemmed);
This module provides a unified Perl interface to the Snowball stemmers (http://snowball.tartarus.org) and supports a range of languages. It is written in C for high performance and provides both an OO and a plain interface.
This module was developed to give the OpenFTS project, a full-text search engine (http://openfts.sourceforge.net), generic access to stemming algorithms.
The module is very similar to Lingua::Stem. However, Lingua::Stem is written in pure Perl, whereas Lingua::Stem::Snowball is an XS version of the Snowball stemmers.
The following stemmers are available (as of Lingua::Stem 0.70). L:S stands for Lingua::Stem and L:S:S for Lingua::Stem::Snowball:
    |--------------------------|
    | Language   | L:S | L:S:S |
    |--------------------------|
    | English    | y   | y     |
    | French     | y   | y     |
    | Spanish    | n   | y     |
    | Portuguese | y   | y     |
    | Italian    | y   | y     |
    | German     | y   | y     |
    | Dutch      | n   | y     |
    | Swedish    | y   | y     |
    | Norwegian  | y   | y     |
    | Danish     | y   | y     |
    | Russian    | n   | y     |
    | Finnish    | n   | y     |
    | Galician   | y   | n     |
    |--------------------------|
Here is a small benchmark using the example files from the Snowball distribution (with no cache):
    |---------------------------------------------------|
    | Language | Unique |           Time (s)            |
    |          | words  | L:S:S | L:S:S |  L:S  |  L:S  |
    |          |        |   @   |   $   |   @   |   $   |
    |---------------------------------------------------|
    | DA       | 23829  |   0.5 |   1.1 |   7.3 |  14.2 |
    | DE       | 35033  |   0.9 |   1.9 |  64.3 |  73.5 |
    | EN       | 30428  |   0.7 |   1.5 |   2.5 |   8.8 |
    | FR       | 20403  |   0.6 |   1.1 | 182.7 | 188.0 |
    | IT       | 35494  |   1.0 |   2.0 | 345.6 | 350.2 |
    | NO       | 20628  |   0.4 |   1.0 |  14.3 |  20.6 |
    | PT       | 32016  |   0.8 |   1.7 | 405.6 | 414.8 |
    | SV       | 30623  |   0.0 |   0.5 |  15.9 |  25.6 |
    |---------------------------------------------------|
Here is the same benchmark run on all the unique words found in the Bible:
    |---------------------------------------------------|
    | EN       | 12718  |   0.3 |   0.7 |   1.0 |   3.6 |
    |---------------------------------------------------|
new: Creates a new instance of the stemmer.
The constructor takes hash-style parameters. The following parameters are recognized:
lang: ISO language code, for example 'en'.
locale: locale to use, for example 'C'.
stem($word): Returns the stemmed word for $word.
stem(\@words): Returns an array of the stemmed words contained in @words.
lang($lang): Accessor for the lang parameter. If there is no stemmer for $lang, the language is not changed.
locale($locale): Accessor for the locale parameter.
stemmers(): Returns a list of all languages for which a stemmer is available.
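Because lang leaves the language unchanged when no stemmer exists, it can help to check a language code against the list of available stemmers first. The following is only a sketch: has_stemmer is a hypothetical helper, not part of the module, and the list of languages is a fixed sample so the code runs on its own. With the module installed, that list would come from stemmers() instead.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Lingua::Stem::Snowball): returns a true
# value if $lang appears in the given list of available language codes.
sub has_stemmer {
    my ($lang, @available) = @_;
    return scalar grep { $_ eq lc $lang } @available;
}

# Assumption: a fixed sample list stands in for the result of stemmers().
my @available = qw(da de en fr);

for my $lang (qw(en xx)) {
    printf "%s: %s\n", $lang,
        has_stemmer($lang, @available) ? 'supported' : 'no stemmer';
}
```

A caller could run such a check before calling new or lang and report an unsupported language explicitly.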
By default, the stemmer will not strip apostrophes for you. So, if you make the following call:
    my @words = ('The', 'Ranger\'s', 'Digest');
    my @stemmed = $dict->stem(\@words);
The result might not be what you expect, for example if the words come from a split(' ') on a user's search entry.
Stripping 's in Perl can be a little expensive, so you can let the stemmer do it in C:
    my @words = ('The', 'Ranger\'s', 'Digest');
    $dict->strip_apostrophes(1);
    my @stemmed = $dict->stem(\@words);
This method strips 's (English) and l', d', ... (French).
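For comparison, here is a rough pure-Perl sketch of the kind of stripping described above. It is an illustration only: strip_apostrophes_pp is a hypothetical helper, and its two patterns (a trailing 's and a leading l' or d') are assumptions that cover only the cases named in the text, not the module's actual rule set.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Lingua::Stem::Snowball): returns copies of
# the words with a trailing English 's and a leading French l' or d' removed.
sub strip_apostrophes_pp {
    my @words = @_;
    for (@words) {
        s/'s\z//i;       # Ranger's -> Ranger
        s/\A[ld]'//i;    # l'arbre  -> arbre
    }
    return @words;
}

my @words = ('The', "Ranger's", 'Digest', "l'arbre");
print join(' ', strip_apostrophes_pp(@words)), "\n";
```

Running two substitutions per word like this in Perl is the cost that strip_apostrophes(1) moves into C.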
Please report any requests, suggestions or bugs via the RT bug-tracking system at http://rt.cpan.org/ or by email to bug-Lingua-Stem-Snowball@rt.cpan.org.
http://rt.cpan.org/NoAuth/Bugs.html?Dist=Lingua-Stem-Snowball is the RT queue for Lingua::Stem::Snowball. Please check to see if your bug has already been reported.
Copyright 2004-2005
Currently maintained by Fabien Potencier, fabpot@cpan.org
Original authors: Oleg Bartunov, oleg@sai.msu.su, and Teodor Sigaev, teodor@stack.net
This software may be freely copied and distributed under the same terms and conditions as Perl.
Snowball files and stemmers are covered by the BSD license.
http://snowball.tartarus.org, Lingua::Stem