Treex::Tool::EnglishMorpho::Lemmatizer - rule-based lemmatizer for English
    use Treex::Tool::EnglishMorpho::Lemmatizer;
    my $lemmatizer = Treex::Tool::EnglishMorpho::Lemmatizer->new();

    my ($word, $tag) = qw( goes VBZ );
    my ($lemma, $neg) = $lemmatizer->lemmatize($word, $tag);
    # $lemma = 'go', $neg = 0

    ($lemma, $neg) = $lemmatizer->lemmatize('unhappy', 'JJ');
    # $lemma = 'happy', $neg = 1
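Accepts a (word, tag) pair and returns a pair: the lemma and a flag indicating whether the word is a negated form (for example, unhappy yields the lemma happy with the flag set to 1).
The contraction doesn't should be tokenized as two words, does and n't, which are then lemmatized as do and not.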
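The tokenization requirement above can be sketched as a small pre-processing step. This is only an illustration: `split_negative_contraction` is a hypothetical helper, not part of Treex, and a real tokenizer handles many more cases. It assumes the Penn Treebank convention of splitting off the clitic n't.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Treex): split a negative contraction
# into its Penn-style tokens, e.g. "doesn't" -> ("does", "n't").
sub split_negative_contraction {
    my ($token) = @_;

    # Capture everything before the final "n't" and the clitic itself.
    if ( $token =~ /^(.+)(n't)$/i ) {
        return ( $1, $2 );
    }
    return ($token);    # no contraction: leave the token untouched
}

for my $token ( "doesn't", "goes" ) {
    print join( ' ', split_negative_contraction($token) ), "\n";
}
```

Each resulting token would then be passed to `lemmatize` separately, together with its own tag.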
Correct Penn-style tagging is crucial for the Lemmatizer to work. For example, it does not change words tagged NN or NNP; it changes only words tagged NNS or NNPS. So (pence, NN) -> pence, but (pence, NNS) -> penny.
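Why the tag matters can be shown with a toy model. The sketch below is not the module's code: `toy_noun_lemma`, the one-entry lookup table and the naive "strip a final s" rule are all assumptions made for illustration. It only demonstrates the idea that singular tags pass through untouched while plural tags trigger a lookup or a rule.

```perl
use strict;
use warnings;

# Toy data for illustration only; the real rules and exception files
# are much larger and are not reproduced here.
my %PLURAL_EXCEPTIONS = ( pence => 'penny' );
my %PLURAL_TAGS       = map { $_ => 1 } qw(NNS NNPS);

# Hypothetical function: return the lemma for a (word, tag) pair.
sub toy_noun_lemma {
    my ( $word, $tag ) = @_;

    # Words without a plural tag (NN, NNP, ...) are returned unchanged.
    return $word unless $PLURAL_TAGS{$tag};

    # Irregular plurals are looked up before any rule is applied.
    my $irregular = $PLURAL_EXCEPTIONS{ lc $word };
    return $irregular if defined $irregular;

    # Naive regular rule (an assumption): strip a final "s".
    ( my $lemma = $word ) =~ s/s$//;
    return $lemma;
}

for my $pair ( [ 'pence', 'NN' ], [ 'pence', 'NNS' ], [ 'dogs', 'NNS' ] ) {
    my ( $word, $tag ) = @$pair;
    printf "%s/%s -> %s\n", $word, $tag, toy_noun_lemma( $word, $tag );
}
```

With a wrong tag the lookup is never reached, which is why a tagging error propagates directly into the lemma.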
PEDT::MorphologyAnalysis uses Morpha (written in Flex) and in some cases gives different lemmatization.
Morpha leaves comparatives and superlatives unchanged.
PEDT::MorphologyAnalysis does only basic analysis (later -> lat).
Declension of words of Latin origin is deliberately not covered by any Lemmatizer rules. A few widely known English words of Latin origin are (or should be) covered by the exception files (e.g. indices NNS -> index). In my opinion it is better, especially for translation purposes, to leave the other Latin words unchanged, since they will mostly have the same form in the target language (biological terms like Spheniscidae). BTW: Errors made by Morpha's Latin fallbacks are sometimes funny: sci-fi -> sci-fus, Mitsubishi -> mitsubishus, Shanghai -> shanghaus, ...
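The trade-off can be illustrated with two toy policies. Both functions are hypothetical: `naive_latin_fallback` is only a guess at the kind of rule that yields the errors quoted above (lowercase, then rewrite a final i as us), not Morpha's actual code, and `exception_only` with its one-entry table merely stands in for the exception files.

```perl
use strict;
use warnings;

# Tiny stand-in for an exception file (the real files are not shown here).
my %LATIN_EXCEPTIONS = ( indices => 'index' );

# Hypothetical naive fallback: lowercase and rewrite a final "i" as "us".
sub naive_latin_fallback {
    my ($word) = @_;
    ( my $lemma = lc $word ) =~ s/i$/us/;
    return $lemma;
}

# Hypothetical exception-only policy: listed words are mapped,
# everything else is returned unchanged.
sub exception_only {
    my ($word) = @_;
    my $known = $LATIN_EXCEPTIONS{ lc $word };
    return defined $known ? $known : $word;
}

for my $word (qw(indices sci-fi Mitsubishi Shanghai Spheniscidae)) {
    printf "%-12s naive: %-14s exception-only: %s\n",
        $word, naive_latin_fallback($word), exception_only($word);
}
```

The naive rule reproduces sci-fus, mitsubishus and shanghaus, while the exception-only policy changes nothing except the listed word indices.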
Martin Popel <email@example.com>
Copyright © 2008 - 2011 by Institute of Formal and Applied Linguistics, Charles University in Prague
This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.