Lingua::BrillTagger - Natural-language tokenizing and part-of-speech tagging
    use Lingua::BrillTagger;
    my $t = Lingua::BrillTagger->new;

    # Load tagger information
    $t->load_lexicon($path);
    $t->load_bigrams($path);
    $t->load_lexical_rules($path);
    $t->load_contextual_rules($path);

    # Tag a sentence
    my $tagged = $t->tag($string);
    my $tagged = $t->tag(\@tokens);

    # Tokenize a sentence
    my $tokens = $t->tokenize($string);
Part-of-speech tagging is the act of assigning a part-of-speech label (noun, verb, etc.) to each token of a natural-language sentence.
There are many different ways to do this, resulting in many different styles of output and varying demands on space and time. One of the most successful recent methods was developed by Eric Brill as part of his 1993 Ph.D. work at the University of Pennsylvania: http://www.cs.jhu.edu/~brill/dissertation.ps. It uses the notion of "transformation-based error-driven" learning, in which a sequence of transformational rules is learned to transform a naive part-of-speech tagging into a good tagging.
Lingua::BrillTagger is a Perl wrapper around Brill's tagger. The tagger itself is written in C.
The following methods are available in the Lingua::BrillTagger class:
new() creates a new Lingua::BrillTagger object and returns it. For initialization, new() accepts a lexicon_size parameter, which should be an integer estimate of how many words are in your lexicon. It does not need to be precise, as it is only used to set the number of buckets in the lexicon hash. Because that hash is a custom structure from the Brill tagger rather than a Perl hash, the value does need to be reasonable. The default is 100,000.
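One way to arrive at a reasonable lexicon_size is to count the entries in the lexicon before constructing the tagger. The sketch below assumes one entry per non-blank line; estimate_lexicon_size() is a hypothetical helper, not part of the module, and the named-argument form of new() shown in the final comment is an assumption that the text above does not confirm.

```perl
use strict;
use warnings;

# Count the non-blank lines readable from a filehandle. With one lexicon
# entry per line, this approximates the number of words in the lexicon.
sub estimate_lexicon_size {
    my ($fh) = @_;
    my $count = 0;
    while (my $line = <$fh>) {
        $count++ if $line =~ /\S/;
    }
    return $count;
}

# An in-memory sample stands in for a real LEXICON file here.
my $sample = "the DT\ndog NN VB\n\nbarks VBZ NNS\n";
open my $fh, '<', \$sample or die "cannot open in-memory sample: $!";
my $size = estimate_lexicon_size($fh);
close $fh;
print "lexicon_size estimate: $size\n";

# Assumed call form (named argument), not confirmed by the documentation:
#   my $t = Lingua::BrillTagger->new(lexicon_size => $size);
```

Since the value only sizes the hash buckets, a rough count like this should be sufficient.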
load_lexicon() loads a LEXICON file, in the format described in the README.LONG file from the Brill tagger distribution. In a nutshell, the format of each line is "token tag1 tag2 ... tagn", where tag1 is the most likely tag for the given token. Calling this method is mandatory before tagging.
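To make the line format concrete, here is a minimal sketch that splits one lexicon line into its token and tag list. parse_lexicon_line() is a hypothetical helper written for illustration, and the sample line with its Penn Treebank-style tags is made up rather than taken from a real LEXICON file.

```perl
use strict;
use warnings;

# Parse one lexicon line of the form "token tag1 tag2 ... tagn".
# Returns the token and an array reference of tags, most likely tag first.
# Returns an empty list for blank lines or lines without any tag.
sub parse_lexicon_line {
    my ($line) = @_;
    my ($token, @tags) = split ' ', $line;
    return unless defined $token && @tags;
    return ($token, \@tags);
}

my ($token, $tags) = parse_lexicon_line("run VB NN VBP");
my @alternatives = @{$tags}[1 .. $#{$tags}];
print "$token: most likely $tags->[0]; alternatives: @alternatives\n";
```

The module reads the file itself, so a helper like this is only useful for inspecting or validating a lexicon before loading it.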
load_bigrams() loads a BIGRAMS file, in the format described in the README.LONG file from the Brill tagger distribution. Calling this method is optional.
Loads any extra words besides those in LEXICON. Calling this method is optional.
load_lexical_rules() loads a LEXICALRULEFILE file, in the format described in the README.LONG file from the Brill tagger distribution. Calling this method is mandatory before tagging.
load_contextual_rules() loads a CONTEXTUALRULEFILE file, in the format described in the README.LONG file from the Brill tagger distribution. Calling this method is mandatory before tagging.
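The mandatory and optional loaders can be wrapped in one routine that refuses to continue when a required file is missing. This is only a sketch: load_tagger_data() and the RecordingTagger stand-in class are hypothetical, the file names are placeholders, and the call order simply follows the synopsis rather than any documented requirement.

```perl
use strict;
use warnings;

# Call the loaders in the order the synopsis uses. $tagger is any object
# providing the load_* methods; %path maps a data kind to a file path.
# Dies if a path for one of the mandatory kinds is missing.
sub load_tagger_data {
    my ($tagger, %path) = @_;
    my %optional = (bigrams => 1);
    for my $kind (qw(lexicon bigrams lexical_rules contextual_rules)) {
        my $file = $path{$kind};
        if (!defined $file) {
            next if $optional{$kind};
            die "no path given for required '$kind' data\n";
        }
        my $method = "load_$kind";
        $tagger->$method($file);
    }
    return $tagger;
}

# Stand-in object that only records which loaders were called, so the
# sketch runs without the real module installed. In real use, pass
# Lingua::BrillTagger->new instead.
{
    package RecordingTagger;
    sub new {
        my ($class) = @_;
        return bless({ calls => [] }, $class);
    }
    sub load_lexicon          { push @{ $_[0]{calls} }, 'lexicon' }
    sub load_bigrams          { push @{ $_[0]{calls} }, 'bigrams' }
    sub load_lexical_rules    { push @{ $_[0]{calls} }, 'lexical_rules' }
    sub load_contextual_rules { push @{ $_[0]{calls} }, 'contextual_rules' }
}

# Placeholder file names; bigrams is omitted because it is optional.
my $loaded = load_tagger_data(
    RecordingTagger->new,
    lexicon          => 'LEXICON',
    lexical_rules    => 'LEXICALRULEFILE',
    contextual_rules => 'CONTEXTUALRULEFILE',
);
print join(', ', @{ $loaded->{calls} }), "\n";
```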
tag() invokes the tagging algorithm on a single sentence and returns a two-element list containing a reference to an array of tokens and a reference to a corresponding array of tags. The input may be specified as a string, in which case it will first be passed to the tokenize() method; alternatively, the input may be given as a reference to an array of tokens.
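Because the tokens and tags come back as two parallel arrays, a common next step is to pair them up. The sketch below shows one way to do that; format_tagged() is a hypothetical helper, and hand-written sample data stands in for the return value of the tagging call so that the example runs without the module or its data files.

```perl
use strict;
use warnings;

# Join parallel token and tag arrays into a "token/TAG token/TAG ..." string.
sub format_tagged {
    my ($tokens, $tags) = @_;
    die "token and tag counts differ\n" unless @$tokens == @$tags;
    return join ' ', map { "$tokens->[$_]/$tags->[$_]" } 0 .. $#{$tokens};
}

# With a loaded tagger this would be:
#   my ($tokens, $tags) = $t->tag($string);
# The arrays below are illustrative stand-ins, not real tagger output.
my ($tokens, $tags) = ([qw(The dog barks .)], [qw(DT NN VBZ .)]);
print format_tagged($tokens, $tags), "\n";
```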
tokenize() runs a standard tokenization algorithm for English-language free text and returns the result as an array reference. The input should be specified as a string.
The Lingua::BrillTagger code will allow you to create more than one tagger object in the same Perl script by calling new() more than once. There should be no problems in the Perl code with doing this, but because Brill's underlying C code was originally intended to run in batch mode with a single instance of the tagger, it may not work well when several taggers are in use at once. If you run into problems, let me know, especially if you can give me a patch to fix it.
Ken Williams, <email@example.com>
The Lingua::BrillTagger perl interface is copyright (C) 2004 Thomson Legal & Regulatory, and written by Ken Williams. It is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
The Brill Tagger is copyright (C) 1993 by the Massachusetts Institute of Technology and the University of Pennsylvania - you will find full copyright and license information in its distribution. The Tagger.patch file distributed here is granted under the same license terms as the tagger code itself.