Lingua::NATools::Lexicon - Encapsulates NATools Lexicon files
use Lingua::NATools::Lexicon;
$lex = Lingua::NATools::Lexicon->new("file.lex");
$word = $lex->word_from_id(2);
$id = $lex->id_from_word("cavalo");
@ids = $lex->sentence_to_ids("era uma vez um gato maltez");
$sentence = $lex->ids_to_sentence(10,2,3,2,5,4,3,2,5);
$lex->size;
$lex->id_count(2);
$lex->close;
This module encapsulates NATools Lexicon files, making them accessible from Perl. It has an object-oriented interface. First, open a lexicon file using:
$lex = Lingua::NATools::Lexicon->new("lexicon.file.lex");
When you are done, do not forget to close it. Closing frees memory and makes room for opening other lexicon files.
Lexicon files map words to identifiers and vice versa. Usage is simple: use id_from_word to get the id for a word, and word_from_id to get the word back from its id. If you need many conversions at once, to construct or parse a sentence, use ids_to_sentence or sentence_to_ids.
This is the Lingua::NATools::Lexicon constructor. Pass it a lexicon file. These files usually end with a .lex extension:
my $lexicon = Lingua::NATools::Lexicon->new("file.lex");
This method saves the current lexicon object in the supplied file:
Call the close method to close a Lexicon. This is important to free resources (both memory and lexicon slots, since only a limited number of lexicons can be open at a time).
The word_from_id method converts one word-id to a word:
my $word = $lexicon->word_from_id ($word_id);
The ids_to_sentence method calls word_from_id for each parameter passed. It receives a list of word identifiers and returns the corresponding string, with words separated by a space character.
my $sentence = $lexicon->ids_to_sentence(1,3,5,2,3,6);
The id_from_word method converts one word to its corresponding identifier (word-id).
my $word_id = $lexicon->id_from_word( $word );
The sentence_to_ids method calls id_from_word for each word in a sentence. Note that it does not tokenize: it only splits the sentence on the space character, so you must first preprocess the string with an NLP tokenizer.
The method returns a reference to the list of identifiers.
my $wid_list = $lexicon->sentence_to_ids("a sentence");
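As a rough sketch of how the two sentence-level methods fit together, the helper below converts a sentence to identifiers and back. The name round_trip is hypothetical and not part of the module, and $lexicon is assumed to be an already opened lexicon object. The synopsis assigns the result of sentence_to_ids to an array while the description says it returns a reference, so the sketch accepts either form rather than assuming one.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Lingua::NATools::Lexicon).
# $lexicon is assumed to be an already opened lexicon object.
sub round_trip {
    my ($lexicon, $sentence) = @_;

    # Call in list context and unwrap an array reference if that is
    # what came back, so both return styles are handled.
    my @ret = $lexicon->sentence_to_ids($sentence);
    my @ids = (@ret == 1 && ref $ret[0] eq 'ARRAY') ? @{ $ret[0] } : @ret;

    # ids_to_sentence takes a plain list of identifiers.
    return $lexicon->ids_to_sentence(@ids);
}
```

For a sentence whose words are all in the lexicon and separated by single spaces, the result would be expected to match the input. How unknown words are mapped is not covered by this sketch.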
The id_count method returns the number of occurrences of a specific word. Note that the word must be supplied as its identifier, not as the string itself.
my $count = $lexicon->id_count( 45 );
The occurrences method returns the size of the corpus (number of tokens) from which the lexicon was built: it sums the occurrences of every word and returns the total.
my $total = $lexicon->occurrences;
The size method returns the number of different words (types) in the corpus from which the lexicon was built.
my $size = $lexicon->size;
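The three counting methods (id_count, occurrences and size) can be combined into simple corpus statistics. The sketch below is illustrative only: relative_frequency and type_token_ratio are hypothetical helper names, not part of the module, and the ratio assumes that size counts only real word types.

```perl
use strict;
use warnings;

# Hypothetical helpers (not part of Lingua::NATools::Lexicon).
# $lexicon is assumed to be an already opened lexicon object.

# Share of the corpus tokens accounted for by one word identifier.
sub relative_frequency {
    my ($lexicon, $word_id) = @_;
    my $total = $lexicon->occurrences;
    return 0 unless $total;    # avoid dividing by zero on an empty lexicon
    return $lexicon->id_count($word_id) / $total;
}

# Number of distinct words (types) divided by the number of tokens.
sub type_token_ratio {
    my ($lexicon) = @_;
    my $total = $lexicon->occurrences;
    return 0 unless $total;
    return $lexicon->size / $total;
}
```

Both helpers call occurrences each time. If that total is needed for many words, it is probably worth fetching it once and reusing the value.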
This method adds a new word to the lexicon file. The word will be created with an occurrence count of 1.
Note that lexicon files can't be created from scratch using this module. The module is intended to manipulate already created lexicon files. A standard lexicon file doesn't have space for new words, so you need to enlarge it first. Use the size method to find the current size, and the enlarge method to add some empty space.
After creating a new word (or for an existing word) you might want to change its occurrence count. Call the set_id_count method for that, passing it the word identifier and the new occurrence count.
This method is permissive and lets you set a negative occurrence count. Setting an occurrence count to 0 will not delete the word entry.
$lexicon->set_id_count( $wid, ++$count);
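Since set_id_count takes an absolute value, incrementing a count means reading it with id_count first. A minimal sketch of that read-then-write pattern follows. The helper name bump_id_count is hypothetical and not part of the module, and $lexicon is assumed to be an already opened lexicon object.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Lingua::NATools::Lexicon).
# Adds $by (default 1) to the occurrence count of $word_id and
# returns the new count.
sub bump_id_count {
    my ($lexicon, $word_id, $by) = @_;
    $by = 1 unless defined $by;

    # Read the current count, then store the updated absolute value.
    my $new_count = $lexicon->id_count($word_id) + $by;
    $lexicon->set_id_count($word_id, $new_count);
    return $new_count;
}
```

The sketch does not guard against negative results, in line with the permissive behaviour described above. Add a check if negative counts make no sense for your data.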
The enlarge method creates extra space for new words. You do not need to know the current size, just the number of words you need to add. Pass that as the argument to the method. The returned object should accommodate that many more words. Also, try to call this method as few times as possible: first calculate the number of words you need, then enlarge the Lexicon.
$lexicon->enlarge( 100 ); # 100 more words
See perl(1) and NATools documentation.
Alberto Manuel Brandao Simoes, <email@example.com>
Copyright 2002-2012 by NATURA Project
This library is free software; you can redistribute it and/or modify it under the GNU General Public License 2, which you should find in the parent directory. This module should be distributed together with the whole NATools package, with the respective copyright notice.