


Module Version: v0.7.10


Lingua::NATools::Lexicon - Encapsulates NATools Lexicon files


  use Lingua::NATools::Lexicon;

  $lex = Lingua::NATools::Lexicon->new("file.lex");

  $word = $lex->word_from_id(2);

  $id = $lex->id_from_word("cavalo");

  $ids = $lex->sentence_to_ids("era uma vez um gato maltez");  # array reference

  $sentence = $lex->ids_to_sentence(10,2,3,2,5,4,3,2,5);





This module encapsulates NATools Lexicon files, making them accessible from Perl. The interface is object-oriented. First, open a lexicon file using:

 $lex = Lingua::NATools::Lexicon->new("lexicon.file.lex");

When you are done, do not forget to close it. Closing frees memory and makes room for opening other lexicon files.


Lexicon files map words to identifiers and vice versa. Usage is simple: use id_from_word to get the id for a word, and word_from_id to get the word back from its id. If you need to convert many items at once, use ids_to_sentence to construct a sentence from a list of ids, or sentence_to_ids to parse a sentence into ids.


This is the Lingua::NATools::Lexicon constructor. Pass it a lexicon file. These files usually end with a .lex extension:

   my $lexicon = Lingua::NATools::Lexicon->new("file.lex");


This method saves the current lexicon object in the supplied file:



Call this method to close a lexicon. Closing is important because it frees resources: memory, and one of the limited number of lexicons that can be open at the same time.



The word_from_id method converts one word identifier (word-id) to the corresponding word:

   my $word = $lexicon->word_from_id ($word_id);


The ids_to_sentence method calls word_from_id for each parameter passed. It therefore receives a list of word identifiers and returns the corresponding string, with words separated by a single space character.

   my $sentence = $lexicon->ids_to_sentence(1,3,5,2,3,6);
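The behaviour described above (one lookup per identifier, results joined with spaces) can be sketched in a few lines. This is only an illustration of the description, not the module's actual code; ids_to_sentence_sketch and the StandInWords class are hypothetical names, and the stand-in exists only so the sketch runs without a lexicon file.

```perl
use strict;
use warnings;

# Sketch of the described behaviour: look up each id, join with spaces.
# $lexicon can be any object that offers a word_from_id method.
sub ids_to_sentence_sketch {
    my ($lexicon, @ids) = @_;
    return join ' ', map { $lexicon->word_from_id($_) } @ids;
}

# Stand-in object (NOT the real Lingua::NATools::Lexicon class).
# It maps an id to the word stored at that position in a plain array.
package StandInWords {
    sub new          { my ($class, @words) = @_; return bless { words => [@words] }, $class }
    sub word_from_id { my ($self, $id) = @_; return $self->{words}[$id] }
}

my $lexicon = StandInWords->new('era', 'uma', 'vez', 'um', 'gato');
print ids_to_sentence_sketch($lexicon, 0, 3, 4), "\n";   # era um gato
```

With a real lexicon the identifiers would of course come from the lexicon file, not from array positions.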


The id_from_word method converts one word to its corresponding identifier (word-id):

    my $word_id = $lexicon->id_from_word( $word );


The sentence_to_ids method calls id_from_word for each word of a sentence. Note that it does not tokenize: it just splits the sentence on the space character, so you must preprocess the string with an NLP tokenizer first.

The method returns a reference to the list of identifiers.

  my $wid_list = $lexicon->sentence_to_ids("a sentence");
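Because the sentence is only split on spaces, punctuation attached to a word would be looked up as part of that word. The following is a minimal sketch of a preprocessing step, not a substitute for a real tokenizer; simple_tokenize is a hypothetical helper that is not part of NATools, and the set of punctuation marks it handles is an arbitrary choice.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of NATools): put spaces around common
# punctuation marks so that a plain split on spaces yields separate tokens.
sub simple_tokenize {
    my ($text) = @_;
    $text =~ s/([.,;:!?()"])/ $1 /g;   # isolate punctuation marks
    $text =~ s/\s+/ /g;                # collapse runs of whitespace
    $text =~ s/^ | $//g;               # trim leading and trailing space
    return $text;
}

my $clean = simple_tokenize("Era uma vez um gato, e um cavalo.");
print "$clean\n";   # Era uma vez um gato , e um cavalo .

# With a lexicon open in $lexicon, the cleaned string could then be passed on:
#   my $wid_list = $lexicon->sentence_to_ids($clean);
```

The tokenization used here should match the one used when the lexicon was built; otherwise the resulting tokens may not be found.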


The id_count method returns the number of occurrences of a specific word. Note that the word must be supplied as its identifier, not as the string itself.

  my $count = $lexicon->id_count( 45 );


The occurrences method returns the size (number of tokens) of the corpus from which the lexicon was built: it sums the occurrence counts of all words and returns the total.

   my $total = $lexicon->occurrences;


The size method returns the number of different words (types) in the corpus from which the lexicon was built.

  my $size = $lexicon->size;
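The three counting methods described above (id_count, occurrences and size) can be combined to derive simple corpus statistics. The sketch below is one possible way to do so; relative_frequency, type_token_ratio and the StandInCounts class are hypothetical names, and the stand-in only mimics those three methods so that the example runs without a lexicon file.

```perl
use strict;
use warnings;

# Relative frequency of a word id: its count divided by the corpus size.
sub relative_frequency {
    my ($lexicon, $wid) = @_;
    my $total = $lexicon->occurrences;
    return 0 unless $total;                     # avoid division by zero
    return $lexicon->id_count($wid) / $total;
}

# Type/token ratio: number of distinct words divided by number of tokens.
sub type_token_ratio {
    my ($lexicon) = @_;
    my $total = $lexicon->occurrences;
    return 0 unless $total;
    return $lexicon->size / $total;
}

# Stand-in object (NOT the real Lingua::NATools::Lexicon class).
# It is built from a plain id => count list.
package StandInCounts {
    sub new         { my ($class, %c) = @_; return bless { counts => {%c} }, $class }
    sub id_count    { my ($self, $wid) = @_; return $self->{counts}{$wid} // 0 }
    sub size        { my ($self) = @_; return scalar keys %{ $self->{counts} } }
    sub occurrences { my ($self) = @_; my $n = 0; $n += $_ for values %{ $self->{counts} }; return $n }
}

my $lexicon = StandInCounts->new(1 => 6, 2 => 3, 3 => 1);
printf "%.2f\n", relative_frequency($lexicon, 1);   # 0.60
printf "%.2f\n", type_token_ratio($lexicon);        # 0.30
```

Whether a real lexicon counts reserved or special entries in size is not stated here, so the ratio should be read as an approximation.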


This method adds a new word to the lexicon file. The word will be created with an occurrence count of 1.

Note that lexicon files cannot be created from scratch using this module; it is intended to manipulate lexicon files that already exist. A standard lexicon file has no space for new words, so you need to enlarge it first. Use the size method to find the current size, and the enlarge method to add some empty space.



After creating a new word (or for an existing word) you may want to change its occurrence count. Call the set_id_count method for that, passing it the word identifier and the new occurrence count.

This method is permissive: it lets you set a negative occurrence count. Setting an occurrence count to 0 does not delete the word entry.

   $lexicon->set_id_count( $wid, ++$count);


The enlarge method creates extra space for new words. You do not need to know the current size of the lexicon, just the number of words you need to add. Pass that number as the argument to the method. The returned object should accommodate that many more words. Try to call this method as few times as possible: first calculate the number of words you need, then enlarge the lexicon once.

   $lexicon->enlarge( 100 ); # 100 more words
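To follow the advice of enlarging only once, the number of new words can be computed up front. The sketch below counts the distinct words in a list that are not yet known; count_new_words is a hypothetical helper, and the test for "known" is left to a caller-supplied subroutine because this documentation does not say what id_from_word returns for an unknown word.

```perl
use strict;
use warnings;

# Count the distinct words for which $is_known->($word) is false.
# Duplicates in @words are counted only once.
sub count_new_words {
    my ($is_known, @words) = @_;
    my %seen;
    return scalar grep { !$seen{$_}++ && !$is_known->($_) } @words;
}

# Illustration with a plain hash standing in for the lexicon's vocabulary.
my %known      = map { $_ => 1 } qw(um gato);
my @candidates = qw(um gato cavalo maltez cavalo);
my $needed     = count_new_words(sub { $known{ $_[0] } }, @candidates);
print "$needed\n";   # 2

# With a lexicon open in $lexicon, a single call would then reserve the space:
#   $lexicon->enlarge($needed) if $needed > 0;
```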


See perl(1) and NATools documentation.


Alberto Manuel Brandao Simoes


Copyright 2002-2012 by NATURA Project

This library is free software; you can redistribute it and/or modify it under the GNU General Public License version 2, a copy of which you should find in the parent directory. This module should be distributed together with the whole NATools package, including the respective copyright notice.
