NAME

Lingua::NATools::Lexicon - Encapsulates NATools Lexicon files

SYNOPSIS

  use Lingua::NATools::Lexicon;

  $lex = Lingua::NATools::Lexicon->new("file.lex");

  $word = $lex->word_from_id(2);

  $id = $lex->id_from_word("cavalo");

  $ids = $lex->sentence_to_ids("era uma vez um gato maltez");  # array reference

  $sentence = $lex->ids_to_sentence(10,2,3,2,5,4,3,2,5);

  $lex->size;

  $lex->id_count(2);

  $lex->close;

DESCRIPTION

This module encapsulates NATools Lexicon files, making them accessible from Perl through an object-oriented interface. First, you must open a lexicon file using:

 $lex = Lingua::NATools::Lexicon->new("lexicon.file.lex");

When you are done, do not forget to close it. Closing frees memory and makes room for opening other lexicon files.

 $lex->close;

Lexicon files map words to identifiers and vice versa. Their usage is simple: use

  $lex->id_from_word($word)

to get an id for a word. Use

  $lex->word_from_id($id)

to get the word back from the id. If you need to convert many items at once, use ids_to_sentence to build a sentence from identifiers, or sentence_to_ids to turn a sentence into identifiers.

`new`

This is the Lingua::NATools::Lexicon constructor. Pass it a lexicon file. These files usually end with a .lex extension:

   my $lexicon = Lingua::NATools::Lexicon->new("file.lex");

`save`

This method saves the current lexicon object in the supplied file:

   $lexicon->save("/there/lexicon.lex");

`close`

Call this method to close a Lexicon. This is important to free resources: both memory and lexicon slots, as only a limited number of lexicons can be open at a time.

   $lexicon->close;
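Because open lexicons are a limited resource, it can help to make sure close is called even when the code using the lexicon dies. The following is a minimal sketch of that pattern. The helper name with_lexicon is hypothetical and not part of Lingua::NATools; it only assumes an object that provides the close method described above.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Lingua::NATools: runs $code with an
# already opened lexicon and always calls close afterwards, even if
# $code dies.
sub with_lexicon {
    my ($lexicon, $code) = @_;
    my @result;
    my $ok  = eval { @result = $code->($lexicon); 1 };
    my $err = $@;               # save the error before close can change $@
    $lexicon->close;            # release the lexicon in every case
    die $err unless $ok;        # re-throw the original error, if any
    return wantarray ? @result : $result[0];
}

# Possible usage, assuming "file.lex" exists:
#   my $lexicon = Lingua::NATools::Lexicon->new("file.lex");
#   my $size    = with_lexicon($lexicon, sub { $_[0]->size });
```

The lexicon must not be used after the helper returns, since it has already been closed.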

`word_from_id`

This method is used to convert one word-id to a word:

   my $word = $lexicon->word_from_id ($word_id);

`ids_to_sentence`

This method calls word_from_id for each passed parameter. Thus, it receives a list of word identifiers, and returns the corresponding string. Words are separated by a space character.

   my $sentence = $lexicon->ids_to_sentence(1,3,5,2,3,6);

`id_from_word`

This method is used to convert one word to its corresponding identifier (word-id).

    my $word_id = $lexicon->id_from_word( $word );

`sentence_to_ids`

This method calls id_from_word for each word in a sentence. Note that the method does not tokenize the sentence: it just splits it on the space character. You must preprocess the string with an NLP tokenizer first.

The method returns a reference to the list of identifiers.

  my $wid_list = $lexicon->sentence_to_ids("a sentence");
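Since the sentence is only split on spaces, punctuation attached to a word would be looked up as part of that word. The sketch below shows one way to preprocess text before the lookup. Both helper names (naive_tokenize, text_to_ids) are hypothetical, and the tokenizer is deliberately naive: it would also split abbreviations and decimal numbers, so a real tokenizer is preferable for serious work.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Lingua::NATools: puts spaces around
# common punctuation so that a plain split on spaces yields one token
# per word or punctuation mark.
sub naive_tokenize {
    my ($text) = @_;
    $text =~ s/([.,;:!?()"])/ $1 /g;   # isolate punctuation marks
    $text =~ s/\s+/ /g;                # collapse runs of whitespace
    $text =~ s/^ | $//g;               # trim both ends
    return $text;
}

# Works with any object providing sentence_to_ids as documented above,
# that is, returning a reference to the list of identifiers.
sub text_to_ids {
    my ($lexicon, $text) = @_;
    my $wid_list = $lexicon->sentence_to_ids( naive_tokenize($text) );
    return @$wid_list;                 # dereference into a plain list
}

print naive_tokenize("Era uma vez, um gato!"), "\n";
# prints: Era uma vez , um gato !
```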

`id_count`

This method returns the number of occurrences for a specific word. Note that the word must be supplied as its identifier, and not the string itself.

  my $count = $lexicon->id_count( 45 );

`occurrences`

This method returns the size, in tokens, of the corpus from which the lexicon was built: it sums the occurrence counts of all words and returns the total.

   my $total = $lexicon->occurrences;

`size`

This method returns the number of different words (types) in the corpus from which the lexicon was built.

  my $size = $lexicon->size;
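The three counting methods (id_count, occurrences and size) can be combined into simple corpus statistics. The following sketch computes the relative frequency of one word and the type/token ratio of the corpus. The helper names are hypothetical; only the documented methods are called on the lexicon object.

```perl
use strict;
use warnings;

# Hypothetical helpers, not part of Lingua::NATools.

# Share of the corpus taken by the word with identifier $wid.
sub relative_frequency {
    my ($lexicon, $wid) = @_;
    my $tokens = $lexicon->occurrences;    # total number of tokens
    return 0 unless $tokens;               # avoid division by zero
    return $lexicon->id_count($wid) / $tokens;
}

# Number of different words divided by the number of tokens.
sub type_token_ratio {
    my ($lexicon) = @_;
    my $tokens = $lexicon->occurrences;
    return 0 unless $tokens;
    return $lexicon->size / $tokens;
}

# Possible usage, assuming $lexicon is an open lexicon:
#   printf "%.4f\n", relative_frequency($lexicon, 45);
#   printf "%.4f\n", type_token_ratio($lexicon);
```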

`add_word`

This method adds a new word to the lexicon file. The word will be created with an occurrence count of 1.

Note that lexicon files cannot be created from scratch using this module; it is intended to manipulate existing lexicon files. A standard lexicon file has no space for new words, so you need to enlarge it first. Use the size method to find the current size, and the enlarge method to add empty space.

   $lexicon->add_word("dog");

`set_id_count`

After creating a new word (or for an existing word) you might want to change its occurrence count. Call this method with the word identifier and the new occurrence count.

This method is permissive: it lets you set a negative occurrence count. Setting the occurrence count to 0 does not delete the word entry.

   $lexicon->set_id_count( $wid, ++$count);
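Because negative counts are accepted, callers who want to rule them out have to check for themselves. The sketch below adjusts a count relative to its current value and refuses to go below zero. The helper name bump_count is hypothetical; it relies only on id_count and set_id_count.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Lingua::NATools: adds $delta
# (default 1) to the occurrence count of word $wid and returns the new
# count. Dies instead of storing a negative count.
sub bump_count {
    my ($lexicon, $wid, $delta) = @_;
    $delta = 1 unless defined $delta;
    my $new = $lexicon->id_count($wid) + $delta;
    die "occurrence count for id $wid would become negative\n"
        if $new < 0;
    $lexicon->set_id_count($wid, $new);
    return $new;
}

# Possible usage, assuming $lexicon is an open lexicon:
#   bump_count($lexicon, 45);        # one more occurrence
#   bump_count($lexicon, 45, -3);    # three fewer, never below zero
```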

`enlarge`

This method creates extra space for new words. You do not need to know the current size, only the number of words you want to add; pass that number as the argument. The returned object should accommodate that many additional words. Try to call this method as few times as possible: first work out how many words you need, then enlarge the Lexicon once.

   $lexicon->enlarge( 100 ); # 100 more words
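The advice to enlarge once and then add words can be sketched as a small batch helper. The name add_words is hypothetical. The text above does not state whether enlarge returns the same object or a new one, so the sketch keeps whatever object comes back when the return value is a reference; treat that as an assumption. It also does not check whether a word is already in the lexicon.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Lingua::NATools: enlarges the
# lexicon once for the whole batch, then adds each distinct word.
sub add_words {
    my ($lexicon, @words) = @_;
    my %seen;
    my @unique = grep { !$seen{$_}++ } @words;   # drop repeated words
    return $lexicon unless @unique;

    # Assumption: if enlarge returns an object, that is the one to use
    # from now on; otherwise the original object was enlarged in place.
    my $enlarged = $lexicon->enlarge(scalar @unique);
    $lexicon = $enlarged if ref $enlarged;

    $lexicon->add_word($_) for @unique;          # each starts at count 1
    return $lexicon;
}

# Possible usage, assuming $lexicon is an open lexicon:
#   $lexicon = add_words($lexicon, qw(dog cat bird));
#   $lexicon->save("/there/lexicon.lex");
```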

AUTHOR

Alberto Manuel Brandao Simoes, <albie@alfarrabio.di.uminho.pt>

COPYRIGHT AND LICENSE

This library is free software; you can redistribute it and/or modify it under the GNU General Public License 2, which you should find in the parent directory. This module should be distributed together with the whole NATools package, including the respective copyright notice.
