NAME

Lingua::EN::WSD::CorpusBased::Corpus

SYNOPSIS

    use WordNet::QueryData;
    use Lingua::EN::WSD::CorpusBased::Corpus;

    my $wn = WordNet::QueryData->new;
    my $corpus = Lingua::EN::WSD::CorpusBased::Corpus->new('corpus' => '_democorpus_',
                                                           'wnref' => $wn);
    $corpus->init;                                     # build the index that sentences() relies on

    print join(' ', @{ $corpus->line(1) });            # prints 'hello world'
    print join(' ', @{ $corpus->sentences('hello') }); # prints '1'

DESCRIPTION

This module represents a corpus. It lets you look up the number of occurrences of a given word or word combination in a "fast" way, where "fast" means faster than iterating over the lines and matching patterns. The basic access method is count().

If init() is called once, the module builds an internal index that lists, for every word, the sentences in which it occurs.

This module is a helper module for Lingua::EN::WSD::CorpusBased.

METHODS

new

Creates a new Corpus object and reads in the corpus file.

Parameters:

debug If set to a true value, the module will generate some debug information to STDERR. Optional, default: 0.

wnref You can supply a reference to a WordNet::QueryData object. While reading the corpus, the words are then transformed to their stem forms. If you do not supply a value, the strings are used as they are in the corpus. If you supply a value other than a reference to a WordNet::QueryData object, the results are undefined (and untested ...). Optional, default: 0.

corpus The name of the file containing the corpus. The file must contain one sentence per line. Required. For testing purposes, you can pass '_democorpus_' as the file name. In this case, no file is read; instead, the hard-coded corpus included in the module is used:

   hello world
   application e-mail
   world peace love
   e-mail program
   e-mail job

Returns: A blessed reference.
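The "one sentence per line" layout can be sketched without the module at all. The following is a minimal illustration, not the module's actual code: it assumes the corpus is held as an array of word lists (the variable names are hypothetical) and it skips the optional WordNet stemming step.

```perl
use strict;
use warnings;

# The demo corpus from above, one sentence per line.
my $text = <<'END';
hello world
application e-mail
world peace love
e-mail program
e-mail job
END

# Split the text into sentences (lines), then each sentence into words.
# Assumed structure: an array of array references, one per sentence.
my @corpus = map { [ split ' ', $_ ] } split /\n/, $text;

printf "%d sentences\n", scalar @corpus;    # 5 sentences
print join(' ', @{ $corpus[0] }), "\n";     # hello world
```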

init

This method iterates over the corpus and indexes the words, so that it knows for each word in which sentences it occurs. No parameters. No return value.
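To illustrate what such an index might look like, here is a small stand-alone sketch. It is an assumption about the general technique (an inverted index from words to 1-based sentence numbers), not a copy of what init does internally; %index and @corpus are hypothetical names.

```perl
use strict;
use warnings;

# The demo corpus as a list of sentences, each a list of words.
my @corpus = (
    [qw(hello world)],
    [qw(application e-mail)],
    [qw(world peace love)],
    [qw(e-mail program)],
    [qw(e-mail job)],
);

# %index maps each word to the 1-based numbers of the sentences containing it.
my %index;
for my $i (0 .. $#corpus) {
    my %seen;    # record each sentence only once per word
    for my $word (@{ $corpus[$i] }) {
        push @{ $index{$word} }, $i + 1 unless $seen{$word}++;
    }
}

print join(' ', @{ $index{'e-mail'} }), "\n";    # 2 4 5
print join(' ', @{ $index{'world'} }), "\n";     # 1 3
```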

count

This method expects a list of words as parameters and returns the number of sentences in which all of these words occur, in any order.

    $obj->count("hello", "world");

To make things clear: The method removes the first argument from the args list (which is the reference to the object itself) and takes the entire rest of the list as the list of words. Therefore

    count($obj, "hello", "world");

is equivalent to the line above.
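The equivalence of the two call forms can be tried out with a small stand-in class. Demo::Corpus below is hypothetical and only mimics the argument handling described above; its count simply scans every sentence and makes no claim about how the real module counts.

```perl
use strict;
use warnings;

package Demo::Corpus;    # hypothetical stand-in, not the real module

sub new {
    my ($class, @sentences) = @_;
    return bless { sentences => [@sentences] }, $class;
}

sub count {
    my $self  = shift;    # first argument: the object itself
    my @words = @_;       # everything else: the words to look for
    my $n = 0;
    SENTENCE: for my $sentence (@{ $self->{sentences} }) {
        my %in = map { $_ => 1 } @$sentence;
        for my $word (@words) {
            next SENTENCE unless $in{$word};    # a word is missing: skip
        }
        $n++;    # all words were found, in any order
    }
    return $n;
}

package main;

my $obj = Demo::Corpus->new([qw(hello world)], [qw(world peace love)]);

my $as_method   = $obj->count('world', 'hello');
my $as_function = Demo::Corpus::count($obj, 'world', 'hello');
print "$as_method $as_function\n";    # 1 1
```

Outside the class's own package, the plain function form needs the fully qualified name, as in the last call above.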

sentences

This method takes the same arguments as count, namely a list of words. It returns a list of the sentences in which all of these words occur. This method works only if init has been called beforehand, i.e. if the corpus is indexed.
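A lookup of this kind can be sketched on top of a word index. The code below is an illustration under assumptions: the index is written out by hand for the demo corpus, sentences_with is a hypothetical helper, and it returns an array reference of sentence numbers, which matches how the SYNOPSIS dereferences the result.

```perl
use strict;
use warnings;

# Hand-written index for the demo corpus: word => sentence numbers.
my %index = (
    'hello'       => [1],
    'world'       => [1, 3],
    'application' => [2],
    'e-mail'      => [2, 4, 5],
    'peace'       => [3],
    'love'        => [3],
    'program'     => [4],
    'job'         => [5],
);

# Return the numbers of the sentences in which all given words occur.
sub sentences_with {
    my @words = @_;
    return [] unless @words;
    my %hits;
    for my $word (@words) {
        $hits{$_}++ for @{ $index{$word} || [] };
    }
    # Keep the sentences that were hit once for every requested word.
    return [ sort { $a <=> $b } grep { $hits{$_} == @words } keys %hits ];
}

print join(' ', @{ sentences_with('hello') }), "\n";            # 1
print join(' ', @{ sentences_with('e-mail', 'job') }), "\n";    # 5
```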

line

This method takes a line number (counting from 1) and returns a reference to the list of words in that line of the corpus.

merge_lists

This method is used internally. It takes references to two lists as arguments and returns a reference to a new list, containing the elements that were in both lists.

    $obj->merge_lists(['a','b'],['b','c']); 

The above example returns a reference to the list

    ('b')

Note that the method makes only shallow copies of the elements. If a list contains a reference to another list b, the reference in the new list still points to the same list b.
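Both the intersection and the shallow-copy behaviour can be demonstrated with a few lines of plain Perl. intersect_lists below is a hypothetical re-implementation written for this illustration, not the module's merge_lists.

```perl
use strict;
use warnings;

# Hypothetical helper: return the elements of the first list
# that also appear in the second list.
sub intersect_lists {
    my ($first, $second) = @_;
    my %in_second = map { $_ => 1 } @$second;
    return [ grep { $in_second{$_} } @$first ];
}

my $common = intersect_lists(['a', 'b'], ['b', 'c']);
print "@$common\n";    # b

# Shallow copy: a nested reference is shared, not duplicated.
my $inner  = ['x'];
my $shared = intersect_lists([$inner, 'a'], [$inner]);
push @$inner, 'y';     # change the original nested list ...
print scalar @{ $shared->[0] }, "\n";    # 2 (... and the result sees it)
```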

empty_cache

Empties the cache.

BUGS

Currently, the module is not able to return a list of the sentences in which the words occurred. Since this is most unfortunate, it will change in future versions.

COPYRIGHT

Copyright (c) 2006 by Nils Reiter.

This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.