
NAME

Text::StemTagPOS - Computes stemmed/POS tagged lists of text.

SYNOPSIS

  use Text::StemTagPOS;
  use Data::Dump qw(dump);
  my $stemTagger = Text::StemTagPOS->new;
  my $text = 'The first sentence. Sentence number two.';
  my $listOfStemmedTaggedSentences = $stemTagger->getStemmedAndTaggedText ($text);
  dump $listOfStemmedTaggedSentences;

DESCRIPTION

Text::StemTagPOS uses the modules Lingua::Stem::Snowball and Lingua::EN::Tagger to do part-of-speech tagging and stemming of English text. It was developed to pre-process text for other modules. All text should be in Perl's internal format; see Encode for converting text from various encodings to a Perl string.
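
As a minimal sketch of the conversion mentioned above, the core Encode module can turn encoded bytes into a Perl character string before the text is handed to the tagger. The sample byte string and the variable names are illustrative assumptions, and the input is assumed to be UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumed input: raw UTF-8 bytes, for example as read from a file that was
# opened without an encoding layer. \xC3\xA9 is the UTF-8 form of e-acute.
my $bytes = "The caf\xC3\xA9 is open. It closes at nine.";

# Decode the bytes into Perl's internal character representation.
my $text = decode('UTF-8', $bytes);

# The two-byte sequence is now a single character, so the lengths differ.
printf "%d bytes, %d characters\n", length($bytes), length($text);
```

The decoded $text is the kind of string that could then be passed to getStemmedAndTaggedText.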

CONSTRUCTOR

new

The method new creates an instance of the Text::StemTagPOS class with the following parameters:

isoLangCode
 isoLangCode => 'en'

isoLangCode is the ISO language code of the language that will be tagged and stemmed by the object. It must be 'en', which is the default; other languages may be added when POS taggers for them are added to CPAN.

endingSentenceTag
 endingSentenceTag => 'PP'

endingSentenceTag is the part-of-speech tag from Lingua::EN::Tagger that will be used to indicate the end of a sentence. The default is 'PP'. The value of endingSentenceTag must be a tag generated by the module Lingua::EN::Tagger; see the method getListOfPartOfSpeechTags for all the possible tags, which are based on the Penn Treebank tagset.

listOfPOSTypesToKeep and/or listOfPOSTagsToKeep
 listOfPOSTypesToKeep => [...], listOfPOSTagsToKeep => [...]

The method getTaggedTextToKeep uses listOfPOSTypesToKeep and listOfPOSTagsToKeep to build the default list of the parts-of-speech to be retained when filtering previously tagged text. The default is [qw(TEXTRANK_WORDS)], which keeps all the nouns and adjectives in the text, as used in the TextRank algorithm. Permitted values for listOfPOSTypesToKeep are 'ALL', 'ADJECTIVES', 'ADVERBS', 'CONTENT_WORDS', 'NOUNS', 'PUNCTUATION', 'TEXTRANK_WORDS', and 'VERBS'. listOfPOSTagsToKeep provides finer control over the parts-of-speech to be retained; for a list of all the possible tags, see the method getListOfPartOfSpeechTags.

METHODS

getStemmedAndTaggedText

 getStemmedAndTaggedText (@Text, $Text, \@Text)

The method getStemmedAndTaggedText returns a hierarchy of array references containing, for each word, its stem, its original form, its part-of-speech tag, its word index, its character position and length within the original text, and the index of the sentence it belongs to. The hierarchy is of the form

  [
    [ # sentence level: first sentence.
      [ # word level: first word.
        stemmed word, original word, part-of-speech tag, word index, word position, word length, sentence id
      ]
      [ # word level: second word.
        stemmed word, original word, part-of-speech tag, word index, word position, word length, sentence id
      ]
      ...
    ]
    [ # sentence level: second sentence.
      [ # word level: first word.
        stemmed word, original word, part-of-speech tag, word index, word position, word length, sentence id
      ]
      [ # word level: second word.
        stemmed word, original word, part-of-speech tag, word index, word position, word length, sentence id
      ]
      ...
    ]
  ]

Its parameters can be any combination of strings of text as scalars, references to scalars, arrays of strings of text, or references to arrays of strings of text. The examples below show the various ways to call the method. Note that the constants Text::StemTagPOS::WORD_STEMMED, Text::StemTagPOS::WORD_ORIGINAL, Text::StemTagPOS::WORD_POSTAG, Text::StemTagPOS::WORD_INDEX, Text::StemTagPOS::WORD_CHAR_POSITION, Text::StemTagPOS::WORD_CHAR_LENGTH, Text::StemTagPOS::WORD_SENTENCE_ID, and Text::StemTagPOS::WORD_USER_DEFINED are used to access the information about each word.

  use Text::StemTagPOS;
  use Data::Dump qw(dump);
  my $stemTagger = Text::StemTagPOS->new;
  my $text = 'The first sentence. Sentence number two.';
  my $listOfStemmedTaggedSentences = $stemTagger->getStemmedAndTaggedText ($text);
  dump $listOfStemmedTaggedSentences;

  #  dumps:
  #  [
  #    [
  #      ["the", "The", "/DET", 0, 0, 3, 0],
  #      [" ", " ", "/PGP", 1, 3, 1, 0],
  #      ["first", "first", "/JJ", 2, 4, 5, 0],
  #      [" ", " ", "/PGP", 3, 9, 1, 0],
  #      ["sentenc", "sentence", "/NN", 4, 10, 8, 0],
  #      [".", ".", "/PP", 5, 18, 1, 0],
  #      [" ", " ", "/PGP", 6, 19, 1, 0],
  #    ],
  #    [
  #      ["sentenc", "Sentence", "/NN", 7, 20, 8, 1],
  #      [" ", " ", "/PGP", 8, 28, 1, 1],
  #      ["number", "number", "/NN", 9, 29, 6, 1],
  #      [" ", " ", "/PGP", 10, 35, 1, 1],
  #      ["two", "two", "/CD", 11, 36, 3, 1],
  #      [".", ".", "/PP", 12, 39, 1, 1],
  #    ],
  #  ]

  my $word = $listOfStemmedTaggedSentences->[0][0];
  print
    'WORD_STEMMED: ' .
    "'" . $word->[Text::StemTagPOS::WORD_STEMMED] . "'\n" .
    'WORD_ORIGINAL: ' .
    "'" . $word->[Text::StemTagPOS::WORD_ORIGINAL] . "'\n" .
    'WORD_POSTAG: ' .
    "'" . $word->[Text::StemTagPOS::WORD_POSTAG] . "'\n" .
    'WORD_INDEX: ' .
    $word->[Text::StemTagPOS::WORD_INDEX] . "\n" .
    'WORD_CHAR_POSITION: ' .
    $word->[Text::StemTagPOS::WORD_CHAR_POSITION] . "\n" .
    'WORD_CHAR_LENGTH: ' .
    $word->[Text::StemTagPOS::WORD_CHAR_LENGTH] . "\n";

  #  prints:
  #  WORD_STEMMED: 'the'
  #  WORD_ORIGINAL: 'The'
  #  WORD_POSTAG: '/DET'
  #  WORD_INDEX: 0
  #  WORD_CHAR_POSITION: 0
  #  WORD_CHAR_LENGTH: 3

The following example shows the various ways the text can be passed to the method:

  use Text::StemTagPOS;
  use Data::Dump qw(dump);
  my $stemTagger = Text::StemTagPOS->new;
  my $text = 'This is a sentence with seven words.';
  dump $stemTagger->getStemmedAndTaggedText ($text,
    [$text, \$text], ($text, \$text));
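
The returned hierarchy can be walked with ordinary nested loops. The sketch below does not load the module: it copies the dump shown above for 'The first sentence. Sentence number two.' and uses locally defined field positions that are assumed to mirror the order in that dump. In real code the Text::StemTagPOS::WORD_* constants should be used instead of these hypothetical stand-ins.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the Text::StemTagPOS::WORD_* constants; the
# positions are assumed from the field order visible in the dump.
use constant {
  F_STEMMED => 0, F_ORIGINAL => 1, F_POSTAG => 2, F_INDEX => 3,
  F_CHAR_POSITION => 4, F_CHAR_LENGTH => 5, F_SENTENCE_ID => 6,
};

my $text = 'The first sentence. Sentence number two.';

# Data copied from the dump of getStemmedAndTaggedText ($text).
my $sentences = [
  [
    ["the", "The", "/DET", 0, 0, 3, 0],
    [" ", " ", "/PGP", 1, 3, 1, 0],
    ["first", "first", "/JJ", 2, 4, 5, 0],
    [" ", " ", "/PGP", 3, 9, 1, 0],
    ["sentenc", "sentence", "/NN", 4, 10, 8, 0],
    [".", ".", "/PP", 5, 18, 1, 0],
    [" ", " ", "/PGP", 6, 19, 1, 0],
  ],
  [
    ["sentenc", "Sentence", "/NN", 7, 20, 8, 1],
    [" ", " ", "/PGP", 8, 28, 1, 1],
    ["number", "number", "/NN", 9, 29, 6, 1],
    [" ", " ", "/PGP", 10, 35, 1, 1],
    ["two", "two", "/CD", 11, 36, 3, 1],
    [".", ".", "/PP", 12, 39, 1, 1],
  ],
];

# Rebuild each sentence from its original tokens, and count any token whose
# character position and length do not point back at the same text.
my @rebuilt;
my $mismatches = 0;
for my $sentence (@$sentences) {
  push @rebuilt, join '', map { $_->[F_ORIGINAL] } @$sentence;
  for my $word (@$sentence) {
    my $slice = substr $text, $word->[F_CHAR_POSITION], $word->[F_CHAR_LENGTH];
    $mismatches++ if $slice ne $word->[F_ORIGINAL];
  }
}
print "[$_]\n" for @rebuilt;
print "offset mismatches: $mismatches\n";
```

Because whitespace and punctuation are kept as tokens, joining the original words of every sentence reproduces the input text.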

getTaggedTextToKeep

 getTaggedTextToKeep (listOfStemmedTaggedSentences => [...],
  listOfPOSTypesToKeep => [...], listOfPOSTagsToKeep => [...]);

The method getTaggedTextToKeep returns all the array references of the words that have a part-of-speech tag that is of a type specified by listOfPOSTypesToKeep or listOfPOSTagsToKeep. The word lists returned have the same hierarchical sentence structure used by listOfStemmedTaggedSentences. Note that listOfPOSTypesToKeep and listOfPOSTagsToKeep are optional parameters. If neither is defined, the values used when the object was instantiated are used. If one of them is defined, its values override the default values.

listOfStemmedTaggedSentences
 listOfStemmedTaggedSentences => [...]

listOfStemmedTaggedSentences is the array reference returned by getStemmedAndTaggedText or a previous call to getTaggedTextToKeep.

listOfPOSTypesToKeep and/or listOfPOSTagsToKeep
 listOfPOSTypesToKeep => [...], listOfPOSTagsToKeep => [...]

listOfPOSTypesToKeep and listOfPOSTagsToKeep define the list of parts-of-speech types to be retained when filtering previously tagged text. Permitted values for listOfPOSTypesToKeep are 'ALL', 'ADJECTIVES', 'ADVERBS', 'CONTENT_WORDS', 'NOUNS', 'PUNCTUATION', 'TEXTRANK_WORDS', and 'VERBS'. For the possible values of listOfPOSTagsToKeep, see the method getListOfPartOfSpeechTags. Note that listOfPOSTypesToKeep and listOfPOSTagsToKeep are optional parameters. If neither is defined, the values used when the object was instantiated are used. If one of them is defined, its values override the default values.

  use Text::StemTagPOS;
  use Data::Dump qw(dump);
  my $stemTagger = Text::StemTagPOS->new;
  my $text = 'This is the first sentence. This is the last sentence.';
  my $listOfStemmedTaggedSentences = $stemTagger->getStemmedAndTaggedText ($text);
  dump $stemTagger->getTaggedTextToKeep (
    listOfStemmedTaggedSentences => $listOfStemmedTaggedSentences);

  #  dumps:
  #  [
  #    [
  #      ["first", "first", "/JJ", 6, 12, 5, 0],
  #      ["sentenc", "sentence", "/NN", 8, 18, 8, 0],
  #    ],
  #    [
  #      ["last", "last", "/JJ", 17, 40, 4, 1],
  #      ["sentenc", "sentence", "/NN", 19, 45, 8, 1],
  #    ],
  #  ]
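
The effect of such a filter can be pictured in plain Perl. The sketch below is only an illustration of the idea, not the module's implementation: the tag set (/JJ and /NN) is an assumption taken from the tags visible in the dumps, while the real TEXTRANK_WORDS type presumably covers every noun and adjective tag. The data are copied from the dump for 'The first sentence. Sentence number two.', and F_STEMMED and F_POSTAG are hypothetical stand-ins for the module's constants.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for Text::StemTagPOS::WORD_STEMMED and WORD_POSTAG.
use constant { F_STEMMED => 0, F_POSTAG => 2 };

# Assumed tag set for this illustration only.
my %keep = map { $_ => 1 } qw(/JJ /NN);

my $sentences = [
  [
    ["the", "The", "/DET", 0, 0, 3, 0],
    [" ", " ", "/PGP", 1, 3, 1, 0],
    ["first", "first", "/JJ", 2, 4, 5, 0],
    [" ", " ", "/PGP", 3, 9, 1, 0],
    ["sentenc", "sentence", "/NN", 4, 10, 8, 0],
    [".", ".", "/PP", 5, 18, 1, 0],
    [" ", " ", "/PGP", 6, 19, 1, 0],
  ],
  [
    ["sentenc", "Sentence", "/NN", 7, 20, 8, 1],
    [" ", " ", "/PGP", 8, 28, 1, 1],
    ["number", "number", "/NN", 9, 29, 6, 1],
    [" ", " ", "/PGP", 10, 35, 1, 1],
    ["two", "two", "/CD", 11, 36, 3, 1],
    [".", ".", "/PP", 12, 39, 1, 1],
  ],
];

# Keep the sentence level intact and drop the words whose tag is not wanted.
my $kept = [
  map { [ grep { $keep{ $_->[F_POSTAG] } } @$_ ] } @$sentences
];

# One line of stems per sentence.
my @stemLines = map { join ' ', map { $_->[F_STEMMED] } @$_ } @$kept;
print "$_\n" for @stemLines;
```

The kept words are the same array references as in the input, so their word indices and character positions still refer to the original text.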

getWordsPhrasesInTaggedText

 getWordsPhrasesInTaggedText (listOfStemmedTaggedSentences => ...,
    listOfPhrasesToFind => [...],  listOfPOSTypesToKeep => [...],
    listOfPOSTagsToKeep => [...]);

The method getWordsPhrasesInTaggedText returns a reference to an array in which each entry corresponds, in order, to a word or phrase in listOfPhrasesToFind. The value of each entry is a list of the locations where that word or phrase was found. Each location is an integer pair of the form [first-word-index, last-word-index], where first-word-index is the index of the first word of the phrase and last-word-index is the index of the last word. The index values are those assigned to the stemmed and tagged words in listOfStemmedTaggedSentences.

  [
    [ # first phrase locations
      [first word index, last word index],
      [first word index, last word index], ...
    ]
    [ # second phrase locations
      [first word index, last word index],
      [first word index, last word index], ...
    ]
    ...
  ]

listOfStemmedTaggedSentences
 listOfStemmedTaggedSentences => [...]

listOfStemmedTaggedSentences is the array reference returned by getStemmedAndTaggedText or getTaggedTextToKeep.

listOfPhrasesToFind
 listOfPhrasesToFind => [...]

listOfPhrasesToFind is an array reference containing a list of strings of text that are either single words or phrases that are to be located in the text provided by listOfStemmedTaggedSentences. Before the words or phrases are located they are filtered using listOfPOSTypesToKeep or listOfPOSTagsToKeep.

listOfPOSTypesToKeep and/or listOfPOSTagsToKeep
 listOfPOSTypesToKeep => [...], listOfPOSTagsToKeep => [...]

listOfPOSTypesToKeep and listOfPOSTagsToKeep define the list of parts-of-speech types to be retained when filtering previously tagged text. Permitted values for listOfPOSTypesToKeep are 'ALL', 'ADJECTIVES', 'ADVERBS', 'CONTENT_WORDS', 'NOUNS', 'PUNCTUATION', 'TEXTRANK_WORDS', and 'VERBS'. For the possible values of listOfPOSTagsToKeep, see the method getListOfPartOfSpeechTags. Note that listOfPOSTypesToKeep and listOfPOSTagsToKeep are optional parameters. If neither is defined, the values used when the object was instantiated are used. If one of them is defined, its values override the default values.

The code below illustrates the output format:

  use Text::StemTagPOS;
  use Data::Dump qw(dump);
  my $stemTagger = Text::StemTagPOS->new;
  my $text = 'This is the first sentence. This is the last sentence.';
  my $listOfStemmedTaggedSentences = $stemTagger->getStemmedAndTaggedText ($text);
  dump $listOfStemmedTaggedSentences;
  my $listOfWordsOrPhrasesToFind = ['first sentence','this is',
    'third sentence', 'sentence'];
  my $phraseLocations = $stemTagger->getWordsPhrasesInTaggedText (
    listOfPOSTypesToKeep => [qw(ALL)],
    listOfStemmedTaggedSentences => $listOfStemmedTaggedSentences,
    listOfWordsOrPhrasesToFind => $listOfWordsOrPhrasesToFind);
  dump $phraseLocations;
  # [
  #   [[6, 8]],           # 'first sentence'
  #   [[0, 2], [11, 13]], # 'this is': note the first period in the text has index 9.
  #   [],                 # 'third sentence'
  #   [[8, 8], [19, 19]]  # 'sentence'
  # ]
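
The index pairs can be mapped back to the text they cover. The sketch below avoids loading the module by using a regular expression as a stand-in for the tagger output: it splits the text into words, whitespace runs, and single punctuation marks, which is an assumption about tokenization that happens to reproduce the word indices of this example. The phrase locations are copied from the output above.

```perl
use strict;
use warnings;

my $text = 'This is the first sentence. This is the last sentence.';

# Stand-in for the tagged words: the position of a token in this list is
# assumed to equal its word index (true for this text, per the dumps above).
my @tokens = $text =~ /(\w+|\s+|[^\w\s])/g;

# Phrase locations copied from the example output.
my %locations = (
  'first sentence' => [[6, 8]],
  'this is'        => [[0, 2], [11, 13]],
  'sentence'       => [[8, 8], [19, 19]],
);

# Join the tokens from the first to the last word index of each location.
for my $phrase (sort keys %locations) {
  for my $pair (@{ $locations{$phrase} }) {
    my ($first, $last) = @$pair;
    my $found = join '', @tokens[$first .. $last];
    print "$phrase [$first, $last] => '$found'\n";
  }
}
```

Both indices in a pair are inclusive, so a single word such as 'sentence' is reported with equal first and last indices.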

getListOfPartOfSpeechTags

The method getListOfPartOfSpeechTags takes no parameters. It returns an array reference where each item in the list is of the form [part of speech tag, description, examples]. It is meant for getting the part-of-speech tags that can be used to populate listOfPOSTagsToKeep.

  use Text::StemTagPOS;
  use Data::Dump qw(dump);
  my $stemTagger = Text::StemTagPOS->new;
  dump $stemTagger->getListOfPartOfSpeechTags;

getListOfStemmedWordsInText

The method getListOfStemmedWordsInText returns an array reference of the sorted stemmed words in the text given by listOfStemmedTaggedSentences.

listOfStemmedTaggedSentences
 listOfStemmedTaggedSentences => [...]

listOfStemmedTaggedSentences is the array reference returned by getStemmedAndTaggedText or getTaggedTextToKeep.

  use Text::StemTagPOS;
  use Data::Dump qw(dump);
  my $stemTagger = Text::StemTagPOS->new;
  my $text = 'The first sentence. Sentence number two.';
  my $listOfStemmedTaggedSentences = $stemTagger->getStemmedAndTaggedText ($text);
  dump $stemTagger->getListOfStemmedWordsInText (
    listOfStemmedTaggedSentences => $listOfStemmedTaggedSentences);

getListOfStemmedWordsInAllDocuments

The method getListOfStemmedWordsInAllDocuments returns an array reference of the sorted stemmed words in the intersection of all the words in the documents given by listOfStemmedTaggedDocuments.

listOfStemmedTaggedDocuments
 listOfStemmedTaggedDocuments => [...]

listOfStemmedTaggedDocuments is a list of document references returned by getStemmedAndTaggedText or getTaggedTextToKeep.
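
The intersection described above can be sketched in plain Perl. This only illustrates what "words common to all documents" means and is not the module's implementation. The two documents are hypothetical and abbreviated so that each word carries only its stem, and F_STEMMED is a hypothetical stand-in for Text::StemTagPOS::WORD_STEMMED.

```perl
use strict;
use warnings;

# Hypothetical stand-in for Text::StemTagPOS::WORD_STEMMED.
use constant F_STEMMED => 0;

# Two hypothetical documents in the sentence/word hierarchy, with only the
# stemmed-word field filled in because it is the only one used here.
my @documents = (
  [ [ ['first'], ['sentenc'] ], [ ['sentenc'], ['number'], ['two'] ] ],
  [ [ ['last'],  ['sentenc'] ], [ ['number'],  ['one'] ] ],
);

# Count, for every stem, the number of documents it occurs in.
my %documentCount;
for my $document (@documents) {
  my %seen;
  for my $sentence (@$document) {
    $seen{ $_->[F_STEMMED] } = 1 for @$sentence;
  }
  $documentCount{$_}++ for keys %seen;
}

# A stem is in the intersection when it occurs in every document.
my @common = grep { $documentCount{$_} == @documents } keys %documentCount;
@common = sort @common;
print "@common\n";
```

Using a per-document %seen hash ensures that a stem repeated within one document is counted only once for that document.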

INSTALLATION

To install the module run the following commands:

  perl Makefile.PL
  make
  make test
  make install

If you are on Windows, you should use 'nmake' rather than 'make'.

AUTHOR

 Jeff Kubina <jeff.kubina@gmail.com>

BUGS

Please email bug reports or feature requests to bug-text-stemtagpos@rt.cpan.org, or submit them through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Text-StemTagPOS. The author will be notified and you can be automatically notified of progress on the bug fix or feature request.

COPYRIGHT

Copyright (c) 2010 Jeff Kubina. All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

The full text of the license can be found in the LICENSE file included with this module.

KEYWORDS

natural language processing, NLP, part of speech tagging, POS, stemming

SEE ALSO

Encode, Lingua::Stem::Snowball, Lingua::EN::Tagger, perlunicode, Text::Iconv, utf8

See the Lingua::EN::Tagger README file for a list of the part-of-speech tags.