Module Version: 0.1601

NAME

Lingua::TokenParse - Parse a word into scored fragment combinations

SYNOPSIS

  use Lingua::TokenParse;
  use Data::Dumper;
  my $p = Lingua::TokenParse->new(
    word => 'antidisthoughtlessfulneodeoxyribonucleicfoo',
    lexicon => {
        a    => 'not',
        anti => 'opposite',
        di   => 'two',
        dis  => 'separation',
        eo   => 'hmmmmm',  # etc.
    },
    constraints => [ qr/eo(?:\.|$)/ ], # no parts ending in eo allowed
  );
  print Dumper($p->knowns);

DESCRIPTION

Lingua::TokenParse parses a given word into familiar combinations based on a lexicon of known word parts. The lexicon is a simple fragment => definition list.

Words like "automobile" and "deoxyribonucleic" are composed of different roots, prefixes, suffixes, etc. With a lexicon of known fragments, a word can be partitioned into a list of its (possibly overlapping) known and unknown fragment combinations.

These combinations can be given a score, which represents a measure of word familiarity. This measure is a set of ratios of known to unknown fragments and letters.
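
To make the idea concrete, the following minimal sketch enumerates every way to cut a short word into contiguous fragments and computes two ratios for each cut. It is not the module's implementation: the helper names splits and score are hypothetical, and the ratios are taken as known over total (rather than known over unknown) simply to avoid dividing by zero. The exact formula the module uses is not given in this document.

```perl
use strict;
use warnings;

# Hypothetical helper: every way to cut $word into contiguous fragments.
# A word of n letters has 2**(n-1) such cuts, so keep the input short.
sub splits {
    my ($word) = @_;
    my @out;
    for my $i ( 1 .. length($word) ) {
        my $head = substr( $word, 0, $i );
        my $tail = substr( $word, $i );
        if ( $tail eq '' ) {
            push @out, [$head];
        }
        else {
            push @out, [ $head, @$_ ] for splits($tail);
        }
    }
    return @out;
}

# Hypothetical scoring: (known fragments / all fragments,
#                        known letters   / all letters).
sub score {
    my ( $parts, $lexicon ) = @_;
    my @known = grep { exists $lexicon->{$_} } @$parts;
    my ( $known_letters, $all_letters ) = ( 0, 0 );
    $known_letters += length($_) for @known;
    $all_letters   += length($_) for @$parts;
    return ( scalar(@known) / scalar(@$parts),
        $known_letters / $all_letters );
}

my %lexicon = ( a => 'not', anti => 'opposite' );
my @cuts    = splits('anti');

for my $cut (@cuts) {
    my ( $frag_ratio, $letter_ratio ) = score( $cut, \%lexicon );
    printf "%-8s fragments %.2f, letters %.2f\n",
        join( '.', @$cut ), $frag_ratio, $letter_ratio;
}
```

The exhaustive enumeration grows exponentially with word length, which is why a real parser needs pruning (see the constraints accessor).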

METHODS

new

  $p = Lingua::TokenParse->new(
      verbose => 0,
      word => $word,
      lexicon => \%lexicon,
      lexicon_file => $lexicon_file,
      constraints => \@constraints,
  );

Return a new Lingua::TokenParse object.

This method will automatically call the partition methods (detailed below) if a word and lexicon are provided.

The word can be any string. However, make sure that it does not include the characters you use for the separator, not_defined and unknown strings (described below).

The lexicon must be a hash reference with word fragments as keys and definitions as their respective values.

parse

  $p->parse;
  $p->parse($string);

This method clears the partition lists and then calls all the individual parsing methods that are detailed below. If a string is provided, the object's word attribute is first reset to that string.

build_parts

  $parts = $p->build_parts;

Construct an array reference of the word partitions.

build_definitions

  $known_definitions = $p->build_definitions;

Construct a table of the definitions of the word parts.

build_combinations

  $combos = $p->build_combinations;

Compute the array reference of all possible word part combinations.

build_knowns

  $raw_knowns = $p->build_knowns;

Compute the familiar word part combinations.

This method handles word parts containing prefix and suffix hyphens. These hyphens encode information about which word combinations are syntactically illegal, and that information can be used to score (or even throw out) bogus combinations.

lexicon_cache

  $p->lexicon_cache;
  $p->lexicon_cache( $lexicon_file );
  $p->lexicon_cache( lexicon_file => $lexicon_file );

Back up and retrieve the hash reference of token entries.

If this method is called with no arguments, the object's lexicon_file is used. If the method is called with a single argument, the object's lexicon_file attribute is temporarily overridden. If the method is called with two arguments and the first is the string "lexicon_file" then that attribute is set before proceeding.
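
Storable is listed under SEE ALSO, so the cache is presumably a serialized hash on disk. As a rough sketch of such a round trip, the following uses Storable directly rather than the module's own method. The file name is hypothetical, and whether lexicon_cache works exactly this way is an assumption.

```perl
use strict;
use warnings;
use Storable qw(store retrieve);
use File::Temp qw(tempdir);

my %lexicon = ( anti => 'opposite', dis => 'separation' );

# Hypothetical cache location inside a throwaway directory.
my $dir          = tempdir( CLEANUP => 1 );
my $lexicon_file = "$dir/lexicon.sto";

store( \%lexicon, $lexicon_file );          # back up the hash reference
my $restored = retrieve($lexicon_file);     # read it back as a new hash reference

print join( ', ', map { "$_ => $restored->{$_}" } sort keys %$restored ), "\n";
```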

CONVENIENCE METHOD

output_knowns

  @known_list = $p->output_knowns;
  print Dumper \@known_list;

  # Look at the "even friendlier output."
  print scalar $p->output_knowns(
      separator   => $separator,
      not_defined => $not_defined,
      unknown     => $unknown,
  );

This method returns the familiar word part combinations in a couple of "human accessible" formats. Each has familiarity scores rounded to two decimals and fragment definitions shown in a readable layout. It accepts the following optional arguments:

separator

The string used between fragment definitions. Default is a plus symbol surrounded by single spaces: ' + '.

not_defined

Indicates a known fragment that has no definition. Default is a single period: '.'.

unknown

Indicates an unknown fragment. The default is the question mark: '?'.
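
As an illustration of how these three strings might combine, here is a small stand-alone sketch. The render helper is hypothetical and is not the module's output_knowns. It assumes that an unknown fragment has an undef definition (as described under the definitions accessor) and that a known fragment without a definition carries an empty string, which is an assumption.

```perl
use strict;
use warnings;

# Hypothetical renderer: join the definitions of the given parts,
# substituting markers for undefined and unknown fragments.
sub render {
    my ( $parts, $definitions, %opt ) = @_;
    my $separator   = $opt{separator}   // ' + ';
    my $not_defined = $opt{not_defined} // '.';
    my $unknown     = $opt{unknown}     // '?';

    my @shown;
    for my $part (@$parts) {
        my $def = $definitions->{$part};
        if    ( !defined $def ) { push @shown, $unknown }
        elsif ( $def eq '' )    { push @shown, $not_defined }
        else                    { push @shown, $def }
    }
    return join $separator, @shown;
}

my %definitions = (
    anti => 'opposite',
    dis  => 'separation',
    s    => '',       # known fragment, no definition (assumed encoding)
    xyz  => undef,    # unknown fragment
);
my @parts = qw(anti dis xyz s);

print render( \@parts, \%definitions ), "\n";
print render( \@parts, \%definitions, separator => ' | ', unknown => '??' ), "\n";
```

Because the markers are plain strings mixed into the output, a word or definition containing the same characters would be ambiguous, which is why new warns against using them in the word.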

ACCESSORS

word

  $p->word($word);
  $word = $p->word;

The actual word to partition, which can be any string.

lexicon

  $p->lexicon(%lexicon);
  $p->lexicon(\%lexicon);
  $lexicon = $p->lexicon;

The lexicon is a hash reference with word fragments as keys and definitions as their respective values. It can be set with either a hash or a hash reference.
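
An accessor that takes either form can be written in a few lines. The sketch below uses a hypothetical My::Parser class to show the general pattern. It is not taken from the module's source.

```perl
use strict;
use warnings;

package My::Parser;    # hypothetical class, for illustration only

sub new { return bless { lexicon => {} }, shift }

sub lexicon {
    my $self = shift;
    if (@_) {
        # A single hash-reference argument is stored as is; anything
        # else is treated as a flat key/value list.
        $self->{lexicon} =
            ( @_ == 1 && ref( $_[0] ) eq 'HASH' ) ? $_[0] : {@_};
    }
    return $self->{lexicon};
}

package main;

my $p = My::Parser->new;

$p->lexicon( anti => 'opposite', dis => 'separation' );    # hash (list) form
print join( ',', sort keys %{ $p->lexicon } ), "\n";

$p->lexicon( { a => 'not' } );                              # reference form
print join( ',', sort keys %{ $p->lexicon } ), "\n";
```

Either way, the getter always returns a hash reference.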

parts

  $parts = $p->parts;

The computed array reference of all possible word partitions.

combinations

  $combinations = $p->combinations;

The computed array reference of all possible word part combinations.

knowns

  $knowns = $p->knowns;

The computed hash reference of known (non-zero scored) combinations with their familiarity values.

definitions

  $definitions = $p->definitions;

The hash reference of the definitions provided for each fragment of the combinations, with the values of unknown fragments set to undef.

constraints

  $constraints = $p->constraints;
  $p->constraints(\@regexps);

An optional, user-defined array reference of regular expressions to apply when constructing the list of parts and combinations. This acts as a negative pruning device: if any expression matches an entry, that entry is excluded from the list.
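
The pruning idea can be sketched without the module. The prune helper below is hypothetical. It also assumes that combinations are written as fragments joined by periods, as the \. in the synopsis pattern suggests.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only the items that match none of the constraints.
sub prune {
    my ( $items, $constraints ) = @_;
    return grep {
        my $item = $_;
        !grep { $item =~ $_ } @$constraints;
    } @$items;
}

my @constraints = ( qr/eo(?:\.|$)/ );    # pattern from the synopsis
my @candidates  = qw(neo ne.o neo.de de.oxy);

my @kept = prune( \@candidates, \@constraints );
print "@kept\n";    # prints "ne.o de.oxy"
```

Here neo and neo.de are dropped because a part ends in eo, while ne.o and de.oxy survive.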

EXAMPLES

Example code can be found in the distribution eg/ directory.

TO DO

Turn the lame output_knowns method into a sensible XML serializer (of optionally everything).

Compute the time required for a given parse.

Make a method to add definitions for unknown fragments and call it... learn().

Use traditional stemming to trim down the common knowns and see if the score is the same...

Synthesize a term list based on a thesaurus of word-part definitions. That is, go in reverse. Non-trivial!

SEE ALSO

Storable

Math::BaseCalc

DEDICATION

For my Grandmother and English teacher Frances Jones.

THANK YOU

Thank you to Luc St-Louis for helping me increase the speed while eliminating the exponential memory footprint. I wish I knew your email address so I could tell you. lucs++

AUTHOR

Gene Boggs <gene@cpan.org>

COPYRIGHT AND LICENSE

Copyright (C) 2003-2004 by Gene Boggs

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
