Text::TEI::Collate (Text-TEI-Collate-1.0, Tara Andrews)

Module Version: 1.0

SYNOPSIS

  use IO::File;
  use Text::TEI::Collate;
  my $aligner = Text::TEI::Collate->new();

  # Read from strings; pass any further strings as extra arguments.
  my @collated_texts = $aligner->align( $string1, $string2 );

  # Read from filehandles, each opened read-only as UTF-8.
  my $fh1 = IO::File->new( $first_file, "<:utf8" )
      or die "Cannot open $first_file: $!";
  my $fh2 = IO::File->new( $second_file, "<:utf8" )
      or die "Cannot open $second_file: $!";
  # ...
  my @collated_from_fh = $aligner->align( $fh1, $fh2 );

DESCRIPTION

Text::TEI::Collate is the beginning of a collation program for multiple (transcribed) manuscript copies of a known text. Its interface is object-oriented, mostly for the convenience of the author and so that settings can be held globally on the object.

The object is the alignment engine, or "aligner". The method that a user will care about is "align"; the other methods in this file are public in case a user needs a subset of this package's functionality.

An aligner takes two or more texts; the texts can either be strings or IO::File objects. It returns two or more arrays -- one for each text input -- in which identical and similar words are lined up with each other, via empty-string padding.
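
To make the padded output concrete, here is a minimal sketch of printing an alignment as a table. It does not call the module: the @collated data is written by hand for illustration and is not real aligner output, the format_alignment helper is hypothetical, and the sketch assumes each returned text is an array reference of equal length.

```perl
use strict;
use warnings;

# Hand-written stand-in for aligner output (an assumption, not real results):
# one array ref per text, all the same length, '' marking a gap.
my @collated = (
    [ 'the', 'quick', 'brown',  'fox' ],
    [ 'the', '',      'brown',  'fox' ],
    [ 'the', 'quick', 'browne', ''    ],
);

# Hypothetical helper: render one line per text, padding every column to
# the width of its longest word and showing gaps as '-'.
sub format_alignment {
    my @texts   = @_;
    my $columns = scalar @{ $texts[0] };

    # Find the display width of each column.
    my @widths;
    for my $col ( 0 .. $columns - 1 ) {
        my $max = 1;
        for my $text (@texts) {
            my $len = length $text->[$col];
            $max = $len if $len > $max;
        }
        push @widths, $max;
    }

    # Build the padded lines.
    my @lines;
    for my $text (@texts) {
        my @cells;
        for my $col ( 0 .. $columns - 1 ) {
            my $word = $text->[$col] eq '' ? '-' : $text->[$col];
            push @cells, sprintf( '%-*s', $widths[$col], $word );
        }
        push @lines, join( ' | ', @cells );
    }
    return @lines;
}

print "$_\n" for format_alignment(@collated);
```

Reading down a column shows which words the texts share at that position and where one text has nothing corresponding to the others.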

* TODO: describe word objects

METHODS

new

Creates a new aligner object. Takes a hash of options; the available options are listed below.

debug - Default 0. The higher the number (between 0 and 3), the more debugging output is produced.
distance_sub - A reference to a function that calculates a Levenshtein-like distance between two words. Default is Text::WagnerFischer::distance.
fuzziness - The maximum allowable word distance for an approximate match, expressed as a percentage: the Levenshtein distance divided by the word length.
punct_as_word - Treat punctuation as separate words. Not yet implemented.
not_punct - Takes an array ref of characters that should not be treated as punctuation.
accents - Takes an array ref of characters that should be treated as accent marks. (TODO: discuss the difference between punctuation and accents.)
canonizer - Takes a subroutine ref. The sub should take a string and return a string. If defined, it will be called to produce a canonical form of the string in question. Useful for getting rid of ligatures, un-composing characters, correcting common spelling mistakes, etc.
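
The canonizer and fuzziness options can be illustrated with a small standalone sketch. Everything here is an assumption made for illustration: canonize, edit_distance and is_fuzzy_match are hypothetical names, edit_distance is a plain Levenshtein stand-in rather than Text::WagnerFischer::distance, and the choice to divide by the longer word's length is a guess, since the text above does not say which word length is meant.

```perl
use strict;
use warnings;
use feature 'unicode_strings';
use Unicode::Normalize qw( NFD );

# Hypothetical canonizer: takes a string, returns a string.
sub canonize {
    my ($word) = @_;
    $word = lc $word;
    $word =~ s/\x{e6}/ae/g;    # expand the ae ligature
    $word = NFD($word);        # un-compose accented characters
    $word =~ s/\p{Mn}//g;      # drop the combining marks left behind
    return $word;
}

# Plain Levenshtein distance, kept two rows at a time
# (a stand-in, not the module's default distance function).
sub edit_distance {
    my ( $s, $t ) = @_;
    my @prev = ( 0 .. length($t) );
    for my $i ( 1 .. length($s) ) {
        my @cur = ($i);
        for my $j ( 1 .. length($t) ) {
            my $cost = substr( $s, $i - 1, 1 ) eq substr( $t, $j - 1, 1 ) ? 0 : 1;
            my $best = $prev[ $j - 1 ] + $cost;                        # substitute
            $best = $prev[$j] + 1      if $prev[$j] + 1 < $best;       # delete
            $best = $cur[ $j - 1 ] + 1 if $cur[ $j - 1 ] + 1 < $best;  # insert
            push @cur, $best;
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Assumed reading of "fuzziness": distance as a percentage of the longer
# word's length must not exceed the threshold.
sub is_fuzzy_match {
    my ( $word1, $word2, $fuzziness ) = @_;
    my $longest = length($word1) > length($word2) ? length($word1) : length($word2);
    return 1 if $longest == 0;
    my $percent = 100 * edit_distance( $word1, $word2 ) / $longest;
    return $percent <= $fuzziness ? 1 : 0;
}

print canonize("Caf\x{e9}"), "\n";
print is_fuzzy_match( 'color', 'colour', 20 ) ? "match\n" : "no match\n";
```

A sub such as canonize could be supplied as canonizer => \&canonize, and a distance function as distance_sub => \&edit_distance, in the options hash given to new.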

align

This is the meat of the program. Takes a list of strings or a list of IO::File objects. (The latter is useful if the text you are collating is particularly long.) Returns a list of collated texts. Currently each "text" is simply a list of words, padded for collation with empty strings; soon it will be a list of word objects, which I have yet to describe.
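
For the filehandle form, a sketch of opening several files at once may help. It writes two throwaway files so that it runs on its own; open_witnesses is a hypothetical helper, and the align call is shown only in a comment because it needs the module installed.

```perl
use strict;
use warnings;
use IO::File;
use File::Temp qw( tempfile );

# Create two small throwaway files so the sketch is self-contained.
my @files;
for my $text ( "the quick brown fox\n", "the brown fox\n" ) {
    my ( $tmp_fh, $name ) = tempfile( UNLINK => 1 );
    binmode $tmp_fh, ':utf8';
    print {$tmp_fh} $text;
    close $tmp_fh or die "Cannot close $name: $!";
    push @files, $name;
}

# Hypothetical helper: open each named file read-only as UTF-8 and
# return the IO::File objects in the same order.
sub open_witnesses {
    my @handles;
    for my $name (@_) {
        my $fh = IO::File->new( $name, '<:utf8' )
            or die "Cannot open $name: $!";
        push @handles, $fh;
    }
    return @handles;
}

my @handles = open_witnesses(@files);

# With Text::TEI::Collate installed, the handles would be passed on as:
#   my @collated = Text::TEI::Collate->new()->align(@handles);
print scalar(@handles), " filehandles ready\n";
```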

BUGS / TODO

AUTHOR

Tara L Andrews <aurum@cpan.org>