NAME

Treex::Block::W2A::SegmentOnNewlines - segment text on new lines

VERSION

version 0.13095

DESCRIPTION

The source text is segmented into sentences, which are stored in document bundles. If the document contains no bundles, they are created. Otherwise, the document must already contain exactly as many bundles as there are sentences produced by this block. This means that this block (or its derivatives) can be applied to more than one language of the same document:

    treex Read::Text language=en from=en.txt \
          Read::Text language=de from=de.txt \
          W2A::SegmentOnNewlines language=en \
          W2A::SegmentOnNewlines language=de \
          ...
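
Because the second run must yield exactly as many sentences as there are existing bundles, it can help to compare the segment counts of the input texts beforehand. The following is a minimal sketch of such a check, not part of Treex: the helper count_segments and the sample strings are hypothetical, and it assumes plain newline splitting with no empty sentences.

```perl
use strict;
use warnings;

# Hypothetical pre-flight check, not part of Treex: counts
# newline-separated segments the same way for both texts, so that
# a mismatch can be caught before running the scenario.
sub count_segments {
    my ($text) = @_;
    my @segments = split /\n/, $text;
    return scalar @segments;
}

# Sample inputs standing in for the contents of en.txt and de.txt.
my $en = "Hello.\nHow are you?\n";
my $de = "Hallo.\nWie geht es dir?\n";

print "segment counts match\n"
    if count_segments($en) == count_segments($de);
```

Note that Perl's split drops trailing empty fields, so a final newline at the end of the text does not add an extra segment in this sketch.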

This class detects sentences based on the newlines in the source text, but it can also serve as a base class for more appropriate segmentation strategies, which override the method get_segments.

ATTRIBUTES

allow_empty_sentences

If set, empty sentences can be produced.

delete_empty_sentences

If set, empty sentences are automatically deleted.

If neither of these attributes is set and an empty sentence is found, a fatal error is raised.
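
The interaction of the two attributes can be illustrated with a small stand-alone sketch. It is not the Treex implementation: the function filter_empty_sentences is hypothetical, and the precedence used when both options are set (deletion wins) is an assumption that the documentation does not confirm.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Treex: applies the documented
# empty-sentence policy to a list of already-normalized sentences.
sub filter_empty_sentences {
    my ($sentences, %opt) = @_;
    my @kept;
    for my $sentence (@$sentences) {
        if ($sentence eq '') {
            # Assumption: deletion wins if both options are set.
            next if $opt{delete_empty_sentences};
            # Neither option set: the empty sentence is fatal.
            die "Empty sentence found\n"
                if !$opt{allow_empty_sentences};
        }
        push @kept, $sentence;
    }
    return @kept;
}
```

With delete_empty_sentences the empty entries are dropped, with allow_empty_sentences they are kept, and with neither the call dies on the first empty sentence.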

METHODS

my @sentences = $self->get_segments($text)

This method produces a list of segments (sentences) from the given text string. This implementation only splits on newlines. It is intended to be overridden in subclasses.
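
The override pattern might look roughly like the sketch below. To keep it runnable without Treex installed, it uses plain Perl stand-in packages (Sketch::SegmentOnNewlines and Sketch::SegmentOnBlankLines, both hypothetical) rather than the real block class; a real subclass would extend Treex::Block::W2A::SegmentOnNewlines instead. Splitting on blank lines is just one illustrative alternative strategy.

```perl
use strict;
use warnings;

# Stand-in for the real block, so the sketch runs without Treex.
package Sketch::SegmentOnNewlines;

sub new { my ($class) = @_; return bless {}, $class; }

# Assumed default behaviour: one segment per line of the text.
sub get_segments {
    my ($self, $text) = @_;
    return split /\n/, $text;
}

# Hypothetical subclass: overrides get_segments to split on blank
# lines, so a paragraph wrapped over several lines stays one segment.
package Sketch::SegmentOnBlankLines;
our @ISA = ('Sketch::SegmentOnNewlines');

sub get_segments {
    my ($self, $text) = @_;
    return split /\n\s*\n/, $text;
}

package main;

my $text = "First line\nsecond line\n\nNew paragraph";
my @by_line  = Sketch::SegmentOnNewlines->new->get_segments($text);
my @by_block = Sketch::SegmentOnBlankLines->new->get_segments($text);
```

Only get_segments changes in the subclass; everything else is inherited, which is the design the documentation describes for derived blocks.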

my $norm_sentence = $self->normalize_sentence($raw_sentence)

This method normalizes a sentence, e.g. it trims leading and trailing whitespace.
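
The trimming step can be sketched as a stand-alone function. This is an illustration only: it is written as a plain function rather than a method, and the real normalize_sentence may perform further normalization that is not shown here.

```perl
use strict;
use warnings;

# Hypothetical stand-alone version of the trimming step; the real
# method may do more than strip surrounding whitespace.
sub normalize_sentence {
    my ($raw_sentence) = @_;
    # Copy the input, then remove leading and trailing whitespace.
    (my $norm = $raw_sentence) =~ s/^\s+|\s+$//g;
    return $norm;
}
```

A sentence consisting only of whitespace becomes the empty string here, which is the case the allow_empty_sentences and delete_empty_sentences attributes deal with.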

AUTHOR

Martin Popel <popel@ufal.mff.cuni.cz>

COPYRIGHT AND LICENSE

Copyright © 2011 by Institute of Formal and Applied Linguistics, Charles University in Prague

This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
