Treex::Tool::Segment::RuleBased - Rule-based pseudo language-independent sentence segmenter
Sentence boundaries are detected using regex rules that look for end-of-sentence punctuation ([.?!]) followed by an uppercase letter.
This class is implemented in a pseudo language-independent way,
but it can be used as an ancestor for language-specific segmentation, either by wrapping the segmentation method
(using around, see Moose::Manual::MethodModifiers) or just by overriding the methods documented below.
Returns a list of sentences.
Do the segmentation (handling
Adds newlines after terminal punctuation followed by an uppercase letter.
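The rule described above can be illustrated with a minimal sketch. The function name split_sketch is hypothetical and the substitution is only an assumption about how such a rule might look; it is not the module's actual code.

```perl
use strict;
use warnings;

# Hypothetical helper, not the module's real implementation: replace the
# whitespace after terminal punctuation with a newline when the next
# character is an uppercase letter.
sub split_sketch {
    my ($text) = @_;
    # [.?!] is the terminal punctuation; \p{Lu} matches any uppercase letter.
    $text =~ s/([.?!])\s+(?=\p{Lu})/$1\n/g;
    return $text;
}

print split_sketch('It rained. We stayed home! Why not? nobody knows.'), "\n";
```

The lookahead leaves the uppercase letter in place, so only the whitespace between the two sentences is replaced. The question mark before the lowercase "nobody" does not trigger a break.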
Returns a regex that should match tokens that usually do not end a sentence even if they are followed by a period and a capital letter:
* single uppercase letters usually serve as first-name initials
* in language-specific descendants, consider adding
  * period-ending items that never indicate sentence breaks
  * titles before names of persons, etc.
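As a rough illustration of how such a regex might be applied, the sketch below first splits at every candidate boundary and then rejoins lines that end in a non-breaking token. The pattern, the English titles in it, and the name segment_sketch are all assumptions made for this example, not the module's own definitions.

```perl
use strict;
use warnings;

# Assumed pattern: single uppercase letters (initials) plus a few English
# titles added purely for illustration.
my $UNBREAKERS = qr/\p{Lu}|Mr|Mrs|Dr|Prof/;

sub segment_sketch {
    my ($text) = @_;
    # Step 1: break after terminal punctuation followed by an uppercase letter.
    $text =~ s/([.?!])\s+(?=\p{Lu})/$1\n/g;
    # Step 2: undo breaks that directly follow a non-breaking token and a period.
    $text =~ s/\b($UNBREAKERS)\.\n/$1. /g;
    return split /\n/, $text;
}

my @sentences = segment_sketch('Dr. J. Smith arrived. He met Prof. Lee.');
print "$_\n" for @sentences;
```

The word boundary before the pattern keeps the final letter of an all-caps word from being mistaken for an initial.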
Returns a string of characters that can appear before the first word of a sentence.
Returns a string of characters that can appear after a period (or other end-sentence symbol).
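One way these two character strings could enter the splitting rule is sketched below. The ASCII-only sets and the name split_with_quotes are assumptions for illustration; a real segmenter would probably also list typographic quotes.

```perl
use strict;
use warnings;

# Assumed, ASCII-only character sets (illustrative values only).
my $OPENINGS = q{"'(};    # may precede the first word of a sentence
my $CLOSINGS = q{"')};    # may follow the end-sentence symbol

sub split_with_quotes {
    my ($text) = @_;
    # Closing characters stay attached to the sentence they end; the
    # lookahead skips opening characters before the uppercase letter.
    $text =~ s/([.?!][$CLOSINGS]*)\s+(?=[$OPENINGS]*\p{Lu})/$1\n/g;
    return $text;
}

print split_with_quotes(q{He said "Stop." (She did not.) "Why?" he asked.}), "\n";
```

Both strings are interpolated into character classes, so a subclass could extend them with language-specific quotation marks without touching the rule itself.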
Martin Popel <firstname.lastname@example.org>
Copyright © 2011 by Institute of Formal and Applied Linguistics, Charles University in Prague
This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.