NAME

Lingua::Sentence - Perl extension for breaking text paragraphs into sentences

SYNOPSIS

        use Lingua::Sentence;

        my $splitter = Lingua::Sentence->new("en");

        my $text = 'This is a paragraph. It contains several sentences. "But why," you ask?';

        print $splitter->split($text);

DESCRIPTION

This module allows splitting of text paragraphs into sentences. It is based on scripts developed by Philipp Koehn and Josh Schroeder for processing the Europarl corpus (http://www.statmt.org/europarl/).

The module uses punctuation and capitalization clues to split paragraphs into a newline-separated string with one sentence per line. For example:

        This is a paragraph. It contains several sentences. "But why," you ask?

becomes:

        This is a paragraph.
        It contains several sentences.
        "But why," you ask?

Languages currently supported by the module are:

Catalan
Dutch
English
French
German
Greek
Italian
Portuguese
Spanish

Nonbreaking Prefixes Files

Nonbreaking prefixes are loosely defined as any word ending in a period that does NOT mark the end of a sentence. Basic examples in English are Mr. and Ms.

The sentence splitter module uses the nonbreaking prefix files included in this distribution.

To add a file for another language, follow the naming convention nonbreaking_prefix.??, where ?? is the two-letter language code you intend to use when creating a Lingua::Sentence object.

The sentence splitter module will first look for a file for the language it is processing, and fall back to English if a file for that language is not found.

Normally, the splitter treats a period followed by an uppercase word as a sentence boundary. If the word preceding the period is a nonbreaking prefix, the line break is not inserted.

A special class of prefixes, NUMERIC_ONLY, covers cases where the prefix should be treated as nonbreaking ONLY when it comes before a number. For example, in "Article No. 24 states this." the No. is a nonbreaking prefix. However, in "No. It is not true." No functions as a word and the period ends the sentence.
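The rule described above can be sketched in a few lines of plain Perl. This is a simplified illustration, not the module's actual implementation: the helper name sketch_split, the small %prefix table, and the codes 1 and 2 are all assumptions made for this example, and the sketch ignores quotes, brackets and other punctuation that the real splitter takes into account.

```perl
use strict;
use warnings;

# Hypothetical prefix table for illustration only; the real module reads
# its prefixes from the nonbreaking prefix files in the distribution.
# 1 = always nonbreaking, 2 = nonbreaking only before a number.
my %prefix = (Mr => 1, Ms => 1, No => 2);

# sketch_split is a hypothetical helper, not part of Lingua::Sentence.
sub sketch_split {
    my ($text) = @_;
    my @words = split ' ', $text;
    my (@sentences, @current);
    for my $i (0 .. $#words) {
        push @current, $words[$i];
        next if $i == $#words;                    # last word is flushed below
        next unless $words[$i] =~ /^(.+)\.$/;     # word must end in a period
        my $stem = $1;
        my $next = $words[$i + 1];
        my $kind = $prefix{$stem} // 0;
        next if $kind == 1;                       # "Mr." never breaks
        next if $kind == 2 && $next =~ /^[0-9]/;  # "No. 24" does not break
        next unless $next =~ /^\p{Lu}/;           # break only before uppercase
        push @sentences, join ' ', @current;
        @current = ();
    }
    push @sentences, join ' ', @current if @current;
    return @sentences;
}

print "$_\n" for sketch_split(
    'Article No. 24 states this. No. It is not true. Mr. Smith agrees.');
```

With this input the sketch keeps "No. 24" and "Mr. Smith" together, while the standalone "No." is emitted as a sentence of its own.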

See the example prefix files included in the distribution for more examples.
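As a rough guide to reading such a file, the sketch below parses prefix data into a hash. The file format is an assumption here (one prefix per line without its trailing period, lines starting with # treated as comments, and a trailing #NUMERIC_ONLY# tag on number-only prefixes); check it against the files shipped with the distribution. The helper parse_prefixes and the sample content are hypothetical and are not part of Lingua::Sentence.

```perl
use strict;
use warnings;

# parse_prefixes is a hypothetical helper, not part of Lingua::Sentence.
# Assumed format: one prefix per line, "#" starts a comment line, and a
# trailing "#NUMERIC_ONLY#" tag marks prefixes that apply only before numbers.
sub parse_prefixes {
    my ($fh) = @_;
    my %prefix;
    while (my $line = <$fh>) {
        chomp $line;
        $line =~ s/^\s+|\s+$//g;               # trim surrounding whitespace
        next if $line eq '' || $line =~ /^#/;  # skip blank and comment lines
        if ($line =~ /^(.*?)\s+#NUMERIC_ONLY#$/) {
            $prefix{$1} = 2;                   # nonbreaking only before numbers
        } else {
            $prefix{$line} = 1;                # always nonbreaking
        }
    }
    return \%prefix;
}

# Illustrative content, not copied from the distribution.
my $sample = "# titles\nMr\nMs\n\nNo #NUMERIC_ONLY#\n";
open my $fh, '<', \$sample or die "cannot open in-memory file: $!";
my $prefix = parse_prefixes($fh);
close $fh;

print "$_ => $prefix->{$_}\n" for sort keys %$prefix;
```

Reading from a filehandle rather than a path keeps the sketch easy to try out on an in-memory string before pointing it at a real file.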

CREDITS

Thanks to the following individuals for supplying nonbreaking prefix files: Bas Rozema (Dutch), Hilário Leal Fontes (Portuguese), Jesús Giménez (Catalan & Spanish).

EXPORT

new($lang_id): Instantiate an object to split sentences in language $lang_id. If the language is not supported, a splitter object for English will be instantiated.
new($lang_id,$nonbreaking_prefix_file): Instantiate an object to split sentences in language $lang_id and the nonbreaking prefix file $nonbreaking_prefix_file. If the file does not exist, a splitter object for English will be instantiated.
split($text): Split sentences in $text by inserting newline characters at the sentence breaks. The resulting string is also terminated with a newline.
split_array($text): Split sentences in $text into an array of sentences.
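A short usage sketch of the two splitting methods, based only on the descriptions above. It assumes Lingua::Sentence is installed and reuses the text from the SYNOPSIS; the exact sentence boundaries depend on the prefix file in use.

```perl
use strict;
use warnings;
use Lingua::Sentence;

my $splitter = Lingua::Sentence->new("en");
my $text = 'This is a paragraph. It contains several sentences. "But why," you ask?';

# split() returns a single string with a newline at each sentence break.
my $joined = $splitter->split($text);
print $joined;

# split_array() returns the sentences as a list instead.
my @sentences = $splitter->split_array($text);
printf "%d: %s\n", $_ + 1, $sentences[$_] for 0 .. $#sentences;
```

split_array is the more convenient choice when each sentence is processed separately, since it avoids splitting the newline-separated string again.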

SUPPORT

Bugs should always be submitted via the project's bug tracker:

http://code.google.com/p/corpus-tools/issues/list

For other issues, contact the maintainer.

AUTHOR

Achim Ruopp, <achimru@gmail.com>

COPYRIGHT AND LICENSE

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.8 or, at your option, any later version of Perl 5 you may have available.

