
Uplug::PreProcess::Tokenizer

  use Uplug::PreProcess::Tokenizer;

  my $tokenizer = Uplug::PreProcess::Tokenizer->new( lang => 'en' );
  my @tokens    = $tokenizer->tokenize( 'Mr. Smith says: "What is a text anyway?"' );
  my $text      = $tokenizer->detokenize( '" Big improvement ! " says Mr. Smith .' );

tokenize
    Tokenize a given text. Returns a list of tokens.

detokenize
    De-tokenize a space-separated text or a list of tokens. Returns plain text.

load_prefixes
    Load language-specific abbreviations and other non-breaking prefixes.
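
To illustrate why non-breaking prefixes matter, here is a minimal, self-contained sketch. It is not the module's implementation: the function name sketch_tokenize and the prefix list are hypothetical, and the real tokenizer handles far more cases (quotes, colons, language-specific rules). It only shows the idea that a period stays attached to a word when that word is a known abbreviation.

```perl
use strict;
use warnings;

# Hypothetical prefix list for illustration only; the module loads its own
# language-specific list via load_prefixes.
my %nonbreaking = map { $_ => 1 } qw(Mr Mrs Dr Prof);

# Simplified sketch: split on whitespace, then detach a trailing period
# unless the word before it is a known non-breaking prefix.
sub sketch_tokenize {
    my ($text) = @_;
    my @tokens;
    for my $word ( split ' ', $text ) {
        if ( $word =~ /^(.+)\.$/ && !$nonbreaking{$1} ) {
            push @tokens, $1, '.';    # sentence-final period becomes its own token
        }
        else {
            push @tokens, $word;      # abbreviation or plain word stays intact
        }
    }
    return @tokens;
}

print join( ' ', sketch_tokenize('Mr. Smith met Dr. Jones today.') ), "\n";
```

In this sketch "Mr." and "Dr." keep their periods because they appear in the prefix list, while the period after "today" is split off as a separate token.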

This module relies heavily on the tokenizer and detokenizer implementations from the Moses toolkit for statistical machine translation (SMT). All credit goes to the original authors (Josh Schroeder and Philipp Koehn).