Joerg Tiedemann > uplug-main-0.3.8 > Uplug::PreProcess::Tokenizer

NAME

Uplug::PreProcess::Tokenizer

SYNOPSIS

 use Uplug::PreProcess::Tokenizer;

 my $tokenizer = Uplug::PreProcess::Tokenizer->new( lang => 'en' );
 my @tokens = $tokenizer->tokenize( 'Mr. Smith says: "What is a text anyway?"' );
 my $text = $tokenizer->detokenize( '" Big improvement ! " says Mr. Smith .' );

IMPLEMENTS

tokenize

Tokenize a given text. Returns a list of tokens.

detokenize

De-tokenize a space-separated text or a list of tokens. Returns plain text.
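
To illustrate what de-tokenization involves, here is a minimal standalone sketch. It is not the code used by this module or by Moses; the function name sketch_detokenize and its rules (attach closing punctuation to the left, alternate opening and closing double quotes) are assumptions made for illustration only.

```perl
use strict;
use warnings;

# Simplified detokenizer sketch (hypothetical; not the module's actual code).
# Accepts either one space-separated string or a list of tokens.
sub sketch_detokenize {
    my @tokens = @_ == 1 ? split( ' ', $_[0] ) : @_;
    my $text       = '';
    my $open_quote = 0;    # true while inside a double-quoted span
    my $glue_next  = 1;    # true if the next token attaches without a space

    for my $tok (@tokens) {
        if ( $tok eq '"' ) {
            if ($open_quote) {    # closing quote: attach to the left
                $text .= $tok;
                $glue_next = 0;
            }
            else {                # opening quote: attach to the right
                $text .= ( $glue_next ? '' : ' ' ) . $tok;
                $glue_next = 1;
            }
            $open_quote = !$open_quote;
        }
        elsif ( $tok =~ /^[.,;:!?)]+$/ ) {    # closing punctuation
            $text .= $tok;
            $glue_next = 0;
        }
        else {                                # ordinary word or '('
            $text .= ( $glue_next ? '' : ' ' ) . $tok;
            $glue_next = ( $tok eq '(' ) ? 1 : 0;
        }
    }
    return $text;
}

print sketch_detokenize('" Big improvement ! " says Mr. Smith .'), "\n";
# prints: "Big improvement!" says Mr. Smith.
```

The sketch ignores language-specific conventions such as apostrophes, currency symbols and French-style spacing, which a full detokenizer has to handle.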

load_prefixes

Load language-specific abbreviations and other non-breaking prefixes.
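
The role of non-breaking prefixes can be shown with a small standalone sketch. It is not the module's implementation: the hard-coded prefix list, the hash %NONBREAKING and the function sketch_tokenize are hypothetical, whereas the real module loads language-specific lists through load_prefixes.

```perl
use strict;
use warnings;

# Hypothetical, hard-coded prefix list for illustration only.
my %NONBREAKING = map { $_ => 1 } qw(Mr Mrs Dr Prof);

# Simplified tokenizer sketch (not Uplug's or Moses' actual code).
sub sketch_tokenize {
    my ($text) = @_;

    # Put spaces around punctuation other than the period.
    $text =~ s/([":;,!?()])/ $1 /g;

    my @tokens;
    for my $word ( split ' ', $text ) {
        # Split off a trailing period unless the word is a known prefix.
        if ( $word =~ /^(.+)\.$/ && !$NONBREAKING{$1} ) {
            push @tokens, $1, '.';
        }
        else {
            push @tokens, $word;
        }
    }
    return @tokens;
}

my @tokens = sketch_tokenize('Mr. Smith says: "What is a text anyway?"');
print join( ' ', @tokens ), "\n";
# prints: Mr. Smith says : " What is a text anyway ? "
```

Because "Mr" is in the prefix list, its period stays attached, while a sentence-final period after an ordinary word would become a separate token.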

DESCRIPTION

This module relies heavily on the tokenizer and detokenizer implementation from the Moses toolkit for statistical machine translation (SMT). All credit goes to the original authors, Josh Schroeder and Philipp Koehn.
