Lingua::EN::Tokenizer::Offsets - Finds word (token) boundaries, and returns their offsets.
version 0.01_02
    use Lingua::EN::Tokenizer::Offsets qw/token_offsets get_tokens/;

    my $str = <<END;
    Hey! Mr. Tambourine Man, play a song for me.
    I'm not sleepy and there is no place I’m going to.
    END

    my $offsets = token_offsets($str);     ## Get the offsets.
    foreach my $o (@$offsets) {
        my $start  = $o->[0];
        my $length = $o->[1] - $o->[0];

        my $token = substr($str, $start, $length);  ## Get a token.
        # ...
    }

    ### or

    my $tokens = get_tokens($str);
    foreach my $token (@$tokens) {
        ## do something with $token
    }
Takes text as input and returns a tokenized version (space-separated tokens).
Takes text input and returns a reference to an array containing pairs of character offsets, corresponding to each token's start and end positions.
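To make the shape of the returned structure concrete, here is a minimal sketch that builds the same kind of array of [start, end] pairs. The function simple_token_offsets and its regular expression are hypothetical stand-ins, not the module's tokenization rules. The end offset is taken to be one past the last character, which matches the synopsis computing a token's length as end minus start.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a tokenizer: a token is a run of word
# characters or a single punctuation character. This is NOT the
# module's rule set, only a way to produce [start, end] pairs.
sub simple_token_offsets {
    my ($text) = @_;
    my @offsets;
    while ($text =~ /\w+|[^\w\s]/g) {
        # $-[0] is the match start, $+[0] is one past the match end.
        push @offsets, [ $-[0], $+[0] ];
    }
    return \@offsets;
}

my $text    = "Hey! Mr. Man";
my $offsets = simple_token_offsets($text);

# Print each pair as [start,end].
print join(' ', map { sprintf('[%d,%d]', @$_) } @$offsets), "\n";
```

Because the pairs index into the original string, the caller can recover each token with substr without copying or modifying the text.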
Takes text input and splits it into tokens.
Makes minor adjustments to the offsets (removing leading/trailing whitespace, etc.).
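One adjustment of this kind can be sketched as follows. The helper trim_offsets is hypothetical and only covers the whitespace case named above. It assumes [start, end] pairs with an exclusive end, and it drops pairs that become empty, which is a design choice of this sketch rather than documented behaviour.

```perl
use strict;
use warnings;

# Hypothetical helper: shrink each [start, end] pair so that it no longer
# covers leading or trailing whitespace; drop pairs that become empty.
sub trim_offsets {
    my ($text, $offsets) = @_;
    my @trimmed;
    for my $o (@$offsets) {
        my ($start, $end) = @$o;
        # Move the start forward past whitespace.
        $start++ while $start < $end && substr($text, $start, 1) =~ /\s/;
        # Move the end backward past whitespace.
        $end-- while $end > $start && substr($text, $end - 1, 1) =~ /\s/;
        push @trimmed, [ $start, $end ] if $end > $start;
    }
    return \@trimmed;
}

my $text    = "play  a song ";
my $trimmed = trim_offsets($text, [ [0, 5], [5, 8], [8, 13], [13, 13] ]);

# Print each adjusted pair as [start,end].
print join(' ', map { sprintf('[%d,%d]', @$_) } @$trimmed), "\n";
```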
Performs a first, naive delimitation of tokens.
Given a list of token boundary offsets and a text, returns an array with the text split into tokens.
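The conversion from offsets to tokens amounts to one substr call per pair. The sketch below uses a hypothetical helper, tokens_from_offsets, and hand-picked offsets chosen for illustration; they are not output produced by the module.

```perl
use strict;
use warnings;

# Hypothetical helper: extract one substring per [start, end] pair,
# treating the end offset as exclusive.
sub tokens_from_offsets {
    my ($text, $offsets) = @_;
    return [ map { substr($text, $_->[0], $_->[1] - $_->[0]) } @$offsets ];
}

my $text = "I'm not sleepy.";

# Offsets written by hand for this example.
my $tokens = tokens_from_offsets($text, [ [0, 3], [4, 7], [8, 14], [14, 15] ]);

# Joining the tokens with single spaces gives a space-separated rendering.
print join(' ', @$tokens), "\n";
```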
Based on the original tokenizer written by Josh Schroeder and provided by Europarl http://www.statmt.org/europarl/.
Lingua::EN::Sentence::Offsets, Lingua::FreeLing3::Tokenizer
Andre Santos <andrefs@cpan.org>
This software is copyright (c) 2012 by Andre Santos.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.