
NAME

Lingua::EN::Tokenizer::Offsets - Finds word (token) boundaries, and returns their offsets.

VERSION

version 0.01_03

SYNOPSIS

    use Lingua::EN::Tokenizer::Offsets qw/token_offsets get_tokens/;
     
    my $str = <<END;
    Hey! Mr. Tambourine Man, play a song for me.
    I'm not sleepy and there is no place I'm going to.
    END

    my $offsets = token_offsets($str);     ## Get the offsets.
    foreach my $o (@$offsets) {
        my $start  = $o->[0];
        my $length = $o->[1]-$o->[0];

        my $token = substr($str, $start, $length);  ## Get a token.
        # ...
    }

    ### or

    my $tokens = get_tokens($str);     
    foreach my $token (@$tokens) {
        ## do something with $token
    }

METHODS

tokenize($text)

Returns a tokenized version of $text (space-separated tokens).

$text can be a scalar or a scalar reference.
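
The functions below accept the text either by value or by reference. As a minimal sketch of how such an argument can be handled, the hypothetical helper text_ref (not part of this module) normalises both forms to a scalar reference; it only illustrates the calling convention and makes no claim about the module's internals.

```perl
use strict;
use warnings;

# Hypothetical helper, not provided by Lingua::EN::Tokenizer::Offsets:
# always hand back a scalar reference, whichever form the caller passed.
sub text_ref {
    my ($text) = @_;
    return ref($text) eq 'SCALAR' ? $text : \$text;
}

my $str = "Hey! Mr. Tambourine Man";

my $from_scalar = text_ref($str);     # called with a plain scalar
my $from_ref    = text_ref(\$str);    # called with a scalar reference

# Either way, dereferencing yields the same characters.
print "same text\n" if $$from_scalar eq $$from_ref;
```

Passing a reference avoids copying a large text on each call, which is presumably why both forms are accepted.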

get_offsets($text)

Returns a reference to an array containing pairs of character offsets, corresponding to the start and end positions of the tokens in $text.

$text can be a scalar or a scalar reference.
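
To illustrate the shape of the returned structure, here is a sketch that builds an array reference of [start, end] pairs. The function naive_offsets is a hypothetical stand-in that treats every run of non-whitespace as one token; the module's own rules (punctuation, abbreviations and so on) are not reproduced. The end offset is assumed to point one past the last character, which is consistent with the length computation in the SYNOPSIS.

```perl
use strict;
use warnings;

# Hypothetical stand-in for illustration only: one token per run of
# non-whitespace characters. This is NOT the module's algorithm.
sub naive_offsets {
    my ($text) = @_;
    my @offsets;
    while ($text =~ /\S+/g) {
        # $-[0] is where the match starts, $+[0] is one past where it ends.
        push @offsets, [ $-[0], $+[0] ];
    }
    return \@offsets;
}

my $offsets = naive_offsets("play a song");
for my $o (@$offsets) {
    printf "%d-%d\n", $o->[0], $o->[1];
}
```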

get_tokens($text)

Splits $text into tokens, returning an array reference.

$text can be a scalar or a scalar reference.

adjust_offsets($text,$offsets)

Makes minor adjustments to offsets (leading/trailing whitespace, etc.).

$text can be a scalar or a scalar reference.
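
One kind of adjustment the description suggests is excluding whitespace at either end of a span. The sketch below shows that idea with a hypothetical helper, trim_offsets; it is an assumption about what such an adjustment could look like, not the module's code, and it again treats each pair as [start, end) with an exclusive end.

```perl
use strict;
use warnings;

# Hypothetical helper, not the module's implementation: shrink each
# [start, end) pair so it no longer covers leading or trailing whitespace.
sub trim_offsets {
    my ($text, $offsets) = @_;
    my @trimmed;
    for my $o (@$offsets) {
        my ($start, $end) = @$o;
        $start++ while $start < $end && substr($text, $start, 1)   =~ /\s/;
        $end--   while $end > $start && substr($text, $end - 1, 1) =~ /\s/;
        push @trimmed, [ $start, $end ] if $end > $start;   # drop empty spans
    }
    return \@trimmed;
}

my $text    = "a song  for me";
my $trimmed = trim_offsets($text, [ [0, 2], [2, 8], [8, 14] ]);
printf "%d-%d\n", @$_ for @$trimmed;
```

Whitespace inside a span (as in the last pair) is left alone; only the edges are moved.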

initial_offsets($text)

Performs a first, naive delimitation of tokens.

$text can be a scalar or a scalar reference.

offsets2tokens($text,$offsets)

Given a text and a list of token boundary offsets, returns an array with the text split into tokens.

$text can be a scalar or a scalar reference.
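
The conversion from offsets to tokens amounts to one substr call per pair. The following sketch uses a hypothetical function, tokens_from_offsets, and hand-picked offsets rather than real output from this module; it assumes [start, end] pairs with an exclusive end, as in the SYNOPSIS.

```perl
use strict;
use warnings;

# Hypothetical re-implementation for illustration; not the module's code.
sub tokens_from_offsets {
    my ($text, $offsets) = @_;
    # The length of each token is end minus start.
    return map { substr($text, $_->[0], $_->[1] - $_->[0]) } @$offsets;
}

my $text    = "Hey! Mr. Tambourine Man";
my @offsets = ( [0, 3], [3, 4], [5, 8], [9, 19], [20, 23] );   # chosen by hand
my @tokens  = tokens_from_offsets($text, \@offsets);

print join('|', @tokens), "\n";
```

Because the tokens are cut from the original string, the offsets can also be used to map each token back to its exact position in the source text.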

ACKNOWLEDGEMENTS

Based on the original tokenizer written by Josh Schroeder and provided by Europarl http://www.statmt.org/europarl/.

SEE ALSO

Lingua::EN::Sentence::Offsets, Lingua::FreeLing3::Tokenizer

AUTHOR

André Santos <andrefs@cpan.org>

COPYRIGHT AND LICENSE

This software is copyright (c) 2012 by Andre Santos.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.
