Lingua::RU::OpenCorpora::Tokenizer - tokenizer for OpenCorpora project
    my $tokens = $tokenizer->tokens($text);
    my $bounds = $tokenizer->tokens_bounds($text);
This module tokenizes texts in the Russian language.
Note that it uses a probabilistic algorithm rather than attempting to parse the language. It also relies on pre-calculated data freely provided by the OpenCorpora project.
NOTE: OpenCorpora periodically provides updates for this data. Check out the opencorpora-update-tokenizer script that comes with this distribution.
The algorithm works as follows:
In terms of this module, a context is just a binary vector, currently consisting of 17 elements. It is calculated for every character of the text, converted to its decimal representation, and then checked against the vectors file. Every element is the result of a simple function such as _is_latin, _is_digit or _is_bracket applied to the input character and a few characters around it.
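The vector construction described above can be sketched in plain Perl. This is a simplified illustration, not the module's actual code: the feature functions and the 4-element vector here are hypothetical stand-ins for the real 17-element vector.

```perl
use strict;
use warnings;

# Hypothetical feature functions; the real module uses 17 such checks.
sub _is_latin   { $_[0] =~ /[A-Za-z]/     ? 1 : 0 }
sub _is_digit   { $_[0] =~ /[0-9]/        ? 1 : 0 }
sub _is_bracket { $_[0] =~ /[()\[\]{}]/   ? 1 : 0 }
sub _is_space   { $_[0] =~ /\s/           ? 1 : 0 }

# Build a binary context vector for the character at position $pos,
# then fold the bits into a single decimal value, which would be
# looked up in the vectors file.
sub context_vector {
    my ($text, $pos) = @_;
    my $char = substr $text, $pos, 1;
    my $next = $pos + 1 < length $text ? substr $text, $pos + 1, 1 : '';

    my @bits = (
        _is_latin($char),
        _is_digit($char),
        _is_bracket($char),
        _is_space($next),   # looks one character ahead
    );

    my $decimal = 0;
    $decimal = $decimal * 2 + $_ for @bits;
    return $decimal;
}

print context_vector("ab1 x", 2), "\n";  # prints 5 (binary 0101)
```

The decimal value then serves as a key into the precomputed probability table, so no feature comparison is needed at lookup time.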
The vectors file contains a list of vectors with probability values giving the chance that a given vector marks a token boundary.
It is built by the OpenCorpora project from a semi-automatically annotated corpus.
The hyphens file contains a list of hyphenated Russian words, used in vector calculations.
The exceptions file contains a list of character sequences that are not subject to tokenizing.
The prefixes file contains a list of common prefixes for compound words.
NOTE: all files are stored as gzip archives and are not meant to be edited manually.
Constructs and initializes a new tokenizer object.
Arguments are:
Path to a directory with OpenCorpora data. Optional. Defaults to distribution directory (see File::ShareDir).
Takes text as input and splits it into tokens. Returns a reference to an array of tokens.
You can also pass a hashref with options as a second argument. Current options:
Minimal probability value for a token boundary. Boundaries with lower probability are excluded from consideration.
The default value is 1, which makes the tokenizer split only when it is completely confident.
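A typical call might look like the sketch below. The option key threshold is an assumption inferred from the description above, not confirmed against the module's POD; check the distribution's documentation for the exact name.

```perl
use Lingua::RU::OpenCorpora::Tokenizer;

my $tokenizer = Lingua::RU::OpenCorpora::Tokenizer->new;
my $text      = 'Это пример текста.';

# Default behaviour: split only at fully confident boundaries
my $tokens = $tokenizer->tokens($text);

# Hypothetical: lower the threshold to accept less certain boundaries
# (option name assumed from the description above)
my $loose = $tokenizer->tokens($text, { threshold => 0.5 });

print "$_\n" for @$tokens;
```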
Takes text as input and finds the bounds of tokens in it. It doesn't split the text into tokens; it only marks where tokens could be.
Returns an arrayref of arrayrefs. Each inner arrayref holds two elements: the boundary position in the text and its probability.
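Iterating over the returned structure could look like this sketch, which assumes the arrayref-of-arrayrefs shape described above:

```perl
use Lingua::RU::OpenCorpora::Tokenizer;

my $tokenizer = Lingua::RU::OpenCorpora::Tokenizer->new;
my $text      = 'Пример текста.';

my $bounds = $tokenizer->tokens_bounds($text);

# Each element is [position, probability], per the description above
for my $bound (@$bounds) {
    my ($pos, $prob) = @$bound;
    printf "boundary after character %d (p=%.2f)\n", $pos, $prob;
}
```

This form is useful when you want to apply your own cut-off to boundary probabilities instead of relying on the tokens() method's splitting.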
Lingua::RU::OpenCorpora::Tokenizer::Updater
http://mathlingvo.ru/nlpseminar/archive/s_49
OpenCorpora.org team http://opencorpora.org
This program is free software, you can redistribute it under the same terms as Perl itself.
To install Lingua::RU::OpenCorpora::Tokenizer, copy and paste the appropriate command into your terminal.
cpanm
    cpanm Lingua::RU::OpenCorpora::Tokenizer
CPAN shell
    perl -MCPAN -e shell
    install Lingua::RU::OpenCorpora::Tokenizer
For more information on module installation, please visit the detailed CPAN module installation guide.