NAME

Lingua::CJK::Tokenizer - CJK Tokenizer

SYNOPSIS

    my $tknzr = Lingua::CJK::Tokenizer->new();
    $tknzr->ngram_size(5);
    $tknzr->max_token_count(100);
    $tokens_ref = $tknzr->tokenize("CJK Text");
    $tokens_ref = $tknzr->segment("CJK Text");
    $tokens_ref = $tknzr->split("CJK Text");
    $flag = $tknzr->has_cjk("CJK Text");
    $flag = $tknzr->has_cjk_only("CJK Text");

DESCRIPTION

This module tokenizes CJK (Chinese, Japanese, and Korean) text into n-grams.
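
A minimal sketch of the idea, reusing the calls from the SYNOPSIS; the n-grams mentioned in the comments are illustrative, since their exact form depends on the configured n-gram size and the tokenizer's internals.

    use utf8;
    use Lingua::CJK::Tokenizer;

    binmode STDOUT, ':encoding(UTF-8)';

    my $tknzr = Lingua::CJK::Tokenizer->new();
    $tknzr->ngram_size(2);                          # ask for 2-character n-grams
    my $tokens_ref = $tknzr->tokenize("中文資訊處理");
    print join(" ", @$tokens_ref), "\n";            # prints the resulting n-grams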

METHODS

ngram_size

Sets the size of the returned n-grams.
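
For example, to request 2-character n-grams from the tokenizer created in the SYNOPSIS:

    $tknzr->ngram_size(2);    # later calls to tokenize() return 2-character n-grams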

max_token_count

Sets an upper limit on the number of returned n-grams, useful when the input text is very long or of unbounded length.
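
For example, to cap the output at 100 n-grams, as in the SYNOPSIS:

    $tknzr->max_token_count(100);    # tokenize() returns no more than 100 n-grams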

tokenize

Tokenizes text into n-grams and returns a reference to the list of tokens.
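
A minimal sketch; what the returned n-grams look like depends on the configured n-gram size.

    my $tokens_ref = $tknzr->tokenize("中文資訊處理");
    print "$_\n" for @$tokens_ref;    # one n-gram per line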

segment

Cuts CJK text into chunks and returns a reference to the resulting list.
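
A sketch of the calling convention, following the SYNOPSIS; where the chunk boundaries fall is up to the implementation.

    my $chunks_ref = $tknzr->segment("Perl 處理中文 text");
    print "$_\n" for @$chunks_ref;    # one chunk per line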

split

Tokenizes text into uni-grams, that is, single characters.
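
For example, a two-character string would be expected to split into its individual characters:

    my $chars_ref = $tknzr->split("中文");
    print join(" ", @$chars_ref), "\n";    # expected output: 中 文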

has_cjk

Returns true if the text contains CJK characters.
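
For example, a string mixing Latin and CJK characters should still test true:

    print "contains CJK\n" if $tknzr->has_cjk("Perl 與 中文");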

has_cjk_only

Returns true if the text contains only CJK characters.
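
For example, assuming whitespace and Latin letters count as non-CJK:

    $tknzr->has_cjk_only("中文");         # true: nothing but CJK characters
    $tknzr->has_cjk_only("Perl 中文");    # false: contains Latin letters and a space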

PREREQUISITE

This module requires libunicode by Tom Tromey.

COPYRIGHT

Copyright (c) 2009 Yung-chung Lin.

This program is free software; you can redistribute it and/or modify it under the terms of the MIT License.
