
Lingua::CJK::Tokenizer - CJK Tokenizer

use Lingua::CJK::Tokenizer;

my $tknzr = Lingua::CJK::Tokenizer->new();
$tknzr->ngram_size(5);
$tknzr->max_token_count(100);
my $tokens_ref = $tknzr->tokenize("CJK Text");
$tokens_ref = $tknzr->segment("CJK Text");
$tokens_ref = $tknzr->split("CJK Text");
my $flag = $tknzr->has_cjk("CJK Text");
$flag = $tknzr->has_cjk_only("CJK Text");

This module tokenizes CJK (Chinese, Japanese, and Korean) text into n-grams.
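
To illustrate what "n-grams" means here, the following is a minimal pure-Perl sketch of overlapping character n-grams with an optional cap on how many are returned. It is not the module's implementation: char_ngrams and its arguments are hypothetical names made up for this example, and the module's real output (for instance, whether shorter n-grams are also returned) may differ.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical helper, NOT part of Lingua::CJK::Tokenizer.
# Returns a reference to the list of overlapping character n-grams of $text.
# $max (optional) caps the number of n-grams returned.
sub char_ngrams {
    my ($text, $n, $max) = @_;
    my @chars = split //, $text;
    my @grams;
    # The range is empty when the text is shorter than $n characters.
    for my $i (0 .. @chars - $n) {
        last if defined $max && @grams >= $max;
        push @grams, join '', @chars[$i .. $i + $n - 1];
    }
    return \@grams;
}

binmode STDOUT, ':encoding(UTF-8)';
print join(' ', @{ char_ngrams('東京都庁', 2) }), "\n";    # prints: 東京 京都 都庁
```

The cap plays the same role as max_token_count in the synopsis: it bounds the work done on very long input.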

ngram_size - sets the size of the returned n-grams.
max_token_count - sets an upper limit on the number of returned n-grams, for input text that is very long or of indefinite size.
tokenize - tokenizes text into n-grams.
segment - cuts CJK text into chunks.
split - tokenizes text into unigrams.
has_cjk - returns true if the text contains CJK characters.
has_cjk_only - returns true if the text contains only CJK characters.
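
The two CJK checks can be approximated in plain Perl with Unicode script properties. This is only a sketch under an assumption: "CJK" is taken to mean the Han, Hiragana, Katakana, and Hangul scripts, which may not match the character ranges the module uses internally. contains_cjk and is_cjk_only are hypothetical names, not the module's methods.

```perl
use strict;
use warnings;
use utf8;

# Assumption: "CJK" means these four Unicode scripts. The module may differ.
my $CJK = qr/[\p{Han}\p{Hiragana}\p{Katakana}\p{Hangul}]/;

# Hypothetical stand-in for has_cjk: true if any character is CJK.
sub contains_cjk {
    my ($text) = @_;
    return $text =~ /$CJK/ ? 1 : 0;
}

# Hypothetical stand-in for has_cjk_only: true if every character is CJK.
# Whitespace and punctuation count as non-CJK in this sketch.
sub is_cjk_only {
    my ($text) = @_;
    return $text =~ /\A$CJK+\z/ ? 1 : 0;
}

print contains_cjk('Perl と 日本語'), "\n";    # prints: 1
print is_cjk_only('Perl と 日本語'), "\n";     # prints: 0
print is_cjk_only('日本語'), "\n";             # prints: 1
```

A check like this is useful for routing mixed input: text with no CJK characters can go to a whitespace-based tokenizer, and the rest to the n-gram tokenizer.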

This module requires libunicode by Tom Tromey.

Copyright (c) 2009 Yung-chung Lin.
This program is free software; you can redistribute it and/or modify it under the MIT License.