Search::Fulltext::Tokenizer::MeCab - Provides Japanese fulltext search for Search::Fulltext module
    use Search::Fulltext;
    use Search::Fulltext::Tokenizer::MeCab;

    my $query = '猫';
    my @docs = (
        '我輩は猫である',
        '犬も歩けば棒に当る',
        '実家でてんちゃんって猫を飼ってまして,ものすっごい可愛いんですよほんと',
    );

    my $fts = Search::Fulltext->new({
        docs      => \@docs,
        tokenizer => "perl 'Search::Fulltext::Tokenizer::MeCab::tokenizer'",
    });

    my $results = $fts->search($query);
    is_deeply($results, [0, 2]);  # 1st & 3rd docs include '猫'

    $results = $fts->search('猫 AND 可愛い');
    is_deeply($results, [2]);
Search::Fulltext::Tokenizer::MeCab is a Japanese tokenizer that works with the fulltext search module Search::Fulltext. All you have to do is specify perl 'Search::Fulltext::Tokenizer::MeCab::tokenizer' as the tokenizer of Search::Fulltext.
    my $fts = Search::Fulltext->new({
        docs      => \@docs,
        tokenizer => "perl 'Search::Fulltext::Tokenizer::MeCab::tokenizer'",
    });
You are supposed to use UTF-8 strings for docs.
Various kinds of queries described in "QUERIES" in Search::Fulltext are available, but wildcard queries (e.g. '我*') and phrase queries (e.g. '"我輩は猫である"') are not supported.
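For example, boolean operators from Search::Fulltext's query syntax can be combined with this tokenizer. A minimal sketch, reusing the $fts object built in the SYNOPSIS (the exact operator set is defined by "QUERIES" in Search::Fulltext):

```perl
# Boolean queries work as usual (assuming $fts from the SYNOPSIS above):
my $hits = $fts->search('猫 AND 可愛い');  # docs containing both terms

# These query styles are NOT supported by this tokenizer:
# $fts->search('我*');               # wildcard query
# $fts->search('"我輩は猫である"');  # phrase query
```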
A user dictionary can be used to change the tokenizing behavior of the internally-used Text::MeCab. See the ENVIRONMENTAL VARIABLES section for details.
Some environmental variables are provided to customize the behavior of Search::Fulltext::Tokenizer::MeCab.
Typical usage:
$ ENV1=foobar ENV2=buz perl /path/to/your_script_using_this_module ARGS
MECABDIC_USERDIC
Specify path(s) to MeCab's user dictionary.
See MeCab's manual to learn how to create a user dictionary.
Examples:
    MECABDIC_USERDIC="/path/to/yourdic1.dic"
    MECABDIC_USERDIC="/path/to/yourdic1.dic, /path/to/yourdic2.dic"
MECABDIC_DEBUG
When set to a non-zero value, debug strings appear on STDERR.
In particular, the output below helps you check how your docs are tokenized.
    string to be parsed: 我輩は猫である (7)
    token: 我輩 (2)
    token: は (1)
    token: 猫 (1)
    token: で (1)
    token: ある (2)
    ...
    string to be parsed: 猫 AND 可愛い (9)
    token: 猫 (1)
    string to be parsed: 可愛い (4)
    token: 可愛い (3)
Note that not only docs but also queries are tokenized.
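Both variables can be set on the same invocation, for instance to search with a user dictionary while printing tokenizer debug output. The dictionary path and script name below are hypothetical:

```shell
$ MECABDIC_USERDIC="/path/to/yourdic1.dic" MECABDIC_DEBUG=1 perl your_search_script.pl
```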
Bug reports and pull requests are welcome at https://github.com/laysakura/Search-Fulltext-Tokenizer-MeCab !
To read this manual via perldoc, use the -t option so that UTF-8 characters are displayed correctly.
$ perldoc -t Search::Fulltext::Tokenizer::MeCab
Version 1.05
Sho Nakatani <lay.sakura@gmail.com>, a.k.a. @laysakura
To install Search::Fulltext::Tokenizer::MeCab, copy and paste the appropriate command into your terminal.
cpanm
cpanm Search::Fulltext::Tokenizer::MeCab
CPAN shell
    perl -MCPAN -e shell
    install Search::Fulltext::Tokenizer::MeCab
For more information on module installation, please visit the detailed CPAN module installation guide.