
Module Version: 0.04

NAME

Lingua::ZH::WordSegment - Simple Simplified Chinese Word Segmentation

SYNOPSIS

        use Lingua::ZH::WordSegment;
        print seg($str_in);     # print the segmentation of $str_in
        seg_STDIO();            # read from STDIN and print the segmented result to STDOUT

        set_dic($dictionary_file_name); # optional: load the word list from a file

        # From the command line:
        perl -MLingua::ZH::WordSegment -e 'seg_STDIO();' < input_file > output_file

DESCRIPTION

The default word list is extracted from the January 1998 People's Daily corpus, which is owned by the Institute of Computational Linguistics, Peking University, China.

This code was mainly written by Joy (joy@cs.cmu.edu) on July 4th, 2001.

This program is a Perl version of a left-right Mandarin segmenter. It was written because the LDC segmenter takes a long time to build its DB files, which makes the training process take too long.

For ablation experiments, we do not need to create the DB files because the specific frequency dictionary will be used only once for each slice.

The segmenter searches for the longest dictionary word at each point, scanning from both the left and the right, and chooses the result with the higher frequency product.
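
The idea can be illustrated with a minimal sketch. This is not the module's actual code: it is one plausible reading of the description, simplified to compare two whole-string segmentations rather than deciding at each point. The names %freq, forward_match, backward_match, freq_product and segment are hypothetical, the toy frequencies are invented for illustration, and the sketch assumes UTF-8 source text and Perl 5.10 or later.

```perl
use strict;
use warnings;
use utf8;                               # string literals below are UTF-8
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical toy frequency dictionary (word => count); not the module's data.
my %freq = (
    '研究'   => 50,
    '研究生' => 20,
    '生命'   => 40,
    '命'     => 5,
    '的'     => 100,
    '起源'   => 30,
);
my $max_len = 3;                        # longest dictionary entry, in characters

# Greedy longest match, scanning left to right.
sub forward_match {
    my ($text) = @_;
    my @words;
    my $pos = 0;
    while ($pos < length $text) {
        my $len = $max_len;
        $len = length($text) - $pos if $len > length($text) - $pos;
        # Shrink the candidate until it is a dictionary word or a single character.
        $len-- while $len > 1 && !exists $freq{ substr($text, $pos, $len) };
        push @words, substr($text, $pos, $len);
        $pos += $len;
    }
    return @words;
}

# Greedy longest match, scanning right to left.
sub backward_match {
    my ($text) = @_;
    my @words;
    my $end = length $text;
    while ($end > 0) {
        my $len = $max_len < $end ? $max_len : $end;
        $len-- while $len > 1 && !exists $freq{ substr($text, $end - $len, $len) };
        unshift @words, substr($text, $end - $len, $len);
        $end -= $len;
    }
    return @words;
}

# Product of word frequencies; unknown words count as 1 (an assumption).
sub freq_product {
    my $product = 1;
    $product *= ($freq{$_} // 1) for @_;
    return $product;
}

# Keep whichever direction yields the higher frequency product.
sub segment {
    my ($text) = @_;
    my @fwd = forward_match($text);
    my @bwd = backward_match($text);
    return freq_product(@bwd) > freq_product(@fwd) ? @bwd : @fwd;
}

my $sample = '研究生命的起源';
my @result = segment($sample);
print join(' ', @result), "\n";         # prints: 研究 生命 的 起源
```

With these toy counts the left-to-right scan greedily takes 研究生 and is left with the rare single character 命, while the right-to-left scan finds 研究 and 生命; the second result has the higher product and is kept.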

The statements above are Joy's original notes.

METHODS

        seg($str_in);   # Return the segmentation result as a string.
        seg_STDIO();    # Read from STDIN and print the segmented result to STDOUT.

        set_dic($dictionary_file_name);
        # Each line of the dictionary file has the format:
        #   "chineseWord\tFrequency\n"
        #
        # Note that if you do not call set_dic,
        # the default dictionary (GBK-encoded) is loaded.
        # The default dictionary is extracted from the People's Daily corpus,
        # January 1998.
        # Thanks to the Institute of Computational Linguistics, Peking University, China.
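
As an illustration of the dictionary file format, the following sketch writes a small tab-separated word list and reads it back. The word counts are hypothetical, and writing the file in GBK is an assumption made to match the default dictionary; the documentation does not state which encoding a custom dictionary must use.

```perl
use strict;
use warnings;
use utf8;                               # string literals below are UTF-8
use File::Temp qw(tempfile);

# Hypothetical word counts; real entries would come from your own corpus.
my %count = ('研究' => 50, '生命' => 40, '起源' => 30);

# Assumption: use GBK, the encoding of the default dictionary.
my $encoding = 'gbk';

# Create a temporary file that is removed when the program exits.
my ($tmp_fh, $dic_file) = tempfile(SUFFIX => '.dic', UNLINK => 1);
close $tmp_fh;

# Write one "word<TAB>frequency" entry per line, most frequent first.
open my $out, ">:encoding($encoding)", $dic_file
    or die "Cannot write $dic_file: $!";
for my $word (sort { $count{$b} <=> $count{$a} } keys %count) {
    print {$out} "$word\t$count{$word}\n";
}
close $out or die "Cannot close $dic_file: $!";

# Read the file back to confirm the format round-trips.
open my $in, "<:encoding($encoding)", $dic_file
    or die "Cannot read $dic_file: $!";
my @entries;
while (my $line = <$in>) {
    chomp $line;
    my ($word, $frequency) = split /\t/, $line;
    push @entries, [ $word, $frequency ];
}
close $in;

print "Wrote ", scalar(@entries), " entries to $dic_file\n";
```

A file produced this way could then be passed to set_dic($dic_file) before calling seg.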

SEE ALSO

AUTHORS

Rewritten by Chen Yirong <cyr.master@gmail.com> on September 21, 2006, and modified on February 20, 2007. Original author: Joy (joy@cs.cmu.edu), July 4th, 2001.

KUDOS

Many thanks to Joy, who made the code available. Thanks also to the PKU Corpus (from the Institute of Computational Linguistics, Peking University, China), which was used to generate the default dictionary automatically.

COPYRIGHT

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

See http://www.perl.com/perl/misc/Artistic.html
