Module Version: 0.04


Lingua::ZH::WordSegment - Simple Simplified Chinese Word Segmentation


        use Lingua::ZH::WordSegment;
        print seg($str_in);   # Segment a string and print the result
        seg_STDIO();          # Read from STDIN and print the segmented result to STDOUT

        set_dic($dictionary_file_name);   # Load words from the file (optional)

        # From the command line:
        #   perl -MLingua::ZH::WordSegment -e 'seg_STDIO();' < input_file > output_file


The default word list was extracted from the January 1998 People's Daily corpus, which is owned by the Institute of Computational Linguistics, Peking University, China.

This code was mainly written by Joy on July 4th, 2001.

This program is a Perl version of a left-right Mandarin segmenter. It was written because the LDC segmenter takes a long time to build its DB files, which makes the training process too slow.

For ablation experiments, we do not need to create the DB files because the specific frequency dictionary will be used only once for each slice.

The segmenter searches for the longest word at each point from both the left and the right, and chooses the result with the higher frequency product.
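
The idea of matching from both directions and comparing frequency products can be sketched as follows. This is a minimal illustration, not the module's actual code: the toy dictionary, the function names (`forward_match`, `backward_match`, `score`, `segment`), the fallback frequency for unknown characters, and the choice to compare products over the whole string are all assumptions made for the example. It also uses UTF-8 source text, whereas the module's default dictionary is in GBK.

```perl
use strict;
use warnings;
use utf8;    # the string literals below are UTF-8 (an assumption of this sketch)

# Hypothetical toy dictionary: word => frequency.
my %freq = (
    '研究'   => 50,
    '研究生' => 20,
    '生命'   => 40,
    '命'     => 5,
    '的'     => 100,
    '起源'   => 30,
);
my $max_len = 3;    # longest dictionary word, in characters
my $unknown = 1;    # assumed frequency for a character missing from the dictionary

# Scan left to right, taking the longest dictionary word at each position.
sub forward_match {
    my ($text) = @_;
    my @words;
    my $pos = 0;
    while ($pos < length $text) {
        my $rest = length($text) - $pos;
        my $len  = $max_len < $rest ? $max_len : $rest;
        # Shrink the window until it is a dictionary word or a single character.
        $len-- while $len > 1 && !exists $freq{ substr($text, $pos, $len) };
        push @words, substr($text, $pos, $len);
        $pos += $len;
    }
    return @words;
}

# Scan right to left, taking the longest dictionary word ending at each position.
sub backward_match {
    my ($text) = @_;
    my @words;
    my $end = length $text;
    while ($end > 0) {
        my $len = $max_len < $end ? $max_len : $end;
        $len-- while $len > 1 && !exists $freq{ substr($text, $end - $len, $len) };
        unshift @words, substr($text, $end - $len, $len);
        $end -= $len;
    }
    return @words;
}

# Product of the word frequencies of one candidate segmentation.
sub score {
    my $product = 1;
    $product *= ($freq{$_} // $unknown) for @_;
    return $product;
}

# Keep whichever direction yields the higher frequency product.
sub segment {
    my ($text) = @_;
    my @fwd = forward_match($text);
    my @bwd = backward_match($text);
    my @best = score(@bwd) > score(@fwd) ? @bwd : @fwd;
    return join ' ', @best;
}

binmode STDOUT, ':encoding(UTF-8)';
print segment('研究生命的起源'), "\n";    # prints "研究 生命 的 起源"
```

With this toy dictionary the left-to-right pass yields 研究生 / 命 / 的 / 起源 (product 300000), while the right-to-left pass yields 研究 / 生命 / 的 / 起源 (product 6000000), so the second segmentation is chosen.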

The notes above are based on Joy's original description.


        seg($str_in);   # Return the segmentation result as a string.
        seg_STDIO();    # Read from STDIN and print the segmented result to STDOUT.
        set_dic($dictionary_file_name);   # Load words from the file (optional).
        # Each line of the dictionary file has the format:
        #   "chineseWord\tFrequency\n"
        # If you do not call set_dic, the default dictionary
        # (GBK encoding) is loaded.
        # The default dictionary was extracted from the People's Daily corpus,
        # January 1998.
        # Thanks to the Institute of Computational Linguistics, Peking University, China.
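
As an illustration of the `"chineseWord\tFrequency\n"` format, the sketch below reads such lines into a hash. It is not the module's `set_dic` implementation: `read_dictionary` is a hypothetical helper, the sample data uses ASCII placeholders instead of GBK-encoded Chinese words, and skipping malformed lines is an assumption about how one might want to handle them.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Lingua::ZH::WordSegment:
# read "word<TAB>frequency" lines from a filehandle into a hash reference.
sub read_dictionary {
    my ($fh) = @_;
    my %freq;
    while (my $line = <$fh>) {
        chomp $line;
        next unless length $line;                 # skip empty lines
        my ($word, $count) = split /\t/, $line, 2;
        # Skip lines without a numeric frequency field (an assumption of this sketch).
        next unless defined $count && $count =~ /^\d+$/;
        $freq{$word} = $count;
    }
    return \%freq;
}

# Demonstration with an in-memory file; the third line is deliberately malformed.
my $sample = "wordA\t120\nwordB\t7\nbroken line\n";
open my $fh, '<', \$sample or die "cannot open in-memory file: $!";
my $dict = read_dictionary($fh);
close $fh;

printf "%s => %d\n", $_, $dict->{$_} for sort keys %$dict;
```

A real dictionary file would be opened by name instead, with an input layer matching its encoding.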



Rewritten by Chen Yirong on September 21, 2006, and modified on February 20, 2007. Original author: Joy, July 4th, 2001.


Many thanks to Joy, who made the code available. Thanks also to the PKU Corpus (from the Institute of Computational Linguistics, Peking University, China), which was used to automatically generate the default dictionary.


This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

