Lingua::ZH::WordSegment - Simple Simplified Chinese Word Segmentation
    use Lingua::ZH::WordSegment;

    print seg($str_in);
    seg_STDIO();                       # Read from STDIN, and print the segmented result to STDOUT
    set_dic($dictionary_file_name);    # Load words from the file; this is optional

    perl -MLingua::ZH::WordSegment -e 'seg_STDIO();' < input_file > output_file
The default word list is extracted from the People's Daily corpus of January 1998, owned by the Institute of Computational Linguistics, Peking University, China.
This code was mainly written by Joy (firstname.lastname@example.org) on July 4th, 2001.
This program is a Perl version of a left-right Mandarin segmenter. The LDC segmenter takes a long time to build its DB files, which makes the training process last too long.
For ablation experiments, we do not need to create the DB files because the specific frequency dictionary will be used only once for each slice.
The segmenter searches for the longest word at each point from both the left and the right directions, and chooses the result with the higher frequency product.
The above is Joy's original declaration.
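The bidirectional longest-match idea in Joy's notes can be illustrated with a minimal sketch. This is not the module's actual code: the toy dictionary %freq, the helper names (forward_match, backward_match, score, segment), and the choice to give unknown single characters a frequency of 1 are all assumptions made for illustration. The sketch also compares the two segmentations of the whole string, which may be simpler than what the module does at each point.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical toy dictionary (word => frequency); not the module's real data.
my %freq = (
    '研究'   => 50,
    '研究生' => 20,
    '生命'   => 40,
    '命'     => 5,
    '的'     => 500,
    '起源'   => 30,
);
my $max_len = 3;    # length in characters of the longest dictionary entry

# Left-to-right pass: at each position take the longest dictionary word,
# falling back to a single character when nothing matches.
sub forward_match {
    my ($text) = @_;
    my @words;
    my $pos = 0;
    my $n   = length $text;
    while ($pos < $n) {
        my $len = $max_len < $n - $pos ? $max_len : $n - $pos;
        $len-- while $len > 1 && !exists $freq{ substr($text, $pos, $len) };
        push @words, substr($text, $pos, $len);
        $pos += $len;
    }
    return @words;
}

# Right-to-left pass: the same idea, scanning from the end of the string.
sub backward_match {
    my ($text) = @_;
    my @words;
    my $end = length $text;
    while ($end > 0) {
        my $len = $max_len < $end ? $max_len : $end;
        $len-- while $len > 1 && !exists $freq{ substr($text, $end - $len, $len) };
        unshift @words, substr($text, $end - $len, $len);
        $end -= $len;
    }
    return @words;
}

# Product of word frequencies; unknown words count as 1 (an assumption).
sub score {
    my $product = 1;
    $product *= ($freq{$_} // 1) for @_;
    return $product;
}

# Keep whichever direction yields the higher frequency product.
sub segment {
    my ($text) = @_;
    my @fwd = forward_match($text);
    my @bwd = backward_match($text);
    return score(@bwd) > score(@fwd) ? @bwd : @fwd;
}
```

With this toy dictionary, the left-to-right pass splits '研究生命的起源' starting with '研究生', while the right-to-left pass starts from '起源' and recovers '研究' and '生命'; the second has the higher frequency product, so segment() returns it.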
    seg($str_in);     # Return the string of the segmentation result.
    seg_STDIO();      # Read from STDIN, and print the segmented result to STDOUT
    set_dic($dictionary_file_name)
    # The format of the dictionary file for each line is:
    #     "chineseWord\tFrequency\n"
    #
    # Notice that if you don't call set_dic,
    # the default dictionary in GBK encoding will be loaded.
    # The default dictionary is extracted from the corpus of the People's Daily,
    # January, 1998.
    # Thanks to the Institute of Computational Linguistics, Peking University, China.
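As a rough illustration of the "chineseWord\tFrequency\n" format, the following sketch writes a small dictionary file and reads it back into a hash. The sample words and counts, the read_dic helper, and the use of UTF-8 are assumptions made for this example; the text above only states that the default dictionary is in GBK, and how the module parses the file internally is not shown here.

```perl
use strict;
use warnings;
use utf8;
use File::Temp qw(tempfile);

# Write a hypothetical three-entry dictionary: one "word<TAB>frequency" per line.
my ($fh, $path) = tempfile(UNLINK => 1);
binmode $fh, ':encoding(UTF-8)';    # encoding chosen for illustration only
print {$fh} "北京\t120\n", "大学\t95\n", "北京大学\t40\n";
close $fh or die "Cannot close $path: $!";

# Read a dictionary file in that format into a word => frequency hash.
# Lines without a tab or with a non-numeric frequency are skipped.
sub read_dic {
    my ($file) = @_;
    my %freq;
    open my $in, '<:encoding(UTF-8)', $file or die "Cannot open $file: $!";
    while (my $line = <$in>) {
        chomp $line;
        next unless length $line;
        my ($word, $count) = split /\t/, $line, 2;
        next unless defined $count && $count =~ /^\d+$/;
        $freq{$word} = $count;
    }
    close $in;
    return \%freq;
}

my $dic = read_dic($path);
```

A file laid out this way is what set_dic($dictionary_file_name) expects to be given, according to the format described above.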
Rewritten by Chen Yirong <email@example.com> on September 21, 2006, and modified on February 20, 2007. Original author: Joy (firstname.lastname@example.org), July 4th, 2001.
Many thanks to Joy, who made the code available. Thanks to the PKU Corpus (from the Institute of Computational Linguistics, Peking University, China), which helped to automatically generate the default dictionary.
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.