Lingua::YALI::Examples - Examples of usage.
version 0.009_01
Download training and test data
    # download data
    for i in `seq 1 20`; do
        id=`printf "%02d" $i`;
        echo "Processing document $id";
        lynx --dump 'http://en.wikipedia.org/wiki/Special:Random' -noprint --nolist --nonumbers --nomargins -width=10000 > eng.$id.txt;
        lynx --dump 'http://cs.wikipedia.org/wiki/Special:Random' -noprint --nolist --nonumbers --nomargins -width=10000 > ces.$id.txt;
        lynx --dump 'http://fr.wikipedia.org/wiki/Special:Random' -noprint --nolist --nonumbers --nomargins -width=10000 > fra.$id.txt;
    done;

    # prepare training data
    ls ces.* | head -n15 > list.ces.train;
    ls eng.* | head -n15 > list.eng.train;
    ls fra.* | head -n15 > list.fra.train;

    # prepare testing data
    ls ces.* | tail -n5 > list.ces.test;
    ls eng.* | tail -n5 > list.eng.test;
    ls fra.* | tail -n5 > list.fra.test;
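The 15/5 split above can be checked without touching the network. The following is a minimal sketch, not part of Lingua::YALI: it creates empty placeholder documents in a scratch directory, applies the same head/tail split through a hypothetical `split_lang` helper, and counts how many file names end up in both lists.

```shell
#!/bin/sh
# Hypothetical helper (not part of YALI): split the files of one language
# into train and test lists with the same head/tail split as above.
split_lang() {
    lang=$1
    ls "$lang".* | head -n 15 > "list.$lang.train"
    ls "$lang".* | tail -n 5  > "list.$lang.test"
}

# Work in a scratch directory with empty placeholder documents.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
for lang in ces eng fra; do
    for i in $(seq 1 20); do
        id=$(printf "%02d" "$i")
        : > "$lang.$id.txt"
    done
    split_lang "$lang"
done

# Count names that appear in both lists; 0 means the split is disjoint.
overlap=$(cat list.ces.train list.ces.test | sort | uniq -d | wc -l)
echo "ces: $(wc -l < list.ces.train) train, $(wc -l < list.ces.test) test, overlap $overlap"
```

Because the document ids are zero-padded, `ls` sorts them in numeric order, so the first 15 files go to training and the last 5 to testing.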
This section shows how to use the scripts yali-builder, yali-identifier, and yali-language-identifier.
Language Identification with Pretrained Models

    # language identification for Czech files
    yali-language-identifier -l="eng ces fra" -filelist=list.ces.test

    # language identification for English files with a different output format
    yali-language-identifier -l="eng ces fra" -filelist=list.eng.test -f=all_p

    # language identification for French files, with the file list read from STDIN
    cat list.fra.test | yali-language-identifier -l="eng ces fra" -filelist=- -f=tabbed

    # single file
    yali-language-identifier -l="eng ces fra" -i=ces.20.txt -f=all

    # single file read from STDIN
    cat eng.20.txt | yali-language-identifier -l="eng ces fra" -i=- -f=all_p

    # single file read from STDIN (no -i option given)
    cat fra.20.txt | yali-language-identifier -l="eng ces fra" -f=all_p
    # Czech bigram model with only the 5 most frequent bigrams stored
    yali-builder --filelist=list.ces.train -n=2 -c=5 -o model.2.5.ces.gz

    # English bigram model with only the 5 most frequent bigrams stored
    cat list.eng.train | yali-builder --filelist=- -n=2 -c=5 -o model.2.5.eng.gz

    # French bigram model with only the 5 most frequent bigrams stored
    cat list.fra.train | xargs cat | yali-builder -i=- -n=2 -c=5 -o model.2.5.fra.gz

    # create a list of models
    echo -e "ces\tmodel.2.5.ces.gz" > list.models.2
    echo -e "eng\tmodel.2.5.eng.gz" >> list.models.2
    echo -e "fra\tmodel.2.5.fra.gz" >> list.models.2
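Since `echo -e` behaves differently across shells, the same tab-separated model list can be written with `printf`. This is a minimal sketch that assumes the file and model names used above; the `awk` check is purely illustrative and is not something YALI performs.

```shell
#!/bin/sh
# Write one "class<TAB>model file" line per language, mirroring the echo commands above.
models=list.models.2
: > "$models"
for lang in ces eng fra; do
    printf '%s\t%s\n' "$lang" "model.2.5.$lang.gz" >> "$models"
done

# Illustrative check: count lines that do not have exactly two tab-separated fields.
bad=$(awk -F'\t' 'NF != 2 { n++ } END { print n + 0 }' "$models")
echo "malformed lines: $bad"
```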
Only two changes to the commands presented in the section "Language Identification with Pretrained Models" are required:
Change yali-language-identifier to yali-identifier.
Change -l="eng ces fra" to -c=list.models.2.
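The two changes above can also be applied mechanically. The sketch below only rewrites the text of a command with `sed` and does not run any YALI tool; the function name `to_own_models` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: rewrite a pretrained-model command so that it uses own models.
to_own_models() {
    printf '%s\n' "$1" | sed \
        -e 's/yali-language-identifier/yali-identifier/' \
        -e 's/-l="[^"]*"/-c=list.models.2/'
}

to_own_models 'yali-language-identifier -l="eng ces fra" -filelist=list.ces.test'
# prints: yali-identifier -c=list.models.2 -filelist=list.ces.test
```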
    # language identification for Czech files
    yali-identifier -c=list.models.2 -filelist=list.ces.test

    # language identification for English files with a different output format
    yali-identifier -c=list.models.2 -filelist=list.eng.test -f=all_p

    # language identification for French files, with the file list read from STDIN
    cat list.fra.test | yali-identifier -c=list.models.2 -filelist=- -f=tabbed

    # single file
    yali-identifier -c=list.models.2 -i=ces.20.txt -f=all

    # single file read from STDIN
    cat eng.20.txt | yali-identifier -c=list.models.2 -i=- -f=all_p

    # single file read from STDIN (no -i option given)
    cat fra.20.txt | yali-identifier -c=list.models.2 -f=all_p
    use Lingua::YALI::LanguageIdentifier;

    # create identifier and register languages
    my $identifier = Lingua::YALI::LanguageIdentifier->new();
    $identifier->add_language("ces", "eng");

    # identify string
    my $result = $identifier->identify_string("CPAN, the Comprehensive Perl Archive Network, is an archive of modules written in Perl.");
    print "The most probable language is " . $result->[0]->[0] . ".\n";
    # prints out: The most probable language is eng.
    use Lingua::YALI::Builder;
    use Lingua::YALI::Identifier;

    # create models
    my $builder_a = Lingua::YALI::Builder->new(ngrams=>[2]);
    $builder_a->train_string("aaaaa aaaa aaa aaa aaa aaaaa aa");
    $builder_a->store("model_a.2_all.gz", 2);

    my $builder_b = Lingua::YALI::Builder->new(ngrams=>[2]);
    $builder_b->train_string("bbbbbb bbbb bbbb bbb bbbb bbbb bbb");
    $builder_b->store("model_b.2_all.gz", 2);

    # create identifier and load models
    my $identifier = Lingua::YALI::Identifier->new();
    $identifier->add_class("a", "model_a.2_all.gz");
    $identifier->add_class("b", "model_b.2_all.gz");

    # identify strings
    my $result1 = $identifier->identify_string("aaaaaaaaaaaaaaaaaaa");
    print $result1->[0]->[0] . "\t" . $result1->[0]->[1] . "\n";
    # prints out: a 1

    my $result2 = $identifier->identify_string("bbbbbbbbbbbbbbbbbbb");
    print $result2->[0]->[0] . "\t" . $result2->[0]->[1] . "\n";
    # prints out: b 1
Martin Majlis <martin@majlis.cz>
This software is Copyright (c) 2012 by Martin Majlis.
This is free software, licensed under:
The (three-clause) BSD License