
NAME

Lingua::YALI::Examples - Usage examples.

VERSION

version 0.009_01

Introduction

Preparation

Download training and test data

    # download data
    for i in `seq 1 20`; do
        id=`printf "%02d" $i`;
        echo "Processing document $id";
        lynx --dump 'http://en.wikipedia.org/wiki/Special:Random' -noprint --nolist --nonumbers --nomargins -width=10000 > eng.$id.txt;
        lynx --dump 'http://cs.wikipedia.org/wiki/Special:Random' -noprint --nolist --nonumbers --nomargins -width=10000 > ces.$id.txt;
        lynx --dump 'http://fr.wikipedia.org/wiki/Special:Random' -noprint --nolist --nonumbers --nomargins -width=10000 > fra.$id.txt;
    done;

    # prepare training data
    ls ces.* | head -n15 > list.ces.train;
    ls eng.* | head -n15 > list.eng.train;
    ls fra.* | head -n15 > list.fra.train;
    
    # prepare testing data
    ls ces.* | tail -n5 > list.ces.test;
    ls eng.* | tail -n5 > list.eng.test;
    ls fra.* | tail -n5 > list.fra.test;
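
The split above relies on the zero-padded document ids: ls sorts the 20 files by name, so head -n15 selects documents 01 to 15 and tail -n5 selects documents 16 to 20, with no overlap. The following is a minimal sketch that checks this on empty stand-in files. The "demo" prefix and the temporary directory are assumptions made for illustration and are not part of YALI.

```shell
# Sketch only: reproduce the train/test split on empty stand-in files.
# The "demo" prefix is hypothetical; the real data uses ces, eng and fra.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# create 20 empty files named like the downloaded documents
for i in $(seq 1 20); do
    id=$(printf "%02d" "$i")
    : > "demo.$id.txt"
done

# same split as in the preparation step
ls demo.* | head -n15 > list.demo.train
ls demo.* | tail -n5 > list.demo.test

train_count=$(wc -l < list.demo.train | tr -d ' ')
test_count=$(wc -l < list.demo.test | tr -d ' ')
# number of files that appear in both lists (comm expects sorted input)
overlap=$(comm -12 list.demo.train list.demo.test | wc -l | tr -d ' ')

echo "train=$train_count test=$test_count overlap=$overlap"
```

If fewer than 20 documents were downloaded for a language, the two lists can overlap, so a check of this kind is worth running before training.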

Scripts

This section shows how to use the scripts yali-builder, yali-identifier, and yali-language-identifier.

Language Identification with Pretrained Models

    # language identification for Czech files
    yali-language-identifier -l="eng ces fra" -filelist=list.ces.test
    
    # language identification for English files with a different output format
    yali-language-identifier -l="eng ces fra" -filelist=list.eng.test -f=all_p
    
    # language identification for French files, with the file list read from STDIN
    cat list.fra.test | yali-language-identifier -l="eng ces fra" -filelist=- -f=tabbed
    
    # single file
    yali-language-identifier -l="eng ces fra" -i=ces.20.txt -f=all
    
    # single file read from STDIN
    cat eng.20.txt | yali-language-identifier -l="eng ces fra" -i=- -f=all_p

    # single file read from STDIN, with the -i option omitted
    cat fra.20.txt | yali-language-identifier -l="eng ces fra" -f=all_p

Building Own Models

    # Czech bigram model with only the 5 most frequent bigrams stored
    yali-builder --filelist=list.ces.train -n=2 -c=5 -o model.2.5.ces.gz
    
    # English bigram model with only the 5 most frequent bigrams stored,
    # with the file list read from STDIN
    cat list.eng.train | yali-builder --filelist=- -n=2 -c=5 -o model.2.5.eng.gz
    
    # French bigram model with only the 5 most frequent bigrams stored,
    # with the concatenated training text read from STDIN
    cat list.fra.train | xargs cat | yali-builder -i=- -n=2 -c=5 -o model.2.5.fra.gz
    
    # create list with models
    printf 'ces\tmodel.2.5.ces.gz\n' > list.models.2
    printf 'eng\tmodel.2.5.eng.gz\n' >> list.models.2
    printf 'fra\tmodel.2.5.fra.gz\n' >> list.models.2
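
The model list created above holds one line per class: the class name, a tab, and the path to the model file. Before passing the list to yali-identifier it can be useful to confirm that every listed model exists. The helper below is a minimal sketch of such a check. check_model_list is a hypothetical name introduced here and is not part of YALI.

```shell
# Sketch only: check_model_list is a hypothetical helper, not a YALI script.
# It reads a tab-separated list (class<TAB>model path), reports every model
# file that does not exist on STDERR, and prints the number of missing files.
check_model_list() {
    list=$1
    missing=0
    tab=$(printf '\t')
    while IFS="$tab" read -r class model; do
        # skip blank lines
        [ -z "$class" ] && continue
        if [ ! -f "$model" ]; then
            echo "missing model for $class: $model" >&2
            missing=$((missing + 1))
        fi
    done < "$list"
    echo "$missing"
}
```

Called as check_model_list list.models.2, it prints 0 when all three models are in place.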

Language Identification with Own Models

Only two changes to the commands in the section "Language Identification with Pretrained Models" are required:

  • Change yali-language-identifier to yali-identifier.

  • Change -l="eng ces fra" to -c=list.models.2.

    # language identification for Czech files
    yali-identifier -c=list.models.2 -filelist=list.ces.test
    
    # language identification for English files with a different output format
    yali-identifier -c=list.models.2 -filelist=list.eng.test -f=all_p
    
    # language identification for French files, with the file list read from STDIN
    cat list.fra.test | yali-identifier -c=list.models.2 -filelist=- -f=tabbed
    
    # single file
    yali-identifier -c=list.models.2 -i=ces.20.txt -f=all
    
    # single file read from STDIN
    cat eng.20.txt | yali-identifier -c=list.models.2 -i=- -f=all_p

    # single file read from STDIN, with the -i option omitted
    cat fra.20.txt | yali-identifier -c=list.models.2 -f=all_p

Modules

Language Identification

    use Lingua::YALI::LanguageIdentifier;
    
    # create identifier and register languages
    my $identifier = Lingua::YALI::LanguageIdentifier->new();
    $identifier->add_language("ces", "eng");
    
    # identify string
    my $result = $identifier->identify_string("CPAN, the Comprehensive Perl Archive Network, is an archive of modules written in Perl.");
    print "The most probable language is " . $result->[0]->[0] . ".\n";
    # prints out The most probable language is eng.

Training Models

    use Lingua::YALI::Builder;
    use Lingua::YALI::Identifier;
    
    # create models
    my $builder_a = Lingua::YALI::Builder->new(ngrams=>[2]);
    $builder_a->train_string("aaaaa aaaa aaa aaa aaa aaaaa aa");
    $builder_a->store("model_a.2_all.gz", 2);

    my $builder_b = Lingua::YALI::Builder->new(ngrams=>[2]);
    $builder_b->train_string("bbbbbb bbbb bbbb bbb bbbb bbbb bbb");
    $builder_b->store("model_b.2_all.gz", 2);

    # create identifier and load models
    my $identifier = Lingua::YALI::Identifier->new();
    $identifier->add_class("a", "model_a.2_all.gz");
    $identifier->add_class("b", "model_b.2_all.gz");

    # identify strings
    my $result1 = $identifier->identify_string("aaaaaaaaaaaaaaaaaaa");
    print $result1->[0]->[0] . "\t" . $result1->[0]->[1];
    # prints out a 1
    
    my $result2 = $identifier->identify_string("bbbbbbbbbbbbbbbbbbb");
    print $result2->[0]->[0] . "\t" . $result2->[0]->[1];
    # prints out b 1

AUTHOR

Martin Majlis <martin@majlis.cz>

COPYRIGHT AND LICENSE

This software is Copyright (c) 2012 by Martin Majlis.

This is free software, licensed under:

  The (three-clause) BSD License