NAME

Lingua::YALI::Examples - Usage examples.

VERSION

version 0.009_01

Introduction

Preparation

Download training and test data

    # download data
    for i in `seq 1 20`; do
        id=`printf "%02d" $i`;
        echo "Processing document $id";
        lynx --dump 'http://en.wikipedia.org/wiki/Special:Random' -noprint --nolist --nonumbers --nomargins -width=10000 > eng.$id.txt;
        lynx --dump 'http://cs.wikipedia.org/wiki/Special:Random' -noprint --nolist --nonumbers --nomargins -width=10000 > ces.$id.txt;
        lynx --dump 'http://fr.wikipedia.org/wiki/Special:Random' -noprint --nolist --nonumbers --nomargins -width=10000 > fra.$id.txt;
    done;

    # prepare training data
    ls ces.* | head -n15 > list.ces.train;
    ls eng.* | head -n15 > list.eng.train;
    ls fra.* | head -n15 > list.fra.train;
    
    # prepare testing data
    ls ces.* | tail -n5 > list.ces.test;
    ls eng.* | tail -n5 > list.eng.test;
    ls fra.* | tail -n5 > list.fra.test;
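Before training, it can be worth confirming that the training and test lists do not overlap. The following is a minimal sketch of such a check; check_split is a hypothetical helper written for this example and is not part of Lingua::YALI. It assumes the list.LANG.train and list.LANG.test files created above are in the current directory.

```shell
# Hypothetical helper (not part of Lingua::YALI): report the size of the
# training and test lists for one language and how many file names they share.
check_split() {
    lang=$1
    train="list.$lang.train"
    test_list="list.$lang.test"
    n_train=$(wc -l < "$train" | tr -d ' ')
    n_test=$(wc -l < "$test_list" | tr -d ' ')
    # A file name present in both lists shows up as a duplicate after sorting.
    n_shared=$(sort "$train" "$test_list" | uniq -d | wc -l | tr -d ' ')
    echo "$lang: train=$n_train test=$n_test shared=$n_shared"
}

# Usage, from the directory that holds the list.* files:
#   for lang in ces eng fra; do check_split "$lang"; done
```

If all 20 documents per language were downloaded and no other matching files are present, each language should report 15 training files, 5 test files, and no shared files.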

Scripts

This section shows how to use the scripts yali-builder, yali-identifier, and yali-language-identifier.

Language Identification with Pretrained Models

    # language identification for Czech files
    yali-language-identifier -l="eng ces fra" -filelist=list.ces.test

    # language identification for English files with a different output format
    yali-language-identifier -l="eng ces fra" -filelist=list.eng.test -f=all_p

    # language identification for French files, file list read from STDIN
    cat list.fra.test | yali-language-identifier -l="eng ces fra" -filelist=- -f=tabbed

    # single file
    yali-language-identifier -l="eng ces fra" -i=ces.20.txt -f=all

    # single file read from STDIN
    cat eng.20.txt | yali-language-identifier -l="eng ces fra" -i=- -f=all_p

    # single file read from STDIN (no -i option given)
    cat fra.20.txt | yali-language-identifier -l="eng ces fra" -f=all_p

Building Own Models

    # Czech bigram model with only the 5 most frequent bigrams stored
    yali-builder --filelist=list.ces.train -n=2 -c=5 -o model.2.5.ces.gz

    # English bigram model with only the 5 most frequent bigrams stored, file list read from STDIN
    cat list.eng.train | yali-builder --filelist=- -n=2 -c=5 -o model.2.5.eng.gz

    # French bigram model with only the 5 most frequent bigrams stored, training text read from STDIN
    cat list.fra.train | xargs cat | yali-builder -i=- -n=2 -c=5 -o model.2.5.fra.gz
    
    # create list with models
    echo -e "ces\tmodel.2.5.ces.gz" > list.models.2
    echo -e "eng\tmodel.2.5.eng.gz" >> list.models.2
    echo -e "fra\tmodel.2.5.fra.gz" >> list.models.2
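The model list built above pairs a class name with a model file, separated by a tab. A quick sanity check of that file can catch a missing model or a line that was written without a real tab. The following is a minimal sketch under that assumption; check_model_list is a hypothetical helper for this example and is not part of Lingua::YALI.

```shell
# Hypothetical helper (not part of Lingua::YALI): check that every line of a
# model list has exactly two tab-separated fields and that the model file exists.
check_model_list() {
    list=$1
    status=0
    tab=$(printf '\t')
    while IFS="$tab" read -r class model extra; do
        if [ -z "$class" ] || [ -z "$model" ] || [ -n "$extra" ]; then
            echo "malformed line: $class $model $extra" >&2
            status=1
        elif [ ! -f "$model" ]; then
            echo "missing model file for $class: $model" >&2
            status=1
        fi
    done < "$list"
    return $status
}

# Usage:
#   check_model_list list.models.2 && echo "model list looks fine"
```

The function returns a non-zero status if any line is malformed or points to a file that does not exist, so it can be used as a guard before running the identifier.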

Language Identification with Own Models

Only two changes to the commands presented in the section "Language Identification with Pretrained Models" are required:

Change yali-language-identifier to yali-identifier.
Change -l="eng ces fra" to -c=list.models.2.

    # language identification for Czech files
    yali-identifier -c=list.models.2 -filelist=list.ces.test

    # language identification for English files with a different output format
    yali-identifier -c=list.models.2 -filelist=list.eng.test -f=all_p

    # language identification for French files, file list read from STDIN
    cat list.fra.test | yali-identifier -c=list.models.2 -filelist=- -f=tabbed

    # single file
    yali-identifier -c=list.models.2 -i=ces.20.txt -f=all

    # single file read from STDIN
    cat eng.20.txt | yali-identifier -c=list.models.2 -i=- -f=all_p

    # single file read from STDIN (no -i option given)
    cat fra.20.txt | yali-identifier -c=list.models.2 -f=all_p
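To run the same identification over the test lists of all three languages, the commands above can be wrapped in a loop. The following is a minimal sketch; identify_all and the YALI_CMD variable are assumptions of this example (YALI_CMD only lets the command be swapped, for instance with echo for a dry run) and are not part of Lingua::YALI. The options passed are the ones already used above.

```shell
# Hypothetical wrapper: run the identifier over the test list of every language.
# YALI_CMD is an assumption of this sketch, not a Lingua::YALI setting;
# it defaults to yali-identifier.
identify_all() {
    for lang in ces eng fra; do
        echo "== $lang =="
        ${YALI_CMD:-yali-identifier} -c=list.models.2 -filelist="list.$lang.test"
    done
}

# Usage:
#   identify_all                 # runs yali-identifier three times
#   YALI_CMD=echo; identify_all  # dry run: only prints the arguments
```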

Modules

Language Identification

    use Lingua::YALI::LanguageIdentifier;

    # create identifier and register languages
    my $identifier = Lingua::YALI::LanguageIdentifier->new();
    $identifier->add_language("ces", "eng");

    # identify string
    my $result = $identifier->identify_string("CPAN, the Comprehensive Perl Archive Network, is an archive of modules written in Perl.");
    print "The most probable language is " . $result->[0]->[0] . ".\n";
    # prints out: The most probable language is eng.

Training models

    use Lingua::YALI::Builder;
    use Lingua::YALI::Identifier;

    # create models
    my $builder_a = Lingua::YALI::Builder->new(ngrams=>[2]);
    $builder_a->train_string("aaaaa aaaa aaa aaa aaa aaaaa aa");
    $builder_a->store("model_a.2_all.gz", 2);

    my $builder_b = Lingua::YALI::Builder->new(ngrams=>[2]);
    $builder_b->train_string("bbbbbb bbbb bbbb bbb bbbb bbbb bbb");
    $builder_b->store("model_b.2_all.gz", 2);

    # create identifier and load models
    my $identifier = Lingua::YALI::Identifier->new();
    $identifier->add_class("a", "model_a.2_all.gz");
    $identifier->add_class("b", "model_b.2_all.gz");

    # identify strings
    my $result1 = $identifier->identify_string("aaaaaaaaaaaaaaaaaaa");
    print $result1->[0]->[0] . "\t" . $result1->[0]->[1] . "\n";
    # prints out: a 1

    my $result2 = $identifier->identify_string("bbbbbbbbbbbbbbbbbbb");
    print $result2->[0]->[0] . "\t" . $result2->[0]->[1] . "\n";
    # prints out: b 1

AUTHOR

Martin Majlis <martin@majlis.cz>

COPYRIGHT AND LICENSE

This is free software, licensed under:

  The (three-clause) BSD License
