Lingua::YALI::Builder - Constructs language models for language identification.
version 0.010_04
This module creates models for Lingua::YALI::Identifier.
If your texts come from a specific domain, you can achieve better results by training your models on texts from the same domain.
Creating bigram and trigram models from a string.
    use Lingua::YALI::Builder;

    my $builder = Lingua::YALI::Builder->new(ngrams=>[2, 3]);
    $builder->train_string("aaaaa aaaa aaa aaa aaa aaaaa aa");
    $builder->train_string("aa aaaaaa aa aaaaa aaaaaa aaaaa");

    $builder->store("model_a.2_4.gz", 2, 4);
    $builder->store("model_a.2_all.gz", 2);
    $builder->store("model_a.3_all.gz", 3);
    $builder->store("model_a.4_all.gz", 4); # croaks because 4-grams were not trained
More examples are presented in Lingua::YALI::Examples.
BUILD()
Constructs Builder.
Builder
my $builder = Lingua::YALI::Builder->new(ngrams=>[2, 3, 4]);
my $ngrams = $builder->get_ngrams()
Returns a reference to an array of the distinct n-gram sizes that will be used during training.
    my $builder = Lingua::YALI::Builder->new(ngrams=>[2, 3, 4, 2, 3]);
    my $ngrams = $builder->get_ngrams();
    print join(", ", @$ngrams) . "\n";
    # prints out 2, 3, 4
my $max_ngram = $builder->get_max_ngram()
Returns the highest n-gram size that will be used during training.
    my $builder = Lingua::YALI::Builder->new(ngrams=>[2, 3, 4]);
    print $builder->get_max_ngram() . "\n";
    # prints out 4
my $used_bytes = $builder->train_file($file)
Uses the file $file for training and returns the number of bytes used.
$file
It returns undef if $file is undef.
It croaks if the file $file does not exist or is not readable.
It returns the number of bytes used for training otherwise.
For more details, see the method "train_handle".
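The return and error behaviour of train_file described above can be sketched as follows. This is a minimal illustration that assumes Lingua::YALI is installed; the temporary file and its contents are invented for the example, and File::Temp is used only to keep the sketch self-contained.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use Lingua::YALI::Builder;

my $builder = Lingua::YALI::Builder->new(ngrams => [2, 3]);

# Hypothetical training file, created only for this example.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
print {$tmp_fh} "some domain specific text\n";
close $tmp_fh;

# On success train_file returns the number of bytes used.
my $used = $builder->train_file($path);
print "trained on $used bytes\n" if defined $used;

# A missing file makes train_file croak, so the call is wrapped in eval.
my $ok = eval { $builder->train_file("$path.does-not-exist"); 1 };
print "training failed: $@" unless $ok;
```

Wrapping the call in eval is one way to keep a batch job running when a single input file is missing; letting the croak propagate is equally valid when a missing file should stop the run.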
my $used_bytes = $builder->train_string($string)
Uses the string $string for training and returns the number of bytes used.
$string
It returns undef if $string is undef.
my $used_bytes = $builder->train_handle($fh)
Uses the file handle $fh for training and returns the number of bytes used.
$fh
It returns undef if $fh is undef.
It croaks if $fh is not a file handle.
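A short sketch of train_handle, assuming Lingua::YALI is installed. The sample text is made up, and an in-memory file handle is used here only so that the example needs no file on disk; any readable file handle should work the same way.

```perl
use strict;
use warnings;
use Lingua::YALI::Builder;

my $builder = Lingua::YALI::Builder->new(ngrams => [2, 3]);

# In-memory file handle over a made-up sample text.
my $text = "first line of sample text\nsecond line of sample text\n";
open(my $fh, '<', \$text) or die "cannot open in-memory handle: $!";

# train_handle reads from the handle and returns the number of bytes used.
my $used = $builder->train_handle($fh);
close $fh;
print "trained on $used bytes\n" if defined $used;

# Passing something that is not a file handle croaks; eval traps the error.
my $ok = eval { $builder->train_handle("not a handle"); 1 };
print "training failed: $@" unless $ok;
```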
my $stored_count = $builder->store($file, $ngram, $count)
Stores the trained model with at most $count $ngram-grams to the file $file. If $count is not specified, all $ngram-grams are stored.
$count
$ngram
It croaks if incorrect parameters are passed or if the requested n-grams were not trained.
It returns the number of stored n-grams.
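The following sketch shows how the return value and the croak of store might be handled. It assumes Lingua::YALI is installed; the training sentence, the file names and the temporary directory are chosen only for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use Lingua::YALI::Builder;

my $builder = Lingua::YALI::Builder->new(ngrams => [2, 3]);
$builder->train_string("the quick brown fox jumps over the lazy dog");

# Temporary directory for the example models, removed on exit.
my $dir = tempdir(CLEANUP => 1);

# Keep at most 10 bigrams; the return value is the number actually stored.
my $stored = $builder->store("$dir/model.2_10.gz", 2, 10);
print "stored $stored bigrams\n";

# 4-grams were never requested from this builder, so store croaks.
my $ok = eval { $builder->store("$dir/model.4_all.gz", 4); 1 };
print "could not store 4-grams: $@" unless $ok;
```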
Trained models are suitable for Lingua::YALI::Identifier.
There is also a command-line tool, yali-builder, with similar functionality.
The source code is available at https://github.com/martin-majlis/YALI.
Martin Majlis <martin@majlis.cz>
This software is Copyright (c) 2012 by Martin Majlis.
This is free software, licensed under:
The (three-clause) BSD License