NAME

Lingua::YALI::Builder - Constructs language models for language identification.

VERSION

version 0.013

SYNOPSIS

This module creates models for Lingua::YALI::Identifier.

If your texts come from a specific domain, you can achieve better results by training your models on texts from the same domain.

Creating bigram and trigram models from a string.

    use Lingua::YALI::Builder;
    my $builder = Lingua::YALI::Builder->new(ngrams=>[2, 3]);
    $builder->train_string("aaaaa aaaa aaa aaa aaa aaaaa aa");
    $builder->train_string("aa aaaaaa aa aaaaa aaaaaa aaaaa");
    $builder->store("model_a.2_4.gz", 2, 4);
    $builder->store("model_a.2_all.gz", 2);
    $builder->store("model_a.3_all.gz", 3);
    $builder->store("model_a.4_all.gz", 4);
    # croaks because 4-grams were not trained

More examples are presented in Lingua::YALI::Examples.

METHODS

BUILD

    BUILD()

Constructs the builder. The ngrams parameter lists the n-gram sizes that will be used during training.

    my $builder = Lingua::YALI::Builder->new(ngrams=>[2, 3, 4]);

get_ngrams

    my $ngrams = $builder->get_ngrams()

Returns a reference to an array of all n-gram sizes that will be used during training. Sizes passed more than once are listed only once.

    my $builder = Lingua::YALI::Builder->new(ngrams=>[2, 3, 4, 2, 3]);
    my $ngrams = $builder->get_ngrams();
    print join(", ", @$ngrams) . "\n";
    # prints out 2, 3, 4

get_max_ngram

    my $max_ngram = $builder->get_max_ngram()

Returns the highest n-gram size that will be used during training.

    my $builder = Lingua::YALI::Builder->new(ngrams=>[2, 3, 4]);
    print $builder->get_max_ngram() . "\n";
    # prints out 4

train_file

    my $used_bytes = $builder->train_file($file)

Uses the file $file for training and returns the number of bytes used.

  • It returns undef if $file is undef.

  • It croaks if the file $file does not exist or is not readable.

  • It returns the number of bytes used for training otherwise.

For more details, see the method "train_handle".

train_string

    my $used_bytes = $builder->train_string($string)

Uses the string $string for training and returns the number of bytes used.

  • It returns undef if $string is undef.

  • It returns the number of bytes used for training otherwise.

For more details, see the method "train_handle".

train_handle

    my $used_bytes = $builder->train_handle($fh)

Uses the file handle $fh for training and returns the number of bytes used.

  • It returns undef if $fh is undef.

  • It croaks if $fh is not a file handle.

  • It returns the number of bytes used for training otherwise.

store

    my $stored_count = $builder->store($file, $ngram, $count)

Stores the trained model, with at most $count $ngram-grams, in the file $file. If $count is not specified, all $ngram-grams are stored.

  • It croaks if incorrect parameters are passed or if the builder has not been trained.

  • It returns the number of stored n-grams.
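
To make the meaning of "at most $count $ngram-grams" more concrete, the following is a minimal sketch in core Perl of counting n-grams and applying such a limit. It is not the module's implementation: the helper top_ngrams is hypothetical, and the sketch assumes that n-grams are counted per character and that the most frequent ones are the ones kept, neither of which is stated above.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT part of Lingua::YALI::Builder.
# Counts every substring of length $n in $text and returns up to $count
# [ngram, frequency] pairs. Assumption: the most frequent n-grams win.
sub top_ngrams {
    my ($text, $n, $count) = @_;

    my %freq;
    for my $i (0 .. length($text) - $n) {
        $freq{ substr($text, $i, $n) }++;
    }

    # Sort by descending frequency, then alphabetically for a stable order.
    my @sorted = sort { $freq{$b} <=> $freq{$a} || $a cmp $b } keys %freq;

    # Without a limit, keep everything (mirrors the "all n-grams" default).
    splice(@sorted, $count) if defined $count && $count < @sorted;

    return map { [ $_, $freq{$_} ] } @sorted;
}

# Print the two most frequent bigrams of a short training string.
for my $pair (top_ngrams("aaaaa aaaa aaa", 2, 2)) {
    printf "'%s'\t%d\n", @$pair;
}
```

Leaving out the third argument returns every bigram that was seen, which corresponds to calling store without $count.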

SEE ALSO

AUTHOR

Martin Majlis <martin@majlis.cz>

COPYRIGHT AND LICENSE

This software is Copyright (c) 2012 by Martin Majlis.

This is free software, licensed under:

  The (three-clause) BSD License