NAME

Algorithm::AM::Batch - Classify items in batch mode

VERSION

version 3.02

SYNOPSIS

  use Algorithm::AM;
  use Algorithm::AM::Batch;
  my $dataset = dataset_from_file('finnverb');
  my $batch = Algorithm::AM::Batch->new(
    training_set => $dataset,
    # print the result of each classification as it is produced
    end_test_hook => sub {
      my ($batch, $test_item, $result) = @_;
      print $test_item->comment . ' ' . $result->result . "\n";
    }
  );
  my @results = $batch->classify_all($dataset);

DESCRIPTION

Batch provides a way to classify entire data sets by repeatedly calling classify with the provided configuration. Hooks are also provided so that the training set and classification parameters can be changed over time. All of the action happens in "classify_all".

EXPORTS

When this module is imported, it also imports the following:

Algorithm::AM
Algorithm::AM::Result
Algorithm::AM::DataSet

Also imports the "dataset_from_file" function from Algorithm::AM::DataSet.

Algorithm::AM::DataSet::Item

Also imports the "new_item" function from Algorithm::AM::DataSet::Item.

Algorithm::AM::BigInt

Also imports the "bigcmp" function from Algorithm::AM::BigInt.

METHODS

new

Creates a new object instance. This method takes named parameters which call the methods described in the relevant documentation sections. The only required parameter is "training_set", which should be an instance of Algorithm::AM::DataSet, and which provides a pool of items to be used for training during classification. All of the accepted parameters are listed below:

"training_set"
"repeat"
"probability"
"max_training_items"
"exclude_nulls"
"exclude_given"
"linear"

training_set

Returns the dataset used for training.

test_set

Returns the test set currently providing the source of items to "classify_all". Before and after classify_all, this returns undef, and so is only useful when called from inside one of the hook subroutines.

repeat

Determines how many times each individual test item will be analyzed. As the analogical modeling algorithm is deterministic, it only makes sense to use this if the training set is modified somehow during each iteration, for example via "probability" or "training_item_hook". The default value is 1.

probability

Get/set the probability that any one training item will be included among the training items used during classification. The default is 1.

max_training_items

Get/set the maximum number of items considered for addition to the training set. Note that this is the number considered, not the number actually added, so when combined with "probability" or "training_item_hook" your training set could be smaller than the number specified.
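The difference between "considered" and "actually added" can be illustrated with a small, self-contained sketch. It does not use Algorithm::AM at all: select_training_items is a hypothetical helper written for this illustration, and it assumes that items are considered in pool order and that the probability check runs before the hook. The module's real internals may differ.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT part of Algorithm::AM::Batch. It only
# illustrates "considered" versus "actually added".
sub select_training_items {
    my (%args) = @_;
    my @pool   = @{ $args{pool} };
    my $max    = $args{max_training_items};
    my $prob   = $args{probability};
    my $filter = $args{training_item_hook} || sub { 1 };

    # Assumption: only the first $max items of the pool are considered.
    $max = @pool if !defined $max || $max > @pool;

    my (@included, @excluded);
    for my $item (@pool[0 .. $max - 1]) {
        # Keep the item with probability $prob, and only if the hook agrees.
        if (rand() < $prob && $filter->($item)) {
            push @included, $item;
        }
        else {
            push @excluded, $item;
        }
    }
    return (\@included, \@excluded);
}

my ($included, $excluded) = select_training_items(
    pool               => [1 .. 10],
    max_training_items => 6,
    probability        => 1,
    training_item_hook => sub { $_[0] % 2 == 0 },    # keep even items only
);

print "considered: ", @$included + @$excluded, "\n";
print "included:   @$included\n";
```

Here six items are considered, but only three survive the hook, so the resulting training set is smaller than max_training_items.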

exclude_nulls

This is passed directly to the new method of Algorithm::AM during each classification in the "classify_all" method.

exclude_given

This is passed directly to the new method of Algorithm::AM during each classification in the "classify_all" method.

linear

This is passed directly to the new method of Algorithm::AM during each classification in the "classify_all" method.

classify_all

Using the analogical modeling algorithm, this method classifies the items in the given test set and returns a list of Result objects.

Log::Any is used to log information about the current progress and timing. The statistical summary, analogical set, and gang summary (without items listed) are logged at the info level, and the full gang summary with items listed is logged at the debug level.

Hooks are provided to the user for monitoring or modifying classification configuration. These hooks may be passed into the object constructor or set via one of the accessor methods. Batch classification proceeds as follows:

  call begin_hook
  loop all test set items
    call begin_test_hook
    repeat X times, where X is specified by the "repeat" setting
      call begin_repeat_hook
      create a training set:
        for each item in the provided training set, up to max_training_items
          exclude the item with probability 1 - probability
          exclude the item if specified via training_item_hook
      classify the item with the given training set
      call end_repeat_hook
    call end_test_hook
  call end_hook

The Batch object itself is passed to these hooks, so the user is free to change settings such as "probability" or "max_training_items", or even add training data, at any point. Other information is passed to these hooks as well, as detailed in the method documentation.
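To make the order of hook calls concrete, the following is a minimal pure-Perl sketch of the control flow described above. It is not the module's implementation: run_batch_sketch and its arguments are hypothetical names invented for this illustration, the Batch object that the real hooks receive as their first argument is omitted, and training set construction and classification are replaced by a comment.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the control flow of classify_all;
# NOT the module's code. Hooks that are not supplied are skipped.
sub run_batch_sketch {
    my (%args) = @_;
    my $hooks  = $args{hooks}  || {};
    my $repeat = $args{repeat} || 1;

    my $call = sub {
        my ($name, @hook_args) = @_;
        return $hooks->{$name} ? $hooks->{$name}->(@hook_args) : ();
    };

    $call->('begin_hook');
    for my $test_item (@{ $args{test_items} }) {
        $call->('begin_test_hook', $test_item);
        for my $iteration (1 .. $repeat) {
            $call->('begin_repeat_hook', $test_item, $iteration);
            # (training set selection and classification would happen here)
            $call->('end_repeat_hook', $test_item, $iteration);
        }
        $call->('end_test_hook', $test_item);
    }
    $call->('end_hook');
    return;
}

# Record every hook call so the order can be inspected.
my @trace;
my %hooks = map {
    my $name = $_;
    ($name => sub { push @trace, join ' ', $name, @_ });
} qw(begin_hook begin_test_hook begin_repeat_hook
     end_repeat_hook end_test_hook end_hook);

run_batch_sketch(
    test_items => ['a', 'b'],
    repeat     => 2,
    hooks      => \%hooks,
);

print "$_\n" for @trace;
```

With two test items and a repeat setting of 2, the trace starts with begin_hook, shows each test item bracketed by begin_test_hook and end_test_hook with two begin_repeat_hook/end_repeat_hook pairs in between, and finishes with end_hook.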

begin_hook

  $batch->begin_hook(sub {
    my ($batch) = @_;
    $batch->probability(.5);
  });

This hook is called first thing in the "classify_all" method, and is given the Batch object instance.

begin_test_hook

  $batch->begin_test_hook(sub {
    my ($batch, $test_item) = @_;
    $batch->probability(.5);
    print $test_item->comment . "\n";
  });

This hook is called by "classify_all" before any iterations of classification start for each test item. It is provided with the Batch object instance and the test item.

begin_repeat_hook

  $batch->begin_repeat_hook(sub {
    my ($batch, $test_item, $iteration) = @_;
    $batch->probability(.5);
    print $test_item->comment . "\n";
    print "I'm on iteration $iteration\n";
  });

This hook is called during "classify_all" at the beginning of each iteration of classification of a test item. It is provided with the Batch object instance, the test item, and the iteration number, which will vary between 1 and the setting for "repeat".

training_item_hook

  $batch->training_item_hook(sub {
    my ($batch, $test_item, $iteration, $training_item) = @_;
    $batch->probability(.5);
    print $test_item->comment . "\n";
    print "I'm on iteration $iteration\n";
    if($training_item->comment eq 'include me!'){
      return 1;
    }else{
      return 0;
    }
  });

This hook is called by "classify_all" while populating a training set during each iteration of classification. It is provided with the Batch object instance, the test item, the iteration number, and an item which may be included in the training set. If the return value is true, then the item will be included in the training set; otherwise, it will not.

end_repeat_hook

  $batch->end_repeat_hook(sub {
    my ($batch, $test_item, $iteration, $excluded_items, $result) = @_;
    $batch->probability(.5);
    print $test_item->comment . "\n";
    print "I finished iteration $iteration\n";
    print 'I excluded ' . scalar @$excluded_items .
      " items from training\n";
    print ${$result->statistical_summary};
  });

This hook is called during "classify_all" at the end of each iteration of classification of a test item. It is provided with the Batch object instance, the test item, the iteration number, an array ref containing training items excluded from the training set, and the result object returned by classify.

end_test_hook

  $batch->end_test_hook(sub {
    my ($batch, $test_item, @results) = @_;
    $batch->probability(.5);
    print $test_item->comment . "\n";
    # one Result object per iteration
    my $iterations = @results;
    my $correct = 0;
    for my $result (@results){
      $correct++ if $result->result ne 'incorrect';
    }
    print 'Item ' . $test_item->comment .
      " correct $correct/$iterations times\n";
  });

This hook is called by "classify_all" after all classifications of a single item are finished. It is provided with the Batch object instance, the test item, and a list of the Result objects returned by "classify" in Algorithm::AM during each iteration of classification.

end_hook

  $batch->end_hook(sub {
    my ($batch, @results) = @_;
    for my $result(@results){
      print ${$result->statistical_summary};
    }
  });

This hook is called after all classifications are finished. It is provided with the Batch object instance as well as a list of all of the Result objects returned by "classify" in Algorithm::AM.

AUTHOR

Theron Stanford <shixilun@yahoo.com>, Nathan Glenn <garfieldnate@gmail.com>

COPYRIGHT AND LICENSE

This software is copyright (c) 2013 by Royal Skousen.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.