NAME

Algorithm::AM::DataSet - Manage data used by Algorithm::AM

VERSION

version 3.12

SYNOPSIS

 use Algorithm::AM::DataSet 'dataset_from_file';
 use Algorithm::AM::DataSet::Item 'new_item';
 my $dataset = Algorithm::AM::DataSet->new(cardinality => 10);
 # or
 $dataset = dataset_from_file(path => 'finnverb', format => 'nocommas');
 $dataset->add_item(
   new_item(features => [qw(a b c d e f g h i)]));
 my $item = $dataset->get_item(2);

DESCRIPTION

This package contains a list of items that can be used by Algorithm::AM or Algorithm::AM::Batch for classification. DataSets can be made one item at a time via the "add_item" method, or they can be read from files via the "dataset_from_file" function.

new

Creates a new DataSet object. You must provide a cardinality argument indicating the number of features to be contained in each data vector. You can then add items via the add_item method. Each item will contain a feature vector, and also optionally a class label and a comment (also called a "spec").

cardinality

Returns the number of features contained in the feature vector of a single item.

size

Returns the number of items in the data set.

classes

Returns the list of all unique class labels in the data set.

add_item

Adds a new item to the data set. The input may be either an Algorithm::AM::DataSet::Item object or the arguments used to create one via its constructor (features, class, comment). This method will croak if the number of features in the item does not match the data set's "cardinality".
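
As a minimal sketch of both calling styles, assuming the module is installed: the feature values, class labels, and comments below are invented for illustration. The eval block shows one way to trap the exception raised on a cardinality mismatch.

```perl
use strict;
use warnings;
use Algorithm::AM::DataSet;
use Algorithm::AM::DataSet::Item 'new_item';

# A data set whose items must each have exactly 3 features.
my $dataset = Algorithm::AM::DataSet->new(cardinality => 3);

# Style 1: pass an existing Item object.
$dataset->add_item(
    new_item(features => [qw(a b c)], class => 'x', comment => 'first'));

# Style 2: pass the Item constructor arguments directly.
$dataset->add_item(
    features => [qw(d e f)], class => 'y', comment => 'second');

# An item with only 2 features does not match the cardinality of 3,
# so add_item croaks; eval traps the exception.
my $ok = eval {
    $dataset->add_item(features => [qw(g h)], class => 'x');
    1;
};
my $error = $ok ? '' : $@;
print "rejected: $error\n" unless $ok;

printf "%d items, %d classes\n", $dataset->size, $dataset->num_classes;
```

Wrapping the call in eval is only one option; a script that treats malformed items as fatal can simply let the exception propagate.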

get_item

Returns the item at the given index. This will be an Algorithm::AM::DataSet::Item object.

num_classes

Returns the number of different classification labels contained in the data set.

dataset_from_file

This function may be exported on request. Given 'path' and 'format' arguments, it reads a file containing a data set and returns a new DataSet object holding that data. The 'path' argument should be the path to the file. The 'format' argument should be either 'commas' or 'nocommas', selecting one of the formats described below. You may also pass 'unknown' and 'null' arguments giving the strings that represent an unknown class value and a null feature value, respectively. By default these are 'UNK' and '='.

The 'commas' file format is shown below:

 class , f eat u re s , your comment here

The commas separate the class label, feature values, and comment; whitespace around the commas is optional. Feature values are separated from one another by whitespace, so here the features are f, eat, u, re, and s.
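
To make the 'commas' format concrete, the following standalone sketch splits the sample line into its three fields. It illustrates the format as described here and is not the module's own parser; parse_commas_line is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: split one 'commas'-format line into
# a class label, an array ref of feature values, and a comment.
sub parse_commas_line {
    my ($line) = @_;
    $line =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
    # Fields are comma-separated; whitespace around the commas is optional.
    # The limit of 3 keeps any later commas inside the comment.
    my ($class, $feats, $comment) = split /\s*,\s*/, $line, 3;
    # Feature values are separated from one another by whitespace.
    my @features = split /\s+/, $feats;
    return ($class, \@features, $comment);
}

my ($class, $features, $comment) =
    parse_commas_line('class , f eat u re s , your comment here');
print "class:    $class\n";
print "features: @$features\n";
print "comment:  $comment\n";
```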

The 'nocommas' file format is shown below:

 class   features  your comment here

Here the class, feature values, and comments are separated by whitespace. Each feature value must be a single character with no separating characters, so here the features are f, e, a, t, u, r, e, and s.
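
The same idea can be sketched for the 'nocommas' format. Again, this is a standalone illustration of the format as described, not the module's own parser, and parse_nocommas_line is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: split one 'nocommas'-format line into
# a class label, an array ref of feature values, and a comment.
sub parse_nocommas_line {
    my ($line) = @_;
    $line =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
    # Fields are separated by runs of whitespace; the limit of 3
    # leaves the spaces inside the comment untouched.
    my ($class, $feats, $comment) = split /\s+/, $line, 3;
    # Every character of the second field is one feature value.
    my @features = split //, $feats;
    return ($class, \@features, $comment);
}

my ($class, $features, $comment) =
    parse_nocommas_line('class   features  your comment here');
print "class:    $class\n";
print "features: @$features\n";
print "comment:  $comment\n";
```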

Lines beginning with a pound character (#) are ignored.
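
The following sketch loads a small 'commas' file with custom 'unknown' and 'null' strings, assuming the module is installed. The file contents, and the choice of 'NA' and '?' as markers, are invented for illustration; the temporary file comes from the core File::Temp module.

```perl
use strict;
use warnings;
use File::Temp 'tempfile';
use Algorithm::AM::DataSet 'dataset_from_file';

# Write a small 'commas'-format file; labels and features are invented.
my ($fh, $path) = tempfile(UNLINK => 1);
print {$fh} <<'END';
# this comment line is ignored
a , 3 1 2 , first item
b , 3 ? 1 , second item
NA , 0 3 2 , class not known
END
close $fh or die "Could not close $path: $!";

# In this file 'NA' marks an unknown class and '?' a null feature,
# overriding the defaults 'UNK' and '='.
my $dataset = dataset_from_file(
    path    => $path,
    format  => 'commas',
    unknown => 'NA',
    null    => '?',
);

printf "%d items with %d features each\n",
    $dataset->size, $dataset->cardinality;
```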

SEE ALSO

For information on creating data sets, see the appendices in the "red book", Analogical Modeling: An exemplar-based approach to language. See also the "green book", Analogical Modeling of Language, for an explanation of the method in general, and the "blue book", Analogy and Structure, for its mathematical basis.

AUTHOR

Theron Stanford <shixilun@yahoo.com>, Nathan Glenn <garfieldnate@gmail.com>

COPYRIGHT AND LICENSE

This software is copyright (c) 2021 by Royal Skousen.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.