NAME

Algorithm::SVMLight - Perl interface to SVMLight Machine-Learning Package

SYNOPSIS

  use Algorithm::SVMLight;
  my $s = Algorithm::SVMLight->new;
  
  $s->add_instance
    (attributes => {foo => 1, bar => 1, baz => 3},
     label => 1);
  
  $s->add_instance
    (attributes => {foo => 2, blurp => 1},
     label => -1);
  
  ... repeat for several more instances, then:
  $s->train;

  # Find results for unseen instances
  my $result = $s->predict
    (attributes => {bar => 3, blurp => 2});

DESCRIPTION

This module implements a Perl interface to Thorsten Joachims' SVMLight package:

    SVMLight is an implementation of Vapnik's Support Vector Machine [Vapnik, 1995] for the problem of pattern recognition, for the problem of regression, and for the problem of learning a ranking function. The optimization algorithms used in SVMlight are described in [Joachims, 2002a]. [Joachims, 1999a]. The algorithm has scalable memory requirements and can handle problems with many thousands of support vectors efficiently.

     -- http://svmlight.joachims.org/

Support Vector Machines in general, and SVMLight specifically, represent some of the best-performing Machine Learning approaches in domains such as text categorization, image recognition, bioinformatics string processing, and others.

For efficiency reasons, the underlying SVMLight engine indexes features by integers, not strings. Since features are commonly thought of by name (e.g. the words in a document, or mnemonic representations of engineered features), we provide in Algorithm::SVMLight a simple mechanism for mapping back and forth between feature names (strings) and feature indices (integers). If you want to use this mechanism, use the add_instance() and predict() methods. If not, use the add_instance_i() (or read_instances()) and predict_i() methods.
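
For example, the first instance from the SYNOPSIS could be added in either of these two ways. The explicit indices in the second call are ones we choose ourselves (here, foo=1, bar=2, baz=3), and the instance name 'doc1' is likewise arbitrary:

  # Using feature names; the module manages the indices internally
  $s->add_instance
    (attributes => {foo => 1, bar => 1, baz => 3},
     label => 1);

  # Using explicit feature indices, bypassing the name-to-index mapping
  $s->add_instance_i(1, 'doc1', [1, 2, 3], [1, 1, 3]);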

INSTALLATION

For installation instructions, please see the README file included with this distribution.

METHODS

new(...)

Creates a new Algorithm::SVMLight object and returns it. Any named arguments that correspond to SVM parameters will cause their corresponding set_***() method to be invoked:

  $s = Algorithm::SVMLight->new(
         type => 2,              # Regression model
         biased_hyperplane => 0, # Nonbiased
         kernel_type => 3,       # Sigmoid
  );

See the set_***(...) method for a list of such parameters.

set_***(...)

The following parameters can be set by using methods with their corresponding names - for instance, the maxiter parameter can be set by using set_maxiter($x), where $x is the new desired value.

  Learning parameters:
     type
     svm_c
     eps
     svm_costratio
     transduction_posratio
     biased_hyperplane
     sharedslack
     svm_maxqpsize
     svm_newvarsinqp
     kernel_cache_size
     epsilon_crit
     epsilon_shrink
     svm_iter_to_shrink
     maxiter
     remove_inconsistent
     skip_final_opt_check
     compute_loo
     rho
     xa_depth
     predfile
     alphafile

  Kernel parameters:
     kernel_type
     poly_degree
     rbf_gamma
     coef_lin
     coef_const
     custom

For an explanation of these parameters, you may be interested in looking at the svm_common.h file in the SVMLight distribution.

It's best to set these parameters only via arguments to new() (see above) or immediately after calling new(), since the underlying C code probably doesn't expect them to change in the middle of a process.
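
For example, to configure an RBF kernel right after construction (the parameter values below are purely illustrative, and kernel_type 2 is assumed to select the RBF kernel, following SVMLight's own numbering):

  my $s = Algorithm::SVMLight->new;
  $s->set_kernel_type(2);   # RBF kernel (assumed numbering, as in SVMLight)
  $s->set_rbf_gamma(0.5);   # illustrative value
  $s->set_svm_c(10);        # illustrative value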

add_instance(label => $x, attributes => \%y)

Adds a training instance to the set of instances which will be used to train the model. An attributes parameter specifies a hash of attribute-value pairs for the instance, and a label parameter specifies the label. The label must be a number, and typically it should be 1 for positive training instances and -1 for negative training instances. The keys of the attributes hash should be strings, and the values should be numbers (the values of each attribute).

All training instances share the same attribute-space; if an attribute is unspecified for a certain instance, it is equivalent to specifying a value of zero. Typically you can save a lot of memory (and potentially training time) by omitting zero-valued attributes.

Each training instance may have a "cost factor" assigned to it, indicating the relative cost of misclassification of the instance. The default is a cost of 1.0; to assign a different cost, pass a cost_factor parameter with the desired value.

When using a ranking SVM, you may also pass a query_id parameter, whose integer value identifies the group of instances to which this instance belongs for ranking purposes.

Finally, a slack_id parameter may also be passed and it will become the slackid member of the underlying DOC C struct, used in an "OPTIMIZATION" SVM (type==4).
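
For example, a positive instance with a higher-than-default misclassification cost, grouped under query 1 for a ranking SVM (all values here are illustrative):

  $s->add_instance
    (attributes  => {foo => 1, baz => 2},
     label       => 1,
     cost_factor => 2.5,
     query_id    => 1);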

add_instance_i($label, $name, \@indices, \@values, $query_id=0, $slack_id=0, $cost_factor=1.0)

This is just like add_instance(), but bypasses all the string-to-integer mapping of feature names. Use this method when you already have your features represented as integers. The $label parameter must be a number (typically 1 or -1), and the @indices and @values arrays must be parallel arrays of indices and their corresponding values. Furthermore, the indices must be positive integers and given in strictly increasing order.

If you like add_instance_i(), I've got a predict_i() I bet you'll just love.

read_instances($file)

An alternative to calling add_instance_i() for each instance is to organize a collection of training data into SVMLight's standard "example_file" format, then call this read_instances() method to import the data. Under the hood, this calls SVMLight's read_documents() C function. When it's convenient for you to organize the data in this manner, you may see speed improvements.
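
Each line of such a file describes one instance, roughly in the form "<label> <index>:<value> <index>:<value> ..." with an optional trailing "#" comment (see the SVMLight documentation for the exact format). A minimal sketch, using a hypothetical file name:

  # train.dat might contain lines such as:
  #    1 1:0.43 3:0.12 9284:0.2  # positive example
  #   -1 1:0.10 7:0.9            # negative example
  $s->read_instances('train.dat');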

ranking_callback(\&function)

When using a ranking SVM, it is possible to customize the cost of ranking each pair of instances incorrectly by supplying a custom Perl callback function.

For two instances i and j, the custom function will receive four arguments: the rankvalue of instance i, the rankvalue of instance j, the costfactor of instance i, and the costfactor of instance j. It should return a real number indicating the cost of ranking the pair incorrectly.

By default, SVMLight uses an internal C function that assigns a cost equal to the average of the costfactors of the two instances.
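
For example, the following (purely illustrative) callback charges the larger of the two cost factors for each misranked pair:

  $s->ranking_callback(sub {
    my ($rankvalue_i, $rankvalue_j, $costfactor_i, $costfactor_j) = @_;
    # Use the larger of the two cost factors as the cost for this pair
    return $costfactor_i > $costfactor_j ? $costfactor_i : $costfactor_j;
  });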

train()

After a sufficient number of instances have been added to your model, call train() in order to actually learn the underlying discriminative Machine Learning model.

Depending on the number of instances (and to a lesser extent the total number of attributes), this method might take a while. If you want to train the model only once and save it for later re-use in a different context, see the write_model() and read_model() methods.
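
A minimal sketch of training and immediately saving the result for later re-use (the file name is arbitrary):

  $s->train;
  $s->write_model('my_model.dat') if $s->is_trained;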

is_trained()

Returns a boolean value indicating whether or not train() has been called on this model.

predict(attributes => \%y)

After train() has been called, the model may be applied to previously-unseen combinations of attributes. The predict() method accepts an attributes parameter just like add_instance(), and returns its best prediction of the label that would apply to the given attributes. The sign of the returned label (positive or negative) indicates whether the new instance is considered a positive or negative instance, and the magnitude of the label corresponds in some way to the confidence with which the model is making that assertion.
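
For example, to turn the raw prediction into a class decision by thresholding at zero, as described above:

  my $score = $s->predict(attributes => {bar => 3, blurp => 2});
  my $class = $score > 0 ? 'positive' : 'negative';
  print "Predicted $class (score $score)\n";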

predict_i(\@indices, \@values)

This is just like predict(), but bypasses all the string-to-integer mapping of feature names. See also add_instance_i().

write_model($file)

Saves the given trained model to the file $file. The model may later be re-loaded using the read_model() method. The model is written using SVMLight's write_model() C function, so it will be fully compatible with SVMLight command-line tools like svm_classify.

read_model($file)

Reads a model that has previously been written with write_model():

  my $m = Algorithm::SVMLight->new();
  $m->read_model($file);

The model file is read using SVMLight's read_model() C function, so if you want to, you could initially create the model with one of SVMLight's command-line tools like svm_learn.

get_linear_weights()

After training a linear model (or reading in a model file), this method will return a reference to an array containing the linear weights of the model. This can be useful for model inspection, to see which features are having the greatest impact on decision-making.

  my $arrayref = $m->get_linear_weights();

The first element (position 0) of the array will be the threshold b, and the rest of the elements will be the weights themselves. Thus, from position 1 upward, the array indices align with SVMLight's internal feature indices.

If the model has not yet been trained, or if the kernel type is not linear, an exception will be thrown.
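
For example, to print the threshold and any nonzero weights by internal feature index (this sketch works purely with indices, since the name-to-index mapping is handled internally):

  my $w = $m->get_linear_weights();
  printf "threshold b = %g\n", $w->[0];
  for my $i (1 .. $#$w) {
    printf "feature %d => %g\n", $i, $w->[$i] if $w->[$i];
  }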

feature_names()

Returns a list of feature names that have been fed to add_instance() as keys of the attributes parameter, or in a scalar context the number of such names.

num_features()

Returns the number of features known to this model. Note that if you use add_instance_i() or read_instances(), some feature indices may never actually have been seen: for example, you might add instances using only indices 2, 5, and 37, never adding any instances with the indices in between, yet num_features() will still return 37. This is because, after training, an instance could be passed to the predict() method with real values for those previously unseen features. If you only use add_instance(), you'll probably never run into this issue, and in a scalar context num_features() will look just like feature_names().

num_instances()

Returns the number of training instances known to the model. It should be fine to call this method either before or after training actually occurs.

SEE ALSO

Algorithm::NaiveBayes, AI::DecisionTree

http://svmlight.joachims.org/

AUTHOR

Ken Williams, <kwilliams@cpan.org>

COPYRIGHT AND LICENSE

The Algorithm::SVMLight Perl interface is copyright (C) 2005-2008 Thomson Legal & Regulatory, and written by Ken Williams. It is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

Thorsten Joachims and/or Cornell University of Ithaca, NY control the copyright of SVMLight itself - you will find full copyright and license information in its distribution. You are responsible for obtaining an appropriate license for SVMLight if you intend to use Algorithm::SVMLight. In particular, please note that SVMLight "is granted free of charge for research and education purposes. However you must obtain a license from the author to use it for commercial purposes."

To avoid any copyright clashes, the SVMLight.patch file distributed here is granted under the same license terms as SVMLight itself.