
NAME

AI::Calibrate - Perl module for producing probabilities from classifier scores

SYNOPSIS

  use AI::Calibrate ':all';
  ... train a classifier ...
  ... test classifier on $points ...
  $calibrated = calibrate($points);

DESCRIPTION

Classifiers usually return some sort of instance score with their classifications. These scores can be used as probabilities in various calculations, but first they need to be calibrated. Naive Bayes, for example, is a very useful classifier, but the scores it produces are usually "bunched" around 0 and 1, making these scores poor probability estimates. Support vector machines have a similar problem. Both classifier types should be calibrated before their scores are used as probability estimates.

This module calibrates classifier scores using a method called the Pool Adjacent Violators (PAV) algorithm. After you train a classifier, you take a (usually separate) set of test instances and run them through the classifier, collecting the scores assigned to each. You then supply this set of instances to the calibrate function defined here, and it will return a set of ranges mapping from a score range to a probability estimate.

For example, assume you have the following set of instance results from your classifier. Each result is of the form [ASSIGNED_SCORE, TRUE_CLASS]:

 my $points = [
              [.9, 1],
              [.8, 1],
              [.7, 0],
              [.6, 1],
              [.55, 1],
              [.5, 1],
              [.45, 0],
              [.4, 1],
              [.35, 1],
              [.3, 0 ],
              [.27, 1],
              [.2, 0 ],
              [.18, 0],
              [.1, 1 ],
              [.02, 0]
             ];

If you then call calibrate($points), it will return this structure:

 [
   [.9,    1 ],
   [.7,  3/4 ],
   [.45, 2/3 ],
   [.3,  1/2 ],
   [.2,  1/3 ],
   [.02,   0 ]
  ]

This means that, given a SCORE produced by the classifier, you can map the SCORE onto a probability like this:

               SCORE >= .9        prob = 1
         .9  > SCORE >= .7        prob = 3/4
         .7  > SCORE >= .45       prob = 2/3
         .45 > SCORE >= .3        prob = 1/2
         .3  > SCORE >= .2        prob = 1/3
         .2  > SCORE >= .02       prob = 0
         .02 > SCORE              prob = 0
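The lookup that this table describes can be sketched in a few lines of Perl. The helper name lookup_prob is hypothetical and is only meant to illustrate how the returned structure is read; the module's own score_prob function, described below, serves this purpose in practice. The sketch assumes that a score below every bound maps to 0.

```perl
use strict;
use warnings;

# The structure shown above, as returned by calibrate($points).
my $calibrated = [
    [ .9,  1   ],
    [ .7,  3/4 ],
    [ .45, 2/3 ],
    [ .3,  1/2 ],
    [ .2,  1/3 ],
    [ .02, 0   ],
];

# Hypothetical helper: walk the bounds from highest to lowest and
# return the probability of the first range the score falls into.
sub lookup_prob {
    my ($mapping, $score) = @_;
    for my $pair (@$mapping) {
        my ($bound, $prob) = @$pair;
        return $prob if $score >= $bound;
    }
    return 0;    # assumption: below every bound, the estimate is 0
}

# Print the estimate for a few sample scores.
printf "%.2f -> %.3f\n", $_, lookup_prob($calibrated, $_)
    for (.95, .6, .25, .01);
```

Because the bounds are in descending order, the first bound that the score meets or exceeds identifies its range.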

For a realistic example of classifier calibration, see the test file t/AI-Calibrate-NB.t, which uses the AI::NaiveBayes1 module to train a Naive Bayes classifier and then calibrates it using this module.

FUNCTIONS

calibrate

This is the main calibration function. The calling form is:

  my $calibrated = calibrate($data, $sorted);

$data looks like: [ [score, class], [score, class], [score, class]...] Each score is a number. Each class is either 0 (negative class) or 1 (positive class).

$sorted is a boolean (0 by default) indicating whether the data are already sorted by score. Unless this is set to 1, calibrate() will sort the data itself.
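If you prefer to check and sort the data yourself before calling calibrate, something like the following sketch might do. The helper name prepare_points is hypothetical and not part of this module. The sketch also assumes that "sorted" means descending order by score, which matches the order of the returned structure but is not stated explicitly for the input.

```perl
use strict;
use warnings;
use Scalar::Util 'looks_like_number';

# Hypothetical helper: check the shape of $data and return a copy
# of the list sorted by descending score (assumed ordering).
sub prepare_points {
    my ($data) = @_;
    for my $pair (@$data) {
        my ($score, $class) = @$pair;
        die "score is not a number\n" unless looks_like_number($score);
        die "class must be 0 or 1\n"
            unless defined $class && ($class eq '0' || $class eq '1');
    }
    return [ sort { $b->[0] <=> $a->[0] } @$data ];
}

my $sorted_points = prepare_points([ [.2, 0], [.9, 1], [.55, 1] ]);

# With AI::Calibrate loaded, the sorted data could then be passed as:
#   my $calibrated = calibrate($sorted_points, 1);
```

Leaving $sorted at its default is the safer choice whenever the order of the data is in doubt.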

calibrate returns a reference to an ordered list of references:

  [ [score, prob], [score, prob], [score, prob] ... ]

Scores will be in descending numerical order. See the DESCRIPTION section for how this structure is interpreted. You can pass this structure to the score_prob function, along with a new score, to get a probability.

score_prob

This is a simple utility function that takes the structure returned by calibrate, along with a new score, and returns the probability estimate. Example calling form:

  $p = score_prob($calibrated, $score);

Once you have a trained, calibrated classifier, you could imagine using it like this:

 $calibrated = calibrate( $calibration_set );
 print "Input instances, one per line:\n";
 while (<>) {
    chomp;
    my(@fields) = split;
    my $score = classifier(@fields);
    my $prob = score_prob($calibrated, $score);
    print "Estimated probability: $prob\n";
 }

print_mapping

This is a simple utility function that takes the structure returned by calibrate and prints out a simple list of lines describing the mapping created.

Example calling form:

  print_mapping($calibrated);

Sample output:

  1.00 > SCORE >= 1.00     prob = 1.000
  1.00 > SCORE >= 0.71     prob = 0.667
  0.71 > SCORE >= 0.39     prob = 0.000
  0.39 > SCORE >= 0.00     prob = 0.000

These ranges are not necessarily compressed/optimized, as this sample output shows.
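If a compact mapping is wanted, adjacent ranges that share a probability can be merged after the fact. The following is a minimal sketch, not part of this module: the helper name compress_mapping is hypothetical, and the input structure is an illustrative guess loosely modeled on the sample output above. It assumes the entries are in descending order of score and compares probabilities for exact equality.

```perl
use strict;
use warnings;

# Hypothetical helper: within a run of adjacent entries that share one
# probability, keep only the entry with the lowest bound.
sub compress_mapping {
    my ($mapping) = @_;
    my @out;
    for my $pair (@$mapping) {
        if (@out && $out[-1][1] == $pair->[1]) {
            $out[-1] = [@$pair];    # same prob: extend the range downward
        }
        else {
            push @out, [@$pair];
        }
    }
    return \@out;
}

# Illustrative input with two adjacent entries of probability 0.
my $mapping = [ [1.00, 1], [0.71, 2/3], [0.39, 0], [0.00, 0] ];
my $compact = compress_mapping($mapping);
print join(", ", map { "[$_->[0], $_->[1]]" } @$compact), "\n";
```

Each score still falls into a range with the same probability as before, so lookups against the compact mapping give the same estimates.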

DETAILS

The PAV algorithm is conceptually straightforward. Given a set of training cases ordered by the scores assigned by the classifier, it first assigns a probability of one to each positive instance and a probability of zero to each negative instance, and puts each instance in its own group. It then looks, at each iteration, for adjacent violators: adjacent groups whose probabilities locally increase rather than decrease. When it finds such groups, it pools them and replaces their probability estimates with the average of the group's values. It continues this process of averaging and replacement until the entire sequence is monotonically decreasing. The result is a sequence of instances, each of which has a score and an associated probability estimate, which can then be used to map scores into probability estimates.
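The pooling process described above can be sketched as follows. This is an illustrative version written for this explanation, not the code used inside the module, and the name pav_sketch is hypothetical. It pools each new group into its predecessor for as long as the pair violates the decreasing order, which yields the same final groups as repeated passes over the sequence.

```perl
use strict;
use warnings;

# Sketch of PAV over [score, class] pairs; not the module's own code.
# Returns one [score, prob] pair per input instance, in descending
# order of score, with probabilities that never increase.
sub pav_sketch {
    my ($data) = @_;
    my @sorted = sort { $b->[0] <=> $a->[0] } @$data;

    # Each group records the sum of its class labels, its size, and
    # the scores of the instances pooled into it.
    my @groups;
    for my $pair (@sorted) {
        my ($score, $class) = @$pair;
        push @groups, { sum => $class, n => 1, scores => [$score] };

        # Pool while the newest group's average exceeds that of the
        # group before it (an adjacent violation). The averages are
        # compared by cross-multiplying to avoid rounding problems.
        while (@groups > 1) {
            my ($prev, $cur) = @groups[-2, -1];
            last if $cur->{sum} * $prev->{n} <= $prev->{sum} * $cur->{n};
            pop @groups;
            $prev->{sum} += $cur->{sum};
            $prev->{n}   += $cur->{n};
            push @{ $prev->{scores} }, @{ $cur->{scores} };
        }
    }

    # Expand the groups back into per-instance estimates.
    return [
        map {
            my $g    = $_;
            my $prob = $g->{sum} / $g->{n};
            map { [ $_, $prob ] } @{ $g->{scores} };
        } @groups
    ];
}

# Small usage example: the 0 at .7 is pooled with the 1 at .6.
my $estimates = pav_sketch([ [.9, 1], [.7, 0], [.6, 1], [.2, 0] ]);
printf "%.2f  %.3f\n", @$_ for @$estimates;
```

In the small example, the instances scored .7 and .6 end up in one group with an estimate of 1/2, while the other two instances keep their initial estimates of 1 and 0.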

For further information on the PAV algorithm, you can read the section in my paper referenced below.

EXPORT

This module exports three functions: calibrate, score_prob and print_mapping.

BUGS

None known. This implementation is straightforward but inefficient (its running time is O(n^2) in the length of the data series). A linear time algorithm is known, and in a later version of this module I'll probably implement it.

SEE ALSO

The AI::NaiveBayes1 Perl module.

My paper "PAV and the ROC Convex Hull" has a good discussion of the PAV algorithm, including examples: http://home.comcast.net/~tom.fawcett/public_html/papers/PAV-ROCCH-dist.pdf

If you want to read more about the general issue of classifier calibration, here are some good papers, which are freely available on the web:

"Transforming classifier scores into accurate multiclass probability estimates" by Bianca Zadrozny and Charles Elkan

"Predicting Good Probabilities With Supervised Learning" by A. Niculescu-Mizil and R. Caruana

AUTHOR

Tom Fawcett, <tom.fawcett@gmail.com>

COPYRIGHT AND LICENSE

Copyright (C) 2008-2012 by Tom Fawcett

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.8 or, at your option, any later version of Perl 5 you may have available.