NAME

AI::NaiveBayes - A Bayesian classifier

VERSION

version 0.04

SYNOPSIS

    # AI::NaiveBayes objects are created by AI::NaiveBayes::Learner
    # but for quick start you can use the 'train' class method
    # that is a shortcut using default AI::NaiveBayes::Learner settings

    use AI::NaiveBayes;

    my $classifier = AI::NaiveBayes->train(
        {
            attributes => {
                sheep => 1, very => 1,  valuable => 1, farming => 1
            },
            labels => ['farming']
        },
        {
            attributes => {
                vampires => 1, cannot => 1, see => 1, their => 1,
                images => 1, mirrors => 1
            },
            labels => ['vampire']
        },
    );

    # Classify a feature vector
    my $result = $classifier->classify({bar => 3, blurp => 2});

    # $result is now an AI::NaiveBayes::Classification object

    my $best_category = $result->best_category;

DESCRIPTION

This module implements the classic "Naive Bayes" machine learning algorithm. It is a low-level class that accepts only pre-computed feature vectors as input; see AI::Classifier::Text for a text classifier built on this class.

AI::NaiveBayes classifier objects are created from training data by AI::NaiveBayes::Learner. For a quick start, you can use the limited train class method, which trains the classifier with the default settings.

The classifier object is immutable.

Naive Bayes is a well-studied probabilistic algorithm often used in automatic text categorization. Compared to other algorithms (kNN, SVM, Decision Trees), it is fast and reasonably competitive in the quality of its results.

A paper by Fabrizio Sebastiani provides a really good introduction to text categorization: http://faure.iei.pi.cnr.it/~fabrizio/Publications/ACMCS02.pdf

METHODS

new( model => $model )

Internal. See AI::NaiveBayes::Learner to learn how to create an AI::NaiveBayes classifier from training data.

train( LIST of HASHREFS )

Shortcut for creating a trained classifier using AI::NaiveBayes::Learner default settings. Arguments are passed to the add_example method of the AI::NaiveBayes::Learner object one by one.

classify( HASHREF )

Classifies a feature-vector of the form:

    { feature1 => weight1, feature2 => weight2, ... }

The result is an AI::NaiveBayes::Classification object.
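
Since classify accepts only pre-computed feature vectors, raw text has to be converted first. The following is a minimal sketch of one way to do that by counting words. The helper name word_counts is hypothetical and is not part of AI::NaiveBayes; the tokenization (lowercasing, splitting on non-word characters) is an assumption made for illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of AI::NaiveBayes): turn raw text into a
# { word => count } hashref of the shape that classify() expects.
sub word_counts {
    my ($text) = @_;
    my %counts;
    # Lowercase the text and treat every run of word characters as a feature.
    $counts{$_}++ for lc($text) =~ /(\w+)/g;
    return \%counts;
}

my $features = word_counts('Sheep are very valuable; sheep farming pays.');
# $features could then be passed along as: $classifier->classify($features)
```

For real text classification, AI::Classifier::Text is the module the description above points to.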

rescale

Internal.

ATTRIBUTES

model: Internal

THEORY

Bayes' Theorem is a way of inverting a conditional probability. It states:

              P(y|x) P(x)
    P(x|y) = -------------
                 P(y)

The notation P(x|y) means "the probability of x given y." See also http://mathforum.org/dr.math/problems/battisfore.03.22.99.html for a simple but complete example of Bayes' Theorem.
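
As a small worked example of the inversion, the sketch below takes x to be "the document is about farming" and y to be "the document contains the word sheep". All three input probabilities are made-up values chosen for illustration, not measurements.

```perl
use strict;
use warnings;

# Illustrative numbers only (assumed, not measured):
my $p_x             = 0.2;   # P(x): a document is about farming
my $p_y_given_x     = 0.5;   # P(y|x): a farming document contains "sheep"
my $p_y_given_not_x = 0.05;  # P(y|not x): any other document contains "sheep"

# The law of total probability gives P(y).
my $p_y = $p_y_given_x * $p_x + $p_y_given_not_x * (1 - $p_x);

# Bayes' Theorem: P(x|y) = P(y|x) P(x) / P(y)
my $p_x_given_y = $p_y_given_x * $p_x / $p_y;

printf "P(farming | sheep) = %.3f\n", $p_x_given_y;   # prints 0.714
```

Even though only 20% of the documents are assumed to be about farming, seeing "sheep" raises that probability to roughly 71%.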

In this case, we want to know the probability of a given category given a certain string of words in a document, so we have:

                     P(words | cat) P(cat)
    P(cat | words) = ---------------------
                           P(words)

We have applied Bayes' Theorem because P(cat | words) is a difficult quantity to compute directly, but P(words | cat) and P(cat) are accessible (see below).

The greater the expression above, the greater the probability that the given document belongs to the given category. So we want to find the maximum value. We write this as

                                  P(words | cat) P(cat)
    Best category =   ArgMax      ---------------------
                    cat in cats         P(words)

Since P(words) doesn't change over the range of categories, we can get rid of it. That's good, because we didn't want to have to compute these values anyway. So our new formula is:

    Best category =   ArgMax      P(words | cat) P(cat)
                    cat in cats
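
The claim that P(words) can be dropped is easy to check numerically. The sketch below uses made-up values of P(words | cat) P(cat) for three hypothetical categories and shows that dividing by P(words) rescales every score by the same factor, so the winning category is unchanged. The argmax helper is defined here for illustration only.

```perl
use strict;
use warnings;

# Illustrative values of P(words | cat) * P(cat) for one document (assumed).
my %joint = (farming => 0.012, vampire => 0.003, cooking => 0.005);

# P(words) is the sum of the joint probabilities over all categories.
my $p_words = 0;
$p_words += $_ for values %joint;

# Posterior P(cat | words) for each category.
my %posterior = map { $_ => $joint{$_} / $p_words } keys %joint;

# Return the key with the largest value in a hashref.
sub argmax {
    my ($scores) = @_;
    my ($best) = sort { $scores->{$b} <=> $scores->{$a} } keys %$scores;
    return $best;
}

my $best_joint     = argmax(\%joint);
my $best_posterior = argmax(\%posterior);
print "$best_joint $best_posterior\n";   # prints "farming farming"
```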

Finally, we note that if w1, w2, ... wn are the words in the document, then this expression is equivalent to:

    Best category =   ArgMax      P(w1|cat)*P(w2|cat)*...*P(wn|cat)*P(cat)
                    cat in cats

That's the formula I use in my document categorization code. The last step is the only non-rigorous one in the derivation, and this is the "naive" part of the Naive Bayes technique. It assumes that the probability of each word appearing in a document is unaffected by the presence or absence of each other word in the document. We assume this even though we know this isn't true: for example, the word "iodized" is far more likely to appear in a document that contains the word "salt" than it is to appear in a document that contains the word "subroutine". Luckily, as it turns out, making this assumption even when it isn't true may have little effect on our results, as the following paper by Pedro Domingos argues: http://www.cs.washington.edu/homes/pedrod/mlj97.ps.gz
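
The final formula can be sketched in a few lines of plain Perl. This is an illustration of the technique under stated assumptions, not the code AI::NaiveBayes actually runs: it sums logarithms instead of multiplying probabilities (the argmax is the same, and tiny products do not underflow), it treats feature weights as word counts, and it uses add-one (Laplace) smoothing so that unseen words do not yield log(0). The variable names and the standalone best_category function are hypothetical.

```perl
use strict;
use warnings;

# Toy training data in the same shape as the SYNOPSIS examples.
my @examples = (
    { attributes => { sheep => 1, very => 1, valuable => 1, farming => 1 },
      labels     => ['farming'] },
    { attributes => { vampires => 1, cannot => 1, see => 1, their => 1,
                      images => 1, mirrors => 1 },
      labels     => ['vampire'] },
);

# Collect per-category word totals, document counts, and the vocabulary.
my (%word_count, %total_words, %doc_count, %vocab);
for my $ex (@examples) {
    for my $cat (@{ $ex->{labels} }) {
        $doc_count{$cat}++;
        for my $word (keys %{ $ex->{attributes} }) {
            my $n = $ex->{attributes}{$word};
            $word_count{$cat}{$word} += $n;
            $total_words{$cat}       += $n;
            $vocab{$word} = 1;
        }
    }
}

# Score = log P(cat) + sum over words of weight * log P(word | cat).
# Add-one smoothing is an assumption made for this sketch.
sub best_category {
    my ($features) = @_;
    my $n_docs = 0;
    $n_docs += $_ for values %doc_count;
    my $n_vocab = keys %vocab;
    my ($best, $best_score);
    for my $cat (sort keys %doc_count) {
        my $score = log($doc_count{$cat} / $n_docs);
        for my $word (keys %$features) {
            my $count = $word_count{$cat}{$word} // 0;
            $score += $features->{$word}
                    * log(($count + 1) / ($total_words{$cat} + $n_vocab));
        }
        ($best, $best_score) = ($cat, $score)
            if !defined $best_score || $score > $best_score;
    }
    return $best;
}

print best_category({ sheep => 2, farming => 1 }), "\n";   # prints "farming"
```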

BASED ON

Much of the code and description is from Algorithm::NaiveBayes.

AUTHORS

Zbigniew Lukasiak <zlukasiak@opera.com>
Tadeusz Sośnierz <tsosnierz@opera.com>
Ken Williams <ken@mathforum.org>

COPYRIGHT AND LICENSE

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.
