Text::AI::CRM114 - Perl interface for CRM114
    use feature 'say';
    use Text::AI::CRM114;

    my $db = Text::AI::CRM114->new(
        classes => ["Alice", "Macbeth"]
    );
    $db->learn("Alice", "Alice was beginning to ...");
    $db->learn("Macbeth", "When shall we three meet again ...");

    # classify() returns a list; $ret[0] is the error code, $ret[1] the best class
    my @ret = $db->classify("The Mole had been working very hard all the morning ...");
    say "Best classification is $ret[1]" if $ret[0] == Text::AI::CRM114::OK;
This module provides a simple Perl interface to
libcrm114, a library that implements several text classification algorithms.
libcrm114 uses several constants as status return values and to set the classification algorithm of a new datablock.
These constants are accessible in this module's namespace, for example
Creates a new instance. Options and their default values are:
sets the classification algorithm, recommended values are
libcrm114 includes some more algorithms (SVM, PCA, FSCM) which may or may not be production ready.
the intended memory size for learned data.
Note that this parameter has no immediate effect!
libcrm114 always creates its data structure with a default size (depending on the algorithm, 8M for OSB); this parameter is only used for some algorithms that might grow their dataset after learning many items.
a list of classes passed by reference.
Creates a new instance by reading a previously saved CRM114 DB from
Returns a hash reference to the DB's classes. This hash associates the class names (keys) with the internal integer index (values).
Writes the DB into a (binary) file.
Learn some text of a given class.
Classify the text.
The normal mode (without the optional
$verbatim flag) adjusts the return values to be useful with two classes (e.g. spam/ham).
If the $verbatim flag is true, then the values are passed unchanged as they come from
libcrm114. See section "Classify Verbatim" below for more details and an example.
Returns a list of five scalar values:
A numeric error code, should be
The name of the best matching class.
The success probability. Normally the probability of the matching class (with 0.5 <= $prob <= 1)
The logarithmic probability ratio, i.e.
log10($prob) - log10(1-$prob) (the theoretical range is 0 <= $pR <= 340, limited by floating point precision; in practice p = 0.99 yields a pR of about 2, so high values are rather unusual).
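As a rough illustration of the formula above, the following sketch reproduces the arithmetic in plain Perl. The helper name log_prob_ratio is hypothetical; it is not part of this module or of libcrm114.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Text::AI::CRM114): computes
# log10($prob) - log10(1 - $prob) for a probability strictly between 0 and 1.
sub log_prob_ratio {
    my ($prob) = @_;
    die "probability must be between 0 and 1 (exclusive)\n"
        unless $prob > 0 && $prob < 1;
    # Perl's built-in log() is the natural logarithm, so divide by log(10).
    return (log($prob) - log(1 - $prob)) / log(10);
}

printf "p = %.3f gives pR = %.3f\n", $_, log_prob_ratio($_) for (0.5, 0.9, 0.99);
```

A probability of 0.5 gives a ratio of 0, and the ratio grows slowly as the probability approaches 1, which is why large pR values are rare.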
The following example shows the effect of the
$verbatim flag to
    my $db = Text::AI::CRM114->new( classes => ["Macbeth", "Alice"] );
    $db->learn("Macbeth", SampleText::Macbeth());
    $db->learn("Alice", SampleText::Alice());

    my @ret = $db->classify(SampleText::Willows_frag(), 1);
    printf "verbatim mode: err %d, class %s, prob %.3f, pR %.3f\n", @ret;
    @ret = $db->classify(SampleText::Willows_frag());
    printf "normal mode: err %d, class %s, prob %.3f, pR %.3f\n", @ret;

    verbatim mode: err 0, class Alice, prob 0.103, pR -0.938
    normal mode: err 0, class Alice, prob 0.897, pR 0.938
The background here is that
libcrm114 may use many classes, but on top of that all classes have "success" and "failure" bits (as a meta-category if you will). By default the first class indicates "success" and all other classes are "failures".
This is important because the probability and the probability ratio (pR) of a result are not given relative to a single class, but relative to the "success" meta-category. So in the verbatim mode the probability of 0.103 is not the probability of the best class p(Alice); it is the probability of "success" p(success). If the first class is the best classification then these numbers are the same (because the class and the meta-category align); they differ only if another class is found to be the best match.
In order to simplify the expected most common usage with two classes, this module inverts the values as needed.
The $verbatim flag just provides access to the original values for those who need them. If you use more than two classes then you should look into
libcrm114 for the exact meaning of the result values, and you might want to add accessor methods to set the "success"/"failure" flags for single classes.
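The adjustment described above can be sketched in plain Perl as follows. This is an illustration under assumptions, not the module's actual code: the helper name normalize_result is hypothetical, and it assumes that only the first class carries the "success" flag, so that the inversion amounts to 1 - p and a sign flip of pR.

```perl
use strict;
use warnings;

# Hypothetical helper: $first_class is the class with the "success" flag.
sub normalize_result {
    my ($first_class, $err, $class, $prob, $pr) = @_;
    # If the best match is the "success" class, the values already refer to it.
    return ($err, $class, $prob, $pr) if $class eq $first_class;
    # Otherwise turn p(success) into p(failure) and flip the sign of pR.
    return ($err, $class, 1 - $prob, -$pr);
}

# Verbatim values taken from the example output above.
my @normal = normalize_result("Macbeth", 0, "Alice", 0.103, -0.938);
printf "normal mode: err %d, class %s, prob %.3f, pR %.3f\n", @normal;
```

With two classes, p(failure) is the probability of the second class. With more classes it is the combined probability of all "failure" classes, so this simple inversion no longer yields the probability of the best class alone.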
This is my first attempt to write a Perl module, so all hints and improvements are appreciated.
I wonder if we should ensure Text::AI::CRM114::OK maps to 0, as this makes the caller's return value checking easier. Currently this is trivial because it already is 0 in
libcrm114. If that should change we would have to insert a rewrite into every XS call to a C function (ugly, but maybe worth it).
I am still not sure if the C memory management works correctly.
Another issue is Unicode support, which is missing in
libcrm114, so it might be a good thing to convert Unicode strings into some 8-bit encoding. As long as no string contains \0 values nothing bad[tm] will happen, but I assume that Unicode strings will internally cause wrong tokenization (this should be checked in
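One possible way to do such a conversion on the caller's side is sketched below, using the core Encode module. The helper name to_octets is hypothetical, and the choice of UTF-8 is an assumption (a single-byte encoding such as ISO-8859-1 could be substituted); whether libcrm114 tokenizes the resulting byte sequences sensibly is not verified here.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: returns a byte string that can be handed to a C library.
sub to_octets {
    my ($text) = @_;
    # Encode the character string into UTF-8 octets (no wide characters remain).
    my $octets = encode('UTF-8', $text);
    # C strings end at the first NUL byte, so refuse embedded NULs.
    die "text contains a NUL byte\n" if index($octets, "\0") >= 0;
    return $octets;
}

my $bytes = to_octets("na\x{ef}ve caf\x{e9} \x{263a}");
printf "%d bytes\n", length $bytes;
```

The returned string could then be passed to learn or classify in place of the original text.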
CRM114 homepage: http://crm114.sourceforge.net/
Martin Schuette, <firstname.lastname@example.org>
Perl module: Copyright (C) 2012 by Martin Schuette
libcrm114: Copyright (C) 2009-2010 by William S. Yerazunis
This library is free software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License version 3.