Text::NSP::Measures - Perl modules for computing association scores of Ngrams. This module provides the basic framework for these measures.
use Text::NSP::Measures::2D::MI::ll;

my $npp = 60; my $n1p = 20; my $np1 = 20; my $n11 = 10;

my $ll_value = calculateStatistic( n11=>$n11,
                                   n1p=>$n1p,
                                   np1=>$np1,
                                   npp=>$npp);

if( (my $errorCode = getErrorCode()))
{
  print STDERR $errorCode." - ".getErrorMessage()."\n";
}
else
{
  print getStatisticName()." value for bigram is ".$ll_value."\n";
}
These modules provide Perl implementations of mathematical functions (association measures) that can be used to interpret the co-occurrence frequency data for Ngrams. We define an Ngram as a sequence of 'n' tokens that occur within a window of at least 'n' tokens in the text; what constitutes a "token" can be defined by the user.
The measures that have been implemented in this distribution are:
Further discussion of these measures can be found in their respective documentation.
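As a concrete illustration of what an association measure computes, the following is a minimal sketch of the standard log-likelihood ratio (G^2 = 2 * sum of n_ij * log(n_ij / m_ij)) for a bigram, written in plain Perl so that it runs without Text::NSP installed. The function name log_likelihood is hypothetical and is not part of this distribution; that its result matches the output of Text::NSP::Measures::2D::MI::ll exactly is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper (NOT part of Text::NSP): log-likelihood ratio for a
# 2x2 contingency table, given the joint count and the marginal totals.
sub log_likelihood {
    my %v = @_;
    my ($n11, $n1p, $np1, $npp) = @v{qw(n11 n1p np1 npp)};

    # Remaining marginal totals.
    my $n2p = $npp - $n1p;
    my $np2 = $npp - $np1;

    # Observed cells n11, n12, n21, n22 and expected cells m11, m12, m21, m22.
    my @observed = ($n11, $n1p - $n11, $np1 - $n11, $npp - $n1p - $np1 + $n11);
    my @expected = ($n1p * $np1 / $npp, $n1p * $np2 / $npp,
                    $n2p * $np1 / $npp, $n2p * $np2 / $npp);

    my $sum = 0;
    for my $i (0 .. 3) {
        next if $observed[$i] == 0;    # a zero cell contributes nothing
        $sum += $observed[$i] * log($observed[$i] / $expected[$i]);
    }
    return 2 * $sum;
}

# Same counts as in the synopsis above.
printf "log-likelihood = %.4f\n",
    log_likelihood(n11 => 10, n1p => 20, np1 => 20, npp => 60);
```

A score of 0 means the observed counts equal the counts expected under independence; larger scores indicate a stronger association between the two words.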
This module also provides a basic framework for building new measures of association for Ngrams. The new Measure should either inherit from Text::NSP::Measures::2D or Text::NSP::Measures::3D modules, depending on whether it is a bigram or a trigram measure. Both these modules implement methods that retrieve observed frequency counts, marginal totals, and also compute expected values. They also provide error checks for these counts.
You can either write your new measure as a new module, or you can simply write a Perl program. Here we describe how to write a new measure as a Perl module.
h2xs -AXc -n Text::NSP::Measures::2D::NewMeasure (for bigram measures)
or
h2xs -AXc -n Text::NSP::Measures::3D::NewMeasure (for trigram measures)
This will create a new folder namely...
Text-NSP-Measures-2D-NewMeasure (for bigram)
or
Text-NSP-Measures-3D-NewMeasure (for trigram)
This will create an empty framework for the new association measure. Once you are done completing the changes you will have to install the module before you can use it.
To make changes to the module open:
Text-NSP-Measures-2D-NewMeasure/lib/Text/NSP/Measures/2D/NewMeasure/NewMeasure.pm
or
Text-NSP-Measures-3D-NewMeasure/lib/Text/NSP/Measures/3D/NewMeasure/NewMeasure.pm
in your favorite text editor, and do as follows.
package Text::NSP::Measures::2D::NewMeasure; (for bigram measures)
or
package Text::NSP::Measures::3D::NewMeasure; (for trigram measures)
To inherit the functionality from the 2D or 3D module you need to include it in your NewMeasure.pm module.
A small code snippet to ensure that it is included is as follows:
use Text::NSP::Measures::2D; (for bigram measures)
or
use Text::NSP::Measures::3D; (for trigram measures)
You also need to insert the following lines to make sure that the required functions are visible to the programs using your module. These lines are the same for bigrams and trigrams. The "no warnings 'redefine';" statement is used to suppress Perl warnings about method overriding.
use strict;
use Carp;
use warnings;
no warnings 'redefine';
require Exporter;
our ($VERSION, @EXPORT, @ISA);
@ISA = qw(Exporter);
@EXPORT = qw(initializeStatistic calculateStatistic getErrorCode getErrorMessage getStatisticName);
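To see what the @ISA and @EXPORT lines achieve, here is a minimal single-file sketch of the Exporter mechanism. The package Hypothetical::Measure and its "relative frequency" statistic are made up for illustration and are not part of Text::NSP; in a real module file the import step is performed by the "use" statement.

```perl
use strict;
use warnings;

# Hypothetical measure package, defined inline so the sketch runs as one file.
package Hypothetical::Measure {
    require Exporter;
    our @ISA    = qw(Exporter);
    our @EXPORT = qw(calculateStatistic getStatisticName);

    # Toy statistic: joint frequency divided by the total number of bigrams.
    sub calculateStatistic {
        my %values = @_;
        return $values{n11} / $values{npp};
    }

    sub getStatisticName { return 'Relative frequency'; }
}

package main;

# "use Hypothetical::Measure;" would call import() automatically for a
# module stored in its own file; here it is called by hand.
Hypothetical::Measure->import;

# The exported functions can now be called without a package prefix.
printf "%s = %.4f\n", getStatisticName(),
    calculateStatistic(n11 => 10, npp => 60);
```

Any function left out of @EXPORT would still be callable, but only through its fully qualified name.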
The calculateStatistic() method is passed the frequency values for an Ngram, as found in the input Ngram file, in the form of a hash.
calculateStatistic() is expected to return a (possibly floating point) value as the value of the statistical measure calculated using the frequency values passed to it.
The modules Text::NSP::Measures::2D and Text::NSP::Measures::3D provide three methods to help calculate the Ngram statistic: computeMarginalTotals(), computeObservedValues() and computeExpectedValues().
These methods return the observed and expected values of the cells in the contingency table. A 2D contingency table looks like:
          |word2  | not-word2|
          --------------------
    word1 | n11   |   n12    |  n1p
not-word1 | n21   |   n22    |  n2p
          --------------------
            np1       np2       npp
Here the marginal totals are np1, n1p, np2 and n2p, and the observed values are n11, n12, n21 and n22. The expected values for the corresponding observed values are represented by m11, m12, m21 and m22, where m11 is the expected value for the cell (1,1), m12 for the cell (1,2), and so on.
Before calling either computeObservedValues() or computeExpectedValues() you MUST call computeMarginalTotals(), since these methods require the marginal totals to be set. The computeMarginalTotals() method computes the marginal totals in the contingency table based on the observed frequencies. It returns an undefined value in case of an error and 1 on success. An example of usage for the computeMarginalTotals() method is:
my %values = @_;
if(!(Text::NSP::Measures::2D::computeMarginalTotals(\%values)) ){ return; }
@_ holds the parameters passed to calculateStatistic(). After this call the marginal totals will be available in the following variables: $npp, $n1p, $np1, $n2p and $np2.
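The arithmetic behind this step can be sketched in plain Perl as follows. The function marginal_totals is a hypothetical stand-in and not the module's code: it returns a hash reference instead of setting package variables, and the consistency checks it performs are assumptions about what a reasonable implementation would verify.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the marginal-total step (NOT the module's code).
# Takes a hash reference with n11, n1p, np1 and npp; returns a hash reference
# with all marginal totals, or undef if the counts are inconsistent.
sub marginal_totals {
    my ($values) = @_;
    my ($n11, $n1p, $np1, $npp) = @{$values}{qw(n11 n1p np1 npp)};

    # Every count must be present and non-negative.
    for my $count ($n11, $n1p, $np1, $npp) {
        return unless defined $count && $count >= 0;
    }

    # A marginal total cannot exceed the grand total, and the joint
    # count cannot exceed either of its marginal totals.
    return if $n1p > $npp || $np1 > $npp;
    return if $n11 > $n1p || $n11 > $np1;

    return {
        n1p => $n1p,
        np1 => $np1,
        n2p => $npp - $n1p,    # bigrams whose first word is not word1
        np2 => $npp - $np1,    # bigrams whose second word is not word2
        npp => $npp,
    };
}

my $totals = marginal_totals({ n11 => 10, n1p => 20, np1 => 20, npp => 60 });
print join(', ', map { "$_=$totals->{$_}" } sort keys %$totals), "\n";
```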
computeObservedValues() computes the observed values of an Ngram. It can be called using the following code snippet. Please remember that you should call computeMarginalTotals() before calling computeObservedValues().
if( !(Text::NSP::Measures::2D::computeObservedValues(\%values)) ) { return; }
%values is the same hash that was initialized earlier for computeMarginalTotals().
If successful it returns 1; otherwise an undefined value is returned. The computed observed values will be available in the following variables: $n11, $n12, $n21 and $n22.
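How the remaining cells follow from the joint count and the marginal totals can be sketched as below. The function observed_values is hypothetical and only mirrors the arithmetic; unlike the module's method it returns the cells in a hash reference rather than setting variables.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the observed-value step (NOT the module's code).
# Derives the four cells of the 2x2 table from n11, n1p, np1 and npp.
sub observed_values {
    my ($values) = @_;
    my ($n11, $n1p, $np1, $npp) = @{$values}{qw(n11 n1p np1 npp)};

    my %cell = (
        n11 => $n11,
        n12 => $n1p - $n11,                  # word1 followed by another word
        n21 => $np1 - $n11,                  # another word followed by word2
        n22 => $npp - $n1p - $np1 + $n11,    # neither word1 nor word2
    );

    # A negative cell means the input counts contradict each other.
    for my $count (values %cell) {
        return if $count < 0;
    }
    return \%cell;
}

my $cells = observed_values({ n11 => 10, n1p => 20, np1 => 20, npp => 60 });
print join(', ', map { "$_=$cells->{$_}" } sort keys %$cells), "\n";
```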
Similarly, computeExpectedValues() computes the expected values for each of the cells in the contingency table. You should call computeMarginalTotals() before calling computeExpectedValues(). The following code snippet demonstrates its usage.
if( !(Text::NSP::Measures::2D::computeExpectedValues()) ) { return; }
If successful it returns 1; otherwise an undefined value is returned. The computed expected values will be available in the following variables: $m11, $m12, $m21 and $m22.
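Under the usual independence assumption, the expected value of a cell is the product of its row and column marginal totals divided by the grand total. A minimal sketch of that calculation follows; the function expected_values is hypothetical, and that the module computes its values with exactly this formula is an assumption.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the expected-value step (NOT the module's code).
# Takes a hash reference holding the marginal totals n1p, n2p, np1, np2 and
# the grand total npp; returns the expected cell values m11 .. m22.
sub expected_values {
    my ($t) = @_;
    return unless $t->{npp};    # avoid division by zero

    return {
        m11 => $t->{n1p} * $t->{np1} / $t->{npp},
        m12 => $t->{n1p} * $t->{np2} / $t->{npp},
        m21 => $t->{n2p} * $t->{np1} / $t->{npp},
        m22 => $t->{n2p} * $t->{np2} / $t->{npp},
    };
}

my $expected = expected_values(
    { n1p => 20, n2p => 40, np1 => 20, np2 => 40, npp => 60 });
printf "%s=%.3f\n", $_, $expected->{$_} for sort keys %$expected;
```

The four expected values always add up to npp, which is a quick sanity check for a new measure.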
1;
__END__
Note that you can add documentation after these lines.
i) initializeStatistic()
ii) getErrorCode()
iii) getErrorMessage()
iv) getStatisticName()
statistic.pl calls initializeStatistic() before calling any other method. If there is no need for any specific initialization in the measure you need not define this method, and the initialization will be handled by the Text::NSP::Measures module's initializeStatistic() method.
The getErrorCode() method is called immediately after every call to calculateStatistic(). This method is used to return the error code, if any, from the previous operations. To view all the possible error codes and the corresponding error messages please refer to the Text::NSP documentation (perldoc Text::NSP). You can create new error codes in your measure if the existing error codes are not sufficient.
The Text::NSP::Measures module implements both the getErrorCode() and getErrorMessage() methods, and these implementations will be invoked if the user does not define these methods. If you want to add other actions that need to be performed in case of an error, you must override these methods by implementing them in your module. You can invoke the Text::NSP::Measures getErrorCode() method from your measure's getErrorCode() method.
An example of this is below:
sub getErrorCode
{
  my $code = Text::NSP::Measures::getErrorCode();

  #your code here

  return $code; #(or any other value)
}

sub getErrorMessage
{
  my $message = Text::NSP::Measures::getErrorMessage();

  #your code here

  return $message; #(or any other value)
}
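The same delegation pattern can be tried out without Text::NSP installed. In the sketch below both packages, the error code 200 and the logging behaviour are hypothetical; Hypothetical::Base merely plays the role of Text::NSP::Measures.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the base module that stores the error state.
package Hypothetical::Base {
    our $errorCode = 0;
    sub getErrorCode { return $errorCode; }
}

# Hypothetical measure that overrides getErrorCode() to add logging.
package Hypothetical::Measure {
    our @log;

    sub getErrorCode {
        # Delegate to the base implementation first ...
        my $code = Hypothetical::Base::getErrorCode();

        # ... then perform the extra action, here recording the error.
        push @log, "error $code seen" if $code;

        return $code;
    }
}

package main;

$Hypothetical::Base::errorCode = 200;    # made-up error code
print "code: ", Hypothetical::Measure::getErrorCode(), "\n";
print "log:  @Hypothetical::Measure::log\n";
```

Calling the base function by its fully qualified name keeps the original error state untouched while letting the measure react to it.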
The fourth method that may be implemented is getStatisticName(). If this method is implemented, it is expected to return a string containing the name of the statistic being implemented. This string is used in the formatted output of statistic.pl. If this method is not implemented, then the statistic name entered on the command line is used in the formatted output.
Note that all the methods described in this section are optional. So, if the user elects to not implement these methods, no harm will be done.
The user may implement other methods too, but since statistic.pl is not expecting anything besides the five methods above, doing so would have no effect on statistic.pl.
Change to the base directory for the module, i.e. NewMeasure. Then issue the following commands:
perl Makefile.PL
make
make test
make install
or
perl Makefile.PL PREFIX=<destination directory>
make
make test
make install
If you get any errors in the installation process, please make sure that you have not made any syntactical error in your code and also make sure that you have already installed the Text-NSP package.
To tie it all together here is an example of a measure that computes the sum of ngram frequency counts.
package Text::NSP::Measures::2D::sum;
use Text::NSP::Measures::2D;
use strict;
use Carp;
use warnings;
no warnings 'redefine';
require Exporter;
our ($VERSION, @EXPORT, @ISA);
@ISA = qw(Exporter);
@EXPORT = qw(initializeStatistic calculateStatistic getErrorCode getErrorMessage getStatisticName);
$VERSION = '0.01';
sub calculateStatistic
{
  my %values = @_;

  # computes the marginal totals from the frequency
  # combination values. returns undef if there is an error in
  # the computation or the values are inconsistent.
  if(!(Text::NSP::Measures::2D::computeMarginalTotals(\%values)) ){
    return;
  }

  # computes the observed values from the frequency
  # combination values. returns undef if there is an
  # error in the computation or the values are inconsistent.
  if( !(Text::NSP::Measures::2D::computeObservedValues(\%values)) ) {
    return;
  }

  # Now for the actual calculation of the association measure
  my $NewMeasure = 0;

  $NewMeasure += $n11;
  $NewMeasure += $n12;
  $NewMeasure += $n21;
  $NewMeasure += $n22;

  return ( $NewMeasure );
}

sub getStatisticName
{
  return "Sum";
}

1;
__END__
INPUT PARAMS : none
RETURN VALUES : none
RETURN VALUES : none
INPUT PARAMS : none
RETURN VALUES : errorCode .. The current error code.
INPUT PARAMS : none
RETURN VALUES : errorMessage .. The current error message.
INPUT PARAMS : none
RETURN VALUES : none
Ted Pedersen, University of Minnesota Duluth <tpederse@d.umn.edu>
Satanjeev Banerjee, Carnegie Mellon University <satanjeev@cmu.edu>
Amruta Purandare, University of Pittsburgh <amruta@cs.pitt.edu>
Bridget Thomson-McInnes, University of Minnesota Twin Cities <bthompson@d.umn.edu>
Saiyam Kohli, University of Minnesota Duluth <kohli003@d.umn.edu>
Last updated: $Id: Measures.pm,v 1.15 2006/03/25 04:21:22 saiyam_kohli Exp $
http://groups.yahoo.com/group/Ngram/
http://www.d.umn.edu/~tpederse/nsp.html
Copyright (C) 2000-2006, Ted Pedersen, Satanjeev Banerjee, Amruta Purandare, Bridget Thomson-McInnes and Saiyam Kohli
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program; if not, write to
The Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
Note: a copy of the GNU General Public License is available on the web at http://www.gnu.org/licenses/gpl.txt and is included in this distribution as GPL.txt.