NAME

Text::SenseClusters::LabelEvaluation - Module for evaluating the labels of clusters.

SYNOPSIS

        The following code snippet evaluates the labels by comparing
        them with text data for a gold-standard key from Wikipedia.

        # Including the LabelEvaluation module.
        use Text::SenseClusters::LabelEvaluation::LabelEvaluation;
        # Including the FileHandle module.
        use FileHandle;

        # File that will contain the label information.
        my $labelFileName = "temp_label.txt";

        # Opening the label file for writing.
        my $labelFileHandle = FileHandle->new($labelFileName, "w")
                or die "Could not open $labelFileName: $!";

        # Writing into the label file.
        print $labelFileHandle "Cluster 0 (Descriptive): George Bush, Al Gore, White House,".
                                " COMMENTARY k, Cox News, George W, BRITAIN London, U S, ".
                                "Prime Minister, New York \n\n";
        print $labelFileHandle "Cluster 0 (Discriminating): George Bush, COMMENTARY k, Cox ".
                                "News, BRITAIN London \n\n";
        print $labelFileHandle "Cluster 1 (Descriptive): U S, Al Gore, White House, more than, ".
                                "George W, York Times, New York, Prime Minister, President ".
                                "<head>B_T</head>, the the \n\n";
        print $labelFileHandle "Cluster 1 (Discriminating): more than, York Times, President ".
                                "<head>B_T</head>, the the \n";
                                                        
        # File that will contain the topic information.
        my $topicFileName = "temp_topic.txt";

        # Opening the topic file for writing.
        my $topicFileHandle = FileHandle->new($topicFileName, "w")
                or die "Could not open $topicFileName: $!";

        # Writing the comma-separated topics (keys) into the topic file.
        print $topicFileHandle "Bill Clinton  ,   Tony  Blair \n";

        # Closing the handles.
        close($labelFileHandle);
        close($topicFileHandle);

        # Options for the LabelEvaluation module: the names of the
        # label and topic files.
        my %inputOptions = (
                        labelFile => $labelFileName,
                        labelKeyFile => $topicFileName
        );

        # Calling the LabelEvaluation module with a reference to the
        # options hash.
        my $score = Text::SenseClusters::LabelEvaluation::LabelEvaluation->
                        new (\%inputOptions);
                

        # Printing the score.                   
        print "\nScore of label evaluation is :: $score \n";

        # Deleting the temporary label and topic files.
        unlink $labelFileName or warn "Could not unlink $labelFileName: $!";                                                            
        unlink $topicFileName or warn "Could not unlink $topicFileName: $!";

DESCRIPTION

        This program compares the results obtained from SenseClusters with
        gold standards. The gold standards are obtained from two independent and
        reliable sources:
                        1. Wikipedia
                        2. WordNet

        To fetch the Wikipedia data it uses the WWW::Wikipedia module from CPAN,
        and to compare the labels with the gold standards it uses the
        Text::Similarity module. The comparison result is then processed further
        to obtain the final result and its score.
                        

Result:

        a) Decision Matrix:
                 Based on the similarity comparison of the labels with the gold
                 standards, the decision matrix is calculated as below:

        For example:
        ===========================================================================
                        |   Cluster0    |   Cluster1    |   Row Total
        ---------------------------------------------------------------------------
        Topic#1         |   271         |   2713        |   2984
        ---------------------------------------------------------------------------
        Topic#2         |   2396        |   306         |   2702
        ---------------------------------------------------------------------------
        Col Total       |   2667        |   3019        |   5686
        ===========================================================================

        b) Calculated decision Matrix:
                 Based on the decision matrix, a new calculated matrix is printed.
                 Each cell in the new matrix contains the probability value:

                                CELL_VALUE_IN_DECISION_MATRIX / TOTAL_SCORE_OF_DECISION_MATRIX

        For example:
                For cell : Cluster0 - Topic#1
                        Value = 271 / 5686 = 0.048

                 Based on the above decision matrix, the new calculated matrix is:
        ========================================================================
                        |   Cluster0    |   Cluster1
        ------------------------------------------------------------------------
        Topic#1         |   0.048       |   0.477
        ------------------------------------------------------------------------
        Topic#2         |   0.421       |   0.054
        ------------------------------------------------------------------------
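
The normalization step can be sketched in a few lines of Perl. This is only an illustration using the example values; the hash layout and the name normalize_matrix are assumptions made for this sketch, not the module's internals.

```perl
use strict;
use warnings;

# Raw similarity scores from the example decision matrix,
# keyed by topic and then by cluster.
my %decision = (
    'Topic#1' => { Cluster0 => 271,  Cluster1 => 2713 },
    'Topic#2' => { Cluster0 => 2396, Cluster1 => 306  },
);

# Hypothetical helper: divide every cell by the sum of all cells.
sub normalize_matrix {
    my ($matrix) = @_;

    my $total = 0;
    for my $row (values %$matrix) {
        $total += $_ for values %$row;
    }
    die "Decision matrix is empty or sums to zero\n" unless $total;

    my %prob;
    for my $topic (keys %$matrix) {
        for my $cluster (keys %{ $matrix->{$topic} }) {
            $prob{$topic}{$cluster} = $matrix->{$topic}{$cluster} / $total;
        }
    }
    return \%prob;
}

my $prob = normalize_matrix(\%decision);

# Print one row per topic, rounded to three decimal places.
for my $topic (sort keys %$prob) {
    printf "%-8s %s\n", $topic,
        join '  ',
        map { sprintf '%s=%.3f', $_, $prob->{$topic}{$_} }
        sort keys %{ $prob->{$topic} };
}
```
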


        c) Interpreting the calculated decision matrix:

                        1. Row-Wise Comparison
                                For each topic, the "row scores" are compared and the cluster
                                with the maximum value is assigned to that topic.
                                For example:
                                        a) Topic#1      Cluster1     (max-row-score = 0.477 )
                                        b) Topic#2      Cluster0     (max-row-score = 0.421 )

                        2. Col-Wise Comparison
                                For each cluster, the "col scores" are compared and the topic
                                with the maximum value is assigned to that cluster.
                                For example:
                                        a) Cluster0     Topic#2     (max-col-score = 0.421 )
                                        b) Cluster1     Topic#1     (max-col-score = 0.477 )

        d) Deriving the final conclusion from the above two comparisons:

                The results of the row-wise and column-wise comparisons are matched
                against each other, and only the assignments on which both agree
                are printed.

                For example:
                        1. Row-Wise Comparison
                                 a) Topic#1     Cluster1
                                 b) Topic#2     Cluster0
                        2. Col-Wise Comparison
                                 a) Cluster0    Topic#2
                                 b) Cluster1    Topic#1

                Matching Result:
                                Cluster0        Topic#2
                                Cluster1        Topic#1
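
The row-wise and column-wise comparisons and their matching can be sketched as follows. This is a minimal illustration built on the example probabilities; the name matching_assignments and the tie handling (the first maximum in sorted order wins) are assumptions, since the text does not say how ties are resolved.

```perl
use strict;
use warnings;

# Probabilities from the example calculated decision matrix.
my %prob = (
    'Topic#1' => { Cluster0 => 0.048, Cluster1 => 0.477 },
    'Topic#2' => { Cluster0 => 0.421, Cluster1 => 0.054 },
);

# Hypothetical helper: return the cluster => topic pairs on which the
# row-wise and column-wise comparisons agree.
sub matching_assignments {
    my ($prob) = @_;
    my (%best_cluster_for, %best_topic_for);

    for my $topic (sort keys %$prob) {
        for my $cluster (sort keys %{ $prob->{$topic} }) {
            my $p = $prob->{$topic}{$cluster};

            # Row-wise: remember the best cluster seen so far for this topic.
            my $row_best = $best_cluster_for{$topic};
            $best_cluster_for{$topic} = $cluster
                if !defined $row_best || $p > $prob->{$topic}{$row_best};

            # Column-wise: remember the best topic seen so far for this cluster.
            my $col_best = $best_topic_for{$cluster};
            $best_topic_for{$cluster} = $topic
                if !defined $col_best || $p > $prob->{$col_best}{$cluster};
        }
    }

    # Keep a pair only when both comparisons point at each other.
    my %match;
    for my $cluster (keys %best_topic_for) {
        my $topic = $best_topic_for{$cluster};
        $match{$cluster} = $topic if $best_cluster_for{$topic} eq $cluster;
    }
    return \%match;
}

my $match = matching_assignments(\%prob);
print "$_\t$match->{$_}\n" for sort keys %$match;
```
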

        e) Overall score:
                        This is the product of the probability scores of all
                        matching cluster and topic pairs.

                        For example:
                                The score for the above example will be:
                                0.421 * 0.477 = 0.201
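
As a quick check of the arithmetic, the overall score can be reproduced with a short Perl sketch. The name overall_score and the choice to return 0 when no pairs match are assumptions made for illustration only.

```perl
use strict;
use warnings;

# Matched cluster/topic pairs from the example, with unrounded probabilities.
my @matched = (
    { cluster => 'Cluster0', topic => 'Topic#2', prob => 2396 / 5686 },
    { cluster => 'Cluster1', topic => 'Topic#1', prob => 2713 / 5686 },
);

# Hypothetical helper: multiply the probabilities of all matched pairs.
sub overall_score {
    my @pairs = @_;
    return 0 unless @pairs;    # assumption: no matches gives a score of 0

    my $score = 1;
    $score *= $_->{prob} for @pairs;
    return $score;
}

my $overall = overall_score(@matched);
printf "Overall score: %.3f\n", $overall;
```
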
                         
                        

Help

The LabelEvaluation module expects a reference to an options hash as its required argument. The options hash has the following elements:

1. labelFile: Name of the file containing the labels from SenseClusters. The syntax of the file must match that of a label file produced by SenseClusters. This option is mandatory.

2. labelKeyFile: Name of the file containing the comma-separated actual topics (keys) for the clusters. This option is mandatory.

3. labelKeyLength: The length of the data to be fetched from Wikipedia and used as reference data. The default is the first section of the Wikipedia page.

4. weightRatio: The weight given to the discriminating labels relative to the descriptive labels. The default value is 10.

5. stopList: Name of the file containing the list of all stop words. This option is optional.

6. isClean: Controls whether temporary files are kept or removed. The default value is true.

7. verbose: Shows detailed output. The default value is false.

8. help: Shows details about running this module. This option is optional.

        my %inputOptions = (

                labelFile => '<filelocation>/<SenseClusterLabelFileName>',
                labelKeyFile => '<filelocation>/<ActualTopicName>',
                labelKeyLength => '<LengthOfDataFetchedFromWikipedia>',
                weightRatio => '<WeightageRatioOfDiscriminatingToDescriptiveLabel>',
                stopList => '<filelocation>/<StopListFileLocation>',
                isClean => 1,
                verbose => 1,
                help => 'help'
        );
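
To make the labelFile syntax concrete, here is a rough sketch of how such lines could be read. It assumes the "Cluster N (Descriptive|Discriminating): term, term, ..." layout used in the SYNOPSIS; parse_label_lines is a hypothetical helper and not part of the module.

```perl
use strict;
use warnings;

# Hypothetical helper: parse label lines into
# { cluster number => { label type => [ terms ] } }.
sub parse_label_lines {
    my @lines = @_;
    my %labels;

    for my $line (@lines) {
        # Skip blank lines and anything that is not a label line.
        next unless $line =~
            /^Cluster\s+(\d+)\s+\((Descriptive|Discriminating)\):\s*(.*)$/;
        my ($cluster, $type, $rest) = ($1, $2, $3);

        # Split on commas and trim whitespace around each term.
        my @terms;
        for my $term (split /,/, $rest) {
            $term =~ s/^\s+|\s+$//g;
            push @terms, $term if length $term;
        }
        $labels{$cluster}{$type} = \@terms;
    }
    return \%labels;
}

my $labels = parse_label_lines(
    "Cluster 0 (Discriminating): George Bush, COMMENTARY k, Cox News, BRITAIN London \n",
    "\n",
    "Cluster 1 (Discriminating): more than, York Times \n",
);

for my $cluster (sort keys %$labels) {
    for my $type (sort keys %{ $labels->{$cluster} }) {
        print "Cluster $cluster ($type): ",
            join(' | ', @{ $labels->{$cluster}{$type} }), "\n";
    }
}
```
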
        

function: makeDecisionOfSense

This function evaluates the labels.

@argument1 : LabelSenseClusters: DataType(Reference to HashOfHash)

@argument2 : StandardReferenceName: DataType(String) Name of the external application. Currently, its two possible values are: 1. Wikipedia 2. WordNet

@argument3 : StandardTerms: DataType(String) Comma-separated terms to be sent to Wikipedia or WordNet to get the gold standard labels.

@return : Score : DataType(Float) Measure of the overlap of the current label mechanism with the gold standard labels.

@description :

1. It goes through the hash that contains the clusters and their label terms.

2. Each cluster's label terms are written to a file named after the cluster name (or number).

3. It then goes through the standard terms against which the cluster labels are compared.

4. For each term, a file named after the term is created; its content is the data fetched from Wikipedia for that topic.

5. The cluster data and the topic data are compared using the method from Text::Similarity::Overlaps.

6. Finally, the calculated scores are used to build the decision matrix and to obtain the final score.
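
The comparison loop at the heart of this function can be sketched in core Perl. Note the assumptions: overlap_score is a crude word-overlap count standing in for Text::Similarity::Overlaps (it is not that module's algorithm), the reference texts are invented placeholders for data the module would fetch from Wikipedia, and strings are compared in memory rather than through files.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the similarity measure: count the distinct
# words shared by two texts, ignoring case.
sub overlap_score {
    my ($text_a, $text_b) = @_;

    my %in_a = map { ( lc($_) => 1 ) } $text_a =~ /\w+/g;
    my %shared;
    for my $word ($text_b =~ /\w+/g) {
        $shared{ lc $word } = 1 if $in_a{ lc $word };
    }
    return scalar keys %shared;
}

# Label terms per cluster.
my %labels = (
    Cluster0 => 'George Bush, Al Gore, White House',
    Cluster1 => 'Prime Minister, York Times, New York',
);

# Reference text per standard term (invented for this sketch).
my %reference = (
    'Bill Clinton' => 'Clinton served in the White House before George Bush',
    'Tony Blair'   => 'Blair was Prime Minister of the United Kingdom',
);

# Score every cluster against every topic.
my %score;
for my $cluster (sort keys %labels) {
    for my $topic (sort keys %reference) {
        $score{$cluster}{$topic} =
            overlap_score($labels{$cluster}, $reference{$topic});
        print "$cluster vs $topic: $score{$cluster}{$topic}\n";
    }
}
```

The resulting hash of hashes has the shape described for hashForClusterTopicScoreRef in printDecisionMatrix.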

function: printDecisionMatrix

This function prints the decision matrix.

@argument1 : clusterNameArrayRef: DataType(Reference_Of_Array) Reference to the array containing the cluster names.

@argument2 : standardTermsArrayRef: DataType(Reference_Of_Array) Reference to the array containing the standard terms.

@argument3 : hashForClusterTopicScoreRef: DataType(Reference_Of_Hash) Reference to the hash containing each cluster name, the corresponding standard topic, and its score.

@return1 : topicTotalSumHash: DataType(Reference_Of_Hash) Hash containing the total score for each topic across all clusters.

@return2 : clusterTotalSumHash: DataType(Reference_Of_Hash) Hash containing the total score for each cluster across all topics.

@description :

1. It goes through the hash that contains the similarity score of each cluster against the standard label terms.

2. It uses that hash to print the decision matrix. An example of the decision matrix is given below.

3. It also uses the scoring hash to build new hashes that store a) the total score for each cluster across all topics and b) the total score for each topic across all clusters.

Example of decision Matrix

        ==============================================================================
                        |   Cluster0    |   Cluster1    |
        ------------------------------------------------------------------------------
        Bill Clinton:   |   11          |   12          |   23 (ROW TOTAL)
        ------------------------------------------------------------------------------
        Tony Blair:     |   15          |   9           |   24 (ROW TOTAL)
        ------------------------------------------------------------------------------
        Total           |   26          |   21          |   47
                        |   (COL TOTAL) |   (COL TOTAL) |   (Total Matrix Sum)

        Where,
                1) Cluster0, Cluster1 are cluster names.
                2) Bill Clinton, Tony Blair are standard topics.
                3) 23, 24 are the row totals of the topic scores.            (ROW TOTAL)
                4) 26, 21 are the column totals of the cluster scores.       (COL TOTAL)
                5) 47 is the total sum of the scores of all clusters against all topics.
                        (Total Matrix Sum)
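
The row, column, and matrix totals from this example can be computed as in the following sketch. The name matrix_totals and the exact layout of the score hash (cluster name, then topic) are assumptions for illustration; the module's own data structures may differ.

```perl
use strict;
use warnings;

my @clusters = ('Cluster0', 'Cluster1');
my @topics   = ('Bill Clinton', 'Tony Blair');

# Similarity scores from the example, keyed by cluster and then by topic.
my %score = (
    Cluster0 => { 'Bill Clinton' => 11, 'Tony Blair' => 15 },
    Cluster1 => { 'Bill Clinton' => 12, 'Tony Blair' => 9  },
);

# Hypothetical helper: return per-topic totals, per-cluster totals,
# and the sum of the whole matrix.
sub matrix_totals {
    my ($clusters, $topics, $score) = @_;
    my (%topic_total, %cluster_total);
    my $matrix_total = 0;

    for my $topic (@$topics) {
        for my $cluster (@$clusters) {
            my $cell = $score->{$cluster}{$topic} // 0;   # missing cell counts as 0
            $topic_total{$topic}     += $cell;
            $cluster_total{$cluster} += $cell;
            $matrix_total            += $cell;
        }
    }
    return (\%topic_total, \%cluster_total, $matrix_total);
}

my ($topic_total, $cluster_total, $matrix_total) =
    matrix_totals(\@clusters, \@topics, \%score);

print "Row total for $_: $topic_total->{$_}\n"   for @topics;
print "Col total for $_: $cluster_total->{$_}\n" for @clusters;
print "Total matrix sum: $matrix_total\n";
```
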

BUGS

  • Label and topic values can only be supplied through files. The module should also accept them as string values.

  • Comparison against WordNet gold standards is not currently supported.

SEE ALSO

http://senseclusters.cvs.sourceforge.net/viewvc/senseclusters/LabelEvaluation/

@Last modified by : Anand Jha @Last_Modified_Date : 24th Dec. 2012 @Modified Version : 1.15

AUTHORS

        Ted Pedersen, University of Minnesota, Duluth
        tpederse at d.umn.edu

        Anand Jha, University of Minnesota, Duluth
        jhaxx030 at d.umn.edu

COPYRIGHT AND LICENSE

Copyright (C) 2012 Ted Pedersen, Anand Jha

See http://dev.perl.org/licenses/ for more information.

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to:

        The Free Software Foundation, Inc., 59 Temple Place, Suite 330, 
        Boston, MA  02111-1307  USA