
Name

Text::SenseClusters::LabelEvaluation::Driver - Module for evaluating the labels of clusters.

SYNOPSIS

        The following code snippet evaluates cluster labels by comparing
        them with the text of gold-standard keys fetched from Wikipedia.

        # Including the LabelEvaluation Module.
        use Text::SenseClusters::LabelEvaluation::Driver;
        # Including the FileHandle module.
        use FileHandle;

        # File that will contain the label information.
        my $labelFileName = "temp_label.txt";

        # Defining the file handle for the label file.
        our $labelFileHandle = FileHandle->new(">$labelFileName");

        # Writing into the label file.
        print $labelFileHandle "Cluster 0 (Descriptive): George Bush, Al Gore, White House,". 
                                " COMMENTARY k, Cox News, George W, BRITAIN London, U S, ".
                                "Prime Minister, New York \n\n";
        print $labelFileHandle "Cluster 0 (Discriminating): George Bush, COMMENTARY k, Cox ".
                                "News, BRITAIN London \n\n";
        print $labelFileHandle "Cluster 1 (Descriptive): U S, Al Gore, White House, more than,". 
                                "George W, York Times, New York, Prime Minister, President ".
                                "<head>B_T</head>, the the \n\n";
        print $labelFileHandle "Cluster 1 (Discriminating): more than, York Times, President ".
                        "<head>B_T</head>, the the \n";
                                                        
        # File that will contain the topic information.
        my $topicFileName = "temp_topic.txt";

        # Defining the file handle for the topic file.
        our $topicFileHandle = FileHandle->new(">$topicFileName");

        # Writing the topic names, comma-separated, into the topic file.
        print $topicFileHandle "Bill Clinton  ,   Tony  Blair \n";

        # Closing the handles.
        close($labelFileHandle);                                                                
        close($topicFileHandle);                                                                

        # Defining the options hash to be passed to the LabelEvaluation module.
        my %inputOptions = (
                senseClusterLabelFileName => $labelFileName, 
                labelComparisonMethod => 'AutomateAssignment',
                goldKeyFileName => $topicFileName,
                goldKeyDataSource => 'Wikipedia',
                weightRatio => 10,
                stopListFileLocation => 'stoplist.txt',
        );


        # Creating the driver object by passing the options hash
        # defined above.
        my $driverObject = Text::SenseClusters::LabelEvaluation::Driver->
                        new (\%inputOptions);
                
        if($driverObject->{"errorCode"}){
                print "Please correct the error before proceeding.\n\n";
                exit();
        }
        my $accuracyScore = $driverObject->evaluateLabels();
        
        # Printing the score.                   
        print "\nScore of label evaluation is :: $accuracyScore \n";

        # Deleting the temporary label and topic files.
        unlink $labelFileName or warn "Could not unlink $labelFileName: $!";                                                            
        unlink $topicFileName or warn "Could not unlink $topicFileName: $!";
        

Note: For more usage examples, please refer to the test cases in the "t" directory of this package.

DESCRIPTION

        This module compares the labels obtained from SenseClusters against 
        gold standards. Gold standards can be obtained from:
                        1. Wikipedia
                        2. WordNet
                        3. User-provided data
                        
        For fetching Wikipedia data it uses the WWW::Wikipedia module from CPAN, 
        and for comparing the labels with the gold standards it uses the 
        Text::Similarity module. The comparison scores are then processed 
        further to produce the final result and its accuracy score.
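
        For illustration, here is a minimal sketch (not the module's internal
        code) of how those two CPAN modules fit together; the topic name and
        the label string are examples only:

        use WWW::Wikipedia;
        use Text::Similarity::Overlaps;

        # Fetch the Wikipedia entry for a gold-standard key; text() returns
        # roughly the first section of the page.
        my $wiki  = WWW::Wikipedia->new();
        my $entry = $wiki->search('Bill Clinton');
        my $goldText = defined($entry) ? $entry->text() : '';

        # Score a cluster's label string against the fetched text.
        my $similarity = Text::Similarity::Overlaps->new( { normalize => 1 } );
        my $score = $similarity->getSimilarityStrings(
                        'Al Gore, White House, Prime Minister, New York',
                        $goldText );
        print "Similarity score: $score\n";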

RESULT:

   a) Contingency Matrix:       
                 Based on the similarity comparison of the labels with the gold
                 standards, a contingency matrix is generated. The following is
                 the contingency matrix for the example mentioned in the synopsis:

                Original Contingency Matrix: 
                 
                                        Bill Clinton    Tony Blair  
                -------------------------------------------------
                 Cluster0                   54              48
                -------------------------------------------------
                 Cluster1                   31              16
                ------------------------------------------------- 
        
        b) The Hungarian algorithm is then applied to produce a rearranged
                contingency matrix whose diagonal elements indicate the assigned
                similarity-score between a cluster and a gold-standard key. This
                arrangement maximizes the total of the diagonal elements (see the
                runnable sketch at the end of this section).
        
                Example:
                
                Contingency Matrix after the Hungarian Algorithm: 
                 
                                        Tony Blair      Bill Clinton  
                -------------------------------------------------
                 Cluster0                   48              54
                -------------------------------------------------
                 Cluster1                   16              31
                -------------------------------------------------
        

        c) Conclusion: Displays the conclusion of the Hungarian algorithm:
                        
                        Example:
                        
                        Final Conclusion using Hungarian Algorithm::
                                Cluster0        <-->    Tony  Blair 
                                Cluster1        <-->    Bill Clinton  
        
        
        d) Displaying the overall accuracy of the label assignment
                (also computed in the sketch below):
                
                                     Sum (Diagonal Scores)
                Accuracy =  ---------------------------------------------
                            Sum (All the scores of the contingency table)
                        
                Example:                                
                Accuracy of labels is 53.02%
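
        The following standalone sketch reproduces the numbers above. It
        assumes Algorithm::Munkres from CPAN as the Hungarian-algorithm
        solver (the module's own implementation may differ); the matrix
        values are the ones from the example:

        use Algorithm::Munkres;
        use List::Util qw(max sum);

        my @topics = ( 'Bill Clinton', 'Tony Blair' );
        my @score  = ( [ 54, 48 ],     # Cluster0
                       [ 31, 16 ] );   # Cluster1

        # Algorithm::Munkres minimizes total cost, so convert each score to
        # (max - score) to obtain the assignment maximizing the diagonal total.
        my $maxScore = max( map { @$_ } @score );
        my @cost = map { [ map { $maxScore - $_ } @$_ ] } @score;

        my @assignment;
        assign( \@cost, \@assignment );

        # Sum the diagonal under the computed mapping and report accuracy.
        my $diagonal = 0;
        for my $cluster ( 0 .. $#assignment ) {
                print "Cluster$cluster <--> $topics[ $assignment[$cluster] ]\n";
                $diagonal += $score[$cluster][ $assignment[$cluster] ];
        }
        my $total = sum( map { @$_ } @score );
        printf "Accuracy of labels is %.2f%%\n", 100 * $diagonal / $total;

        # Prints: Cluster0 <--> Tony Blair
        #         Cluster1 <--> Bill Clinton
        #         Accuracy of labels is 53.02%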

Help

The LabelEvaluation module expects an options hash as its required argument.

The options hash has the following elements:

1. senseClusterLabelFileName: Name of the file containing the labels generated by SenseClusters. The syntax of the file must be similar to a label file from SenseClusters. This is a mandatory parameter.

2. labelComparisonMethod: Name of the method for comparing the labels with the gold keys. It tells the program whether the key file provided by the user contains the mapping between the assigned labels and the expected topics of the clusters. Possible options are: A) 'DirectAssignment' and B) 'AutomateAssignment'. This is a mandatory parameter.

3. goldKeyFileName: Name of the file containing the actual topics (keys) and their data for the clusters. This is a mandatory parameter.

4. goldKeyLength: Specifies the length of the data to be fetched from an external resource such as Wikipedia; the fetched data is used as the reference data. By default, the first section of the Wikipedia page is used.

5. goldKeyDataSource: Name of the external application or user-supplied file from which the key's data is obtained. Currently supported options are: 1. 'Wikipedia' 2. 'User' 3. 'Wordnet' (to be supported in the future). This is a mandatory parameter.
        

6. weightRatio: The weight given to a discriminating label relative to a descriptive label. The default value is 10.

7. stopListFileLocation: Name of the file containing the list of stop words.

8. isClean: Decides whether to keep or delete temporary files. The default value is 'true'.

9. verbose: Decides whether to show detailed results to the user. The default value is Off (0); to turn it on, set the value to 1.

10. help: Decides whether to display help to the user. The default value is 0.

        %inputOptions = (
                senseClusterLabelFileName => '<filelocation>/<SenseClusterLabelFileName>', 
                labelComparisonMethod => 'DirectAssignmentOrAutomateAssignment',
                goldKeyFileName => '<filelocation>/<ActualTopicName>',
                goldKeyLength => '<LengthOfDataFetchedFromExternalResource>',
                goldKeyDataSource => '<NameOfSourceFromWhichTopicDataIsFetched>',
                weightRatio => '<WeightRatioOfDiscriminatingToDescriptiveLabel>',
                stopListFileLocation => '<filelocation>/<StopListFileLocation>',
                isClean => 1,
                verbose => 0,
                help => 0
        );

Examples

1. With minimum parameters:

                %inputOptions = (
                        senseClusterLabelFileName => 'labelFile.txt',
                        labelComparisonMethod => 'DirectAssignment',
                        goldKeyFileName => 'goldKeyFile.txt',
                        goldKeyDataSource => 'UserData'
                );

                The above-mentioned four parameters are mandatory.
                

2. For Help:

                %inputOptions = (
                        help => 1
                );
                

3. With all parameters:

                %inputOptions = (
                        senseClusterLabelFileName => 'labelFile.txt',
                        labelComparisonMethod => 'AutomateAssignment',
                        goldKeyFileName => 'goldKeyFile.txt',
                        goldKeyLength => 2000,
                        goldKeyDataSource => 'Wikipedia',
                        weightRatio => 10,
                        stopListFileLocation => 'stoplist.txt',
                        isClean => 0,
                        verbose => 1,
                        help => 0
                );

Constructor: new()

This is the constructor, which creates an object of this class. Reference: http://perldoc.perl.org/perlobj.html

This constructor takes a hash argument and initializes the object with it:

                %inputOptions = (
                        senseClusterLabelFileName => 'value1', 
                        labelComparisonMethod => 'value2',
                        goldKeyFileName => 'value3',
                        goldKeyLength => value4,
                        goldKeyDataSource => 'value5',
                        weightRatio => value6,
                        stopListFileLocation => 'value7',
                        isClean => value8,
                        verbose => value9,
                        help => value10
                );
                

Please refer to the "Help" section for a detailed discussion of this hash.

Function: evaluateLabels

Function responsible for evaluating the labels of the clusters. It calls the other modules to complete the process.

@argument : $driverObject : Object of this driver class.

@return : $accuracy : (Float) Indicates the overall accuracy of the assignments.

@description :

                The overall algorithm for calculating the accuracy of the label assignment with the help of gold 
                standard keys is:
                
                Step 1: Read the clusters and their label information from the ClusterLabel file.
                        
                Case A: The user has provided the mapping information between the clusters and the gold standard keys.
                                Step 2: Read the cluster-to-topic mapping information.
                                
                                Subcase 1: The user provides the data for the gold standard keys.
                                                        
                                                        Step 3: Read the gold standard keys and their data from the file provided by the user.
                                                        Step 4: Continue to the next step.
                                                        
                                Subcase 2: The user provides the gold standard keys; data is fetched from Wikipedia.
                                                   The user provides only the topics, with no mapping.
                                                   
                                                        Step 3: Read the gold standard keys from the file provided by the user.
                                                        Step 4: Read the data about the gold standard keys from Wikipedia.
                                                        
                                Subcase 3: The user provides the gold standard keys; data is fetched from WordNet.
                                
                                                        Step 3: Read the gold standard keys from the file provided by the user.
                                                        Step 4: Read the data about the gold standard keys from WordNet.
                                
                                Step 5: Create the contingency matrix with the similarity scores of each cluster's labels against each 
                                                 gold standard key's data (obtained in steps 3 and 4).
                                Step 6: Use the mapping provided by the user (step 2) to calculate the diagonal score of the 
                                                 contingency matrix.
                                Step 7: The overall accuracy of the cluster label assignment is then:
                                
                                                                         Sum (Diagonal Scores)
                                        Accuracy =  ---------------------------------------------
                                                    Sum (All the scores of the contingency table)
                                                                                        
                Case B: The user has not provided the mapping information between the clusters and the gold standard keys.
                                 The Hungarian algorithm is used to compute the mapping.
                                        
                                Subcase 1: The user provides the data for the gold standard keys.
                                                        
                                                        Step 2: Read the gold standard keys and their data from the file provided by the user.
                                                        Step 3: Continue to the next step.
                                                        
                                Subcase 2: The user provides the gold standard keys; data is fetched from Wikipedia.
                                                   The user provides only the topics, with no mapping.
                                                        
                                                        Step 2: Read the gold standard keys from the file provided by the user.
                                                        Step 3: Read the data about the gold standard keys from Wikipedia.
                                                        
                                Subcase 3: The user provides the gold standard keys; data is fetched from WordNet.
                                
                                                        Step 2: Read the gold standard keys from the file provided by the user.
                                                        Step 3: Read the data about the gold standard keys from WordNet.

                                Step 4: Create the contingency matrix with the similarity scores of each cluster's labels against each 
                                                 gold standard key's data (obtained in steps 2 and 3).
                                Step 5: Use the Hungarian algorithm to determine the mapping of the clusters to the gold standard keys.  
                                Step 6: Use this mapping to calculate the total diagonal score of the rearranged contingency matrix. 
                                Step 7: The overall accuracy of the cluster label assignment is then:
                                
                                                                         Sum (Diagonal Scores)
                                        Accuracy =  ---------------------------------------------
                                                    Sum (All the scores of the contingency table)

function: makeContigencyMatrix

This method is responsible for building the contingency matrix containing the similarity scores of the labels against the data of the gold standard keys.

@argument : $labelSenseClustersHashRef (Hash containing the labels generated by SenseClusters)
@argument : $topicDataHashRef (Hash containing the data of the gold standard keys)
@argument : $weightageRatio (The weight to be given to the discriminating labels over the descriptive labels from SenseClusters)

@return : 1. @matrixScore - Contingency matrix containing the similarity scores.
          2. @colHeader - Array containing the column headers of the contingency matrix.
          3. @rowHeader - Array containing the row headers of the contingency matrix.
          4. $totalMatrixScore - Total of all similarity scores in the contingency matrix.

@description :
          1). Iterates through the hash (%labelSenseClustersHash) and extracts the descriptive and discriminating labels for each cluster.
          2). Reads the data for each gold standard key from the hash (%topicDataHash).
          3). Uses the module Text::SenseClusters::LabelEvaluation::SimilarityScore to get the various similarity scores.
          4). Finally, uses the raw-lesk scores to build the contingency matrix.
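
As a rough sketch only: the loop below builds such a matrix using Text::Similarity::Overlaps standing in for the SimilarityScore module. The hash contents are illustrative, and the weighting (descriptive score plus $weightageRatio times the discriminating score) is an assumption based on item 6 of the Help section, not the module's confirmed internals.

        use Text::Similarity::Overlaps;

        # Illustrative inputs; real data comes from the label and key files.
        my %labelSenseClustersHash = (
            'Cluster0' => { 'Descriptive'    => 'George Bush, Al Gore, White House',
                            'Discriminating' => 'George Bush, COMMENTARY k' },
        );
        my %topicDataHash  = ( 'Bill Clinton' => '... gold-standard key text ...' );
        my $weightageRatio = 10;

        my $ts = Text::Similarity::Overlaps->new( { normalize => 0 } );
        my @colHeader = sort keys %topicDataHash;
        my ( @rowHeader, @matrixScore );
        my $totalMatrixScore = 0;

        for my $cluster ( sort keys %labelSenseClustersHash ) {
            push @rowHeader, $cluster;
            my @row;
            for my $topic (@colHeader) {
                my ($descScore) = $ts->getSimilarityStrings(
                    $labelSenseClustersHash{$cluster}{'Descriptive'},
                    $topicDataHash{$topic} );
                my ($discScore) = $ts->getSimilarityStrings(
                    $labelSenseClustersHash{$cluster}{'Discriminating'},
                    $topicDataHash{$topic} );
                # Assumed weighting: discriminating labels count
                # $weightageRatio times as much as descriptive ones.
                my $cellScore = $descScore + $weightageRatio * $discScore;
                push @row, $cellScore;
                $totalMatrixScore += $cellScore;
            }
            push @matrixScore, \@row;
        }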

Function: calculateAccuracy

Method used to calculate the accuracy score for the labels generated by SenseClusters (or others).

@argument1 : $mappingHashRef (Reference to the hash containing the mapping information between the clusters and the gold standard keys)
@argument2 : $matrixScoreRef (2-D array/matrix containing the similarity scores of the labels)
@argument3 : $colHeaderRef (Reference to the array containing the column headers)
@argument4 : $rowHeaderRef (Reference to the array containing the row headers)
@argument5 : $totalMatrixScore (Total similarity score of the labels against the gold standard)

@return : Returns the overall accuracy of the labels assigned by SenseClusters.

@description :
          1). With the help of $mappingHashRef, $matrixScoreRef, $colHeaderRef and $rowHeaderRef, this function calculates the sum of all diagonal elements.
          2). It then calculates the accuracy of the assignment as

                                Sum (Diagonal Scores)
                Accuracy =  -------------------------
                                Sum (All the Scores)
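
A minimal sketch of this computation, using plain arrays and the example values from the RESULT section; the variable names mirror the arguments above but are dereferenced for readability:

        use List::Util qw(sum);

        my %mappingHash = ( 'Cluster0' => 'Tony Blair',
                            'Cluster1' => 'Bill Clinton' );
        my @rowHeader   = ( 'Cluster0', 'Cluster1' );
        my @colHeader   = ( 'Bill Clinton', 'Tony Blair' );
        my @matrixScore = ( [ 54, 48 ],
                            [ 31, 16 ] );

        # Map each column header to its index so the mapped topic of each
        # cluster can be looked up in the matrix.
        my %colIndex = map { $colHeader[$_] => $_ } 0 .. $#colHeader;

        my $diagonalScore = 0;
        for my $row ( 0 .. $#rowHeader ) {
            my $topic = $mappingHash{ $rowHeader[$row] };
            $diagonalScore += $matrixScore[$row][ $colIndex{$topic} ];
        }

        my $totalMatrixScore = sum( map { @$_ } @matrixScore );
        my $accuracy = $diagonalScore / $totalMatrixScore;   # 79/149 = 0.5302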
                                                                

BUGS

  • Comparison against WordNet gold standards is not yet supported.

SEE ALSO

http://senseclusters.cvs.sourceforge.net/viewvc/senseclusters/LabelEvaluation/

Last modified by : $Id: Driver.pm,v 1.2 2013/02/14 03:50:08 jhaxx030 Exp $

AUTHORS

        Anand Jha, University of Minnesota, Duluth
        jhaxx030 at d.umn.edu

        Ted Pedersen, University of Minnesota, Duluth
        tpederse at d.umn.edu

COPYRIGHT AND LICENSE

Copyright (C) 2012-2013 Ted Pedersen, Anand Jha

See http://dev.perl.org/licenses/ for more information.

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to:

        The Free Software Foundation, Inc., 59 Temple Place, Suite 330, 
        Boston, MA  02111-1307  USA