Name

Text::SenseClusters::LabelEvaluation::Driver - Module for evaluation of labels of the clusters.

SYNOPSIS

        The following code snippet evaluates the labels by comparing
        them with text data for a gold-standard key from Wikipedia.

        # Including the LabelEvaluation Module.
        use Text::SenseClusters::LabelEvaluation::Driver;
        # Including the FileHandle module.
        use FileHandle;

        # File that will contain the label information.
        my $labelFileName = "temp_label.txt";

        # Defining the file handle for the label file.
        our $labelFileHandle = FileHandle->new(">$labelFileName");

        # Writing into the label file.
        print $labelFileHandle "Cluster 0 (Descriptive): George Bush, Al Gore, White House,". 
                                " COMMENTARY k, Cox News, George W, BRITAIN London, U S, ".
                                "Prime Minister, New York \n\n";
        print $labelFileHandle "Cluster 0 (Discriminating): George Bush, COMMENTARY k, Cox ".
                                "News, BRITAIN London \n\n";
        print $labelFileHandle "Cluster 1 (Descriptive): U S, Al Gore, White House, more than,". 
                                "George W, York Times, New York, Prime Minister, President ".
                                "<head>B_T</head>, the the \n\n";
        print $labelFileHandle "Cluster 1 (Discriminating): more than, York Times, President ".
                        "<head>B_T</head>, the the \n";
                                                        
        # File that will contain the topic information.
        my $topicFileName = "temp_topic.txt";

        # Defining the file handle for the topic file.
        our $topicFileHandle = FileHandle->new(">$topicFileName");

        # Writing into the Topic file.
        # Bill Clinton  ,   Tony  Blair 
        print $topicFileHandle "Bill Clinton  ,   Tony  Blair \n";

        # Closing the handles.
        close($labelFileHandle);                                                                
        close($topicFileHandle);                                                                

        # Options passed to the LabelEvaluation module.
        my %inputOptions = (
                senseClusterLabelFileName => $labelFileName, 
                labelComparisonMethod => 'automate',
                goldKeyFileName => $topicFileName,
                goldKeyDataSource => 'wikipedia',
                weightRatio => 10,
                isClean =>1,
        );


        # Creating the driver object by passing a reference to the
        # options hash.
        my $driverObject = Text::SenseClusters::LabelEvaluation::Driver->
                        new (\%inputOptions);
                
        if($driverObject->{"errorCode"}){
                print "Please correct the error before proceeding.\n\n";
                exit();
        }
        my $accuracyScore = $driverObject->evaluateLabels();
        
        # Printing the score.                   
        print "\nScore of label evaluation is :: $accuracyScore \n";

        # Deleting the temporary label and topic files.
        unlink $labelFileName or warn "Could not unlink $labelFileName: $!";                                                            
        unlink $topicFileName or warn "Could not unlink $topicFileName: $!";
        

Note: For more usage examples, please refer to the test cases in the "t" folder of this package.

DESCRIPTION

        This program compares the results obtained from SenseClusters with
        gold standards. Gold standards can be obtained from:
                        1. Wikipedia
                        2. Wordnet
                        3. User Provided
                        
        To fetch the Wikipedia data it uses the WWW::Wikipedia module from CPAN,
        and to compare labels with gold standards it uses the Text::Similarity
        module. The comparison result is then processed further to obtain the
        final result and its score.



        
        FILE FORMATS:

        1. senseClusterLabelFileName:
        ---------------------------------       
                This is the name of the file that contains the labels for the clusters generated by SenseClusters.
                The format of this file should be the same as that of the label file generated by SenseClusters.
                
                For example:
                
                Cluster 0 (Descriptive): George Bush, Russian President, British Prime, British Minister, India Pakistan, US George, Prime Minister, 
                Cluster 0 (Discriminating): Russian President, British Minister, India Pakistan, US George, 
                Cluster 1 (Descriptive): George Bush, British Prime, weapons mass, United Nations, September 11, mass destruction, United States, 
                                        Prime Minister, military action
                Cluster 1 (Discriminating): United Nations, September 11, United States
                Cluster 2 (Descriptive): George Bush, weapons destruction, prime minister, axis evil, Saddam Hussein, weapons mass, mass destruction, 
                                        Gulf War, military action, Iraqi leader
                Cluster 2 (Discriminating): weapons destruction, prime minister, axis evil, Saddam Hussein, Gulf War, Iraqi leader
                
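A minimal sketch of how label lines in this format could be read into a Perl hash. The helper name parse_label_lines and the hash layout are assumptions made for illustration, not the module's own code, and the sketch assumes each label list sits on a single line.

```perl
use strict;
use warnings;

# Hypothetical helper: parse SenseClusters-style label lines into
#   $labels->{Cluster<N>}{Descriptive|Discriminating} = [ label, ... ]
sub parse_label_lines {
    my @lines = @_;
    my %labels;
    for my $line (@lines) {
        # Skip blank lines and anything that is not a label line.
        next unless $line =~ /^\s*Cluster\s+(\d+)\s+\((Descriptive|Discriminating)\):\s*(.*)$/;
        my ($cluster, $type, $rest) = ($1, $2, $3);

        # Labels are comma separated; trim whitespace and drop the empty
        # entry left behind by a trailing comma.
        my @items;
        for my $item (split /,/, $rest) {
            $item =~ s/^\s+|\s+$//g;
            push @items, $item if length $item;
        }
        $labels{"Cluster$cluster"}{$type} = \@items;
    }
    return \%labels;
}

my $labels = parse_label_lines(
    "Cluster 0 (Descriptive): George Bush, Russian President, Prime Minister, ",
    "Cluster 0 (Discriminating): Russian President, India Pakistan, ",
    "Cluster 1 (Discriminating): United Nations, September 11, United States",
);

# Prints: Russian President | India Pakistan
print join(" | ", @{ $labels->{Cluster0}{Discriminating} }), "\n";
```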

        2. goldKeyFileName:
        -----------------------
        This parameter contains the name of the file that holds the gold standard keys for the labels of the clusters
        generated by SenseClusters.
                
        The format of the gold-standard key file provided by the user depends on the following
        two parameters that the user passes when calling this module:
        
                1. labelComparisonMethod
                --------------------------
                        This parameter tells whether the user is passing the mapping information between
                        goldkeys and clusters or not.
                        
                        Two options are available:
                                1. 'direct'   - the user will provide the mapping info.
                                2. 'automate' - the module should find the best possible
                                                mapping between the clusters' labels and goldkeys.
                         
                2. goldKeyDataSource  
                -------------------------
                        This parameter tells the module where it can read more information about
                        the goldkeys.
                        
                        Options for this parameter are:
                                1. 'wikipedia' - fetch the data from Wikipedia.
                                2. 'wordnet'   - fetch the data from Wordnet.
                                3. 'userData'  - the user will give the data along
                                                 with the mapping.
                                                                                                        
                
        
        Combinations of the values for the above two parameters give the following six cases:

                (Please note that the separator between a cluster name and its Goldkey is ":::".
                The separator between a Goldkey and its data is also ":::".)
                
        Case 1.         labelComparisonMethod => 'direct',                      goldKeyDataSource => 'userData' 
                

                a) In this case the user should provide the mapping between the clusters and Goldkeys.
                b) The user should also provide the data about these gold standard keys.
                
                                For example:
                                                
                                Cluster0:::Tony Blair  
                                Cluster1:::Vladimir Putin 
                                Cluster2:::Saddam Hussein

                                Tony Blair::: Anthony Charles Lynton Blair (born 6 May 1953)[1] is a British Labour Party politician who served 
                                as the Prime Minister of the United Kingdom from 1997 to 2007. He was the Member of Parliament (MP) for Sedgefield 
                                from 1983 to 2007 and Leader of the Labour Party from 1994 to 2007. He resigned from all of these positions in 
                                June 2007.
                                
                                Vladimir Putin::: Vladimir Vladimirovich Putin (born 7 October 1952) is a Russian politician
                                who has been the President of Russia since 7 May 2012. Putin previously served as President from 2000 to 2008, and  
                                as Prime Minister of Russia from 1999 to 2000 and again from 2008 to 2012. Putin was also previously the Chairman  
                                of United Russia.
                                
                                Saddam Hussein::: Saddam Hussein Abd al-Majid al-Tikriti (28 April 1937[2] – 30 December 2006)[3] was the fifth
                                President of Iraq, serving in this capacity from 16 July 1979 until 9 April 2003.[4][5] A leading member of the 
                                revolutionary Arab Socialist Ba'ath Party.

        Case 2.         labelComparisonMethod => 'direct',                      goldKeyDataSource => 'wikipedia'
                
                a) In this case the user only needs to provide the mapping between the clusters and Goldkeys.
                b) The user does not need to provide the data about these gold standard keys. If the user does provide
                   data about these topics, it will be ignored.
                
                
                                 For example:
                                        Cluster0:::Tony Blair  
                                        Cluster1:::Vladimir Putin 
                                        Cluster2:::Saddam Hussein
                
                                        
        Case 3.         labelComparisonMethod => 'direct',                      goldKeyDataSource => 'wordnet'

                a) In this case, too, the user only needs to provide the mapping between the clusters and Goldkeys.
                b) The user does not need to provide the data about these gold standard keys.
                
                                For example:
                                        Cluster0:::Tony Blair  
                                        Cluster1:::Vladimir Putin 
                                        Cluster2:::Saddam Hussein
                
                
        Case 4.         labelComparisonMethod => 'automate',                    goldKeyDataSource => 'userData'
                
                        a) No mapping between the clusters and Goldkeys is provided.
                        b) The user only needs to provide the data about these gold standard keys.
                           
                           
                                For example:
                                Tony Blair::: Anthony Charles Lynton Blair (born 6 May 1953)[1] is a British Labour Party politician who served 
                                as the Prime Minister of the United Kingdom from 1997 to 2007. He was the Member of Parliament (MP) for Sedgefield 
                                from 1983 to 2007 and Leader of the Labour Party from 1994 to 2007. He resigned from all of these positions in 
                                June 2007.
                                
                                Vladimir Putin::: Vladimir Vladimirovich Putin (born 7 October 1952) is a Russian politician
                                who has been the President of Russia since 7 May 2012. Putin previously served as President from 2000 to 2008, and  
                                as Prime Minister of Russia from 1999 to 2000 and again from 2008 to 2012. Putin was also previously the Chairman  
                                of United Russia.
                                
                                Saddam Hussein::: Saddam Hussein Abd al-Majid al-Tikriti (28 April 1937[2] – 30 December 2006)[3] was the fifth
                                President of Iraq, serving in this capacity from 16 July 1979 until 9 April 2003.[4][5] A leading member of the 
                                revolutionary Arab Socialist Ba'ath Party.
                
                
        Case 5.         labelComparisonMethod => 'automate',                    goldKeyDataSource => 'wikipedia'
        
                                a) No mapping between the clusters and Goldkeys is provided.
                                b) The user only needs to provide the comma-separated gold standard keys.
                                        
                                For example:
                                        Tony Blair , Vladimir Putin, Saddam Hussein
                                
                                
                                
        Case 6.         labelComparisonMethod => 'automate',                    goldKeyDataSource => 'wordnet'

                                a) No mapping between the clusters and Goldkeys is provided.
                                b) The user only needs to provide the comma-separated gold standard keys.

                                        
                                For example:
                                        Tony Blair , Vladimir Putin, Saddam Hussein


                Sample files for all the cases are included in the test section of the module.
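
A rough sketch of how a gold-key file in the Case 1 layout ('direct' with 'userData') could be split into a cluster mapping and per-key data using the ":::" separator. The function name parse_gold_key_text is hypothetical, the sample text is shortened for illustration, and the module's real parser may behave differently.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (\%mapping, \%data) where
#   $mapping{Cluster<N>} = gold key, and $data{gold key} = its text.
sub parse_gold_key_text {
    my ($text) = @_;
    my (%mapping, %data);
    my $current;    # gold key whose wrapped data lines are being collected

    for my $line (split /\n/, $text) {
        if ($line =~ /^\s*(Cluster\d+)\s*:::\s*(.*?)\s*$/) {
            # "Cluster0:::Tony Blair" is a mapping line.
            $mapping{$1} = $2;
            undef $current;
        }
        elsif ($line =~ /^\s*(.+?)\s*:::\s*(.*?)\s*$/) {
            # "Tony Blair::: some text" starts the data for a gold key.
            $current = $1;
            $data{$current} = $2;
        }
        elsif ($line =~ /\S/ && defined $current) {
            # A wrapped continuation line of the current key's data.
            (my $more = $line) =~ s/^\s+|\s+$//g;
            $data{$current} .= " $more";
        }
        else {
            # A blank line ends the current entry.
            undef $current;
        }
    }
    return (\%mapping, \%data);
}

my $text = <<'END';
Cluster0:::Tony Blair
Cluster1:::Vladimir Putin

Tony Blair::: Anthony Charles Lynton Blair is a British
Labour Party politician.

Vladimir Putin::: Vladimir Vladimirovich Putin is a Russian politician.
END

my ($mapping, $data) = parse_gold_key_text($text);
print "$mapping->{Cluster1}\n";
print "$data->{'Tony Blair'}\n";
```

For Cases 2 and 3 only the mapping lines would be present, and for Cases 5 and 6 the file is a single comma-separated line of keys, which a plain split on commas would handle.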
                

RESULT:

   a) Contingency Matrix:       
                 Based on the similarity comparison of the labels with the gold standards,
                 the contingency matrix is generated. The following shows the
                 contingency matrix for the example given in the synopsis:


                Original Contingency Matrix: 
                 
                                        Bill Clinton            Tony Blair  
                -------------------------------------------------
                 Cluster0                       54                              48
                -------------------------------------------------
                 Cluster1                       31                              16
                ------------------------------------------------- 
        
        b) The Hungarian algorithm is used to produce a new contingency matrix
                whose diagonal elements indicate the assigned similarity-score
                between a cluster and a gold-standard key. This arrangement of the
                matrix has the maximum possible diagonal total.
        
                Example:
                
                Contigency Matrix after Hungarian Algorithm: 
                 
                                                Tony Blair      Bill Clinton  
                -------------------------------------------------
                 Cluster0                       48                              54
                -------------------------------------------------
                 Cluster1                       16                              31
                -------------------------------------------------
        

        c) Conclusion: Displays the conclusion of the Hungarian algorithm:
                        
                        Example:
                        
                        Final Conclusion using Hungarian Algorithm::
                                Cluster0        <-->    Tony  Blair 
                                Cluster1        <-->    Bill Clinton  
        
        
        d) Displaying the overall accuracy for the label assignment:
                
                                           Sum (Diagonal Scores)
                        Accuracy = -----------------------------------------
                                   Sum (All the Scores of contingency table)
                        
                        Example:                                
                        Accuracy of labels is 53.02%                    
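
The numbers in the example above can be checked with a short sketch. It tries every cluster-to-key assignment by brute force, which is a stand-in for the Hungarian algorithm and only practical for a handful of clusters; the helper name permutations is an assumption, and the matrix is assumed to be square.

```perl
use strict;
use warnings;

# Similarity scores from the example: rows are clusters, columns are gold keys.
my @row_header = ('Cluster0', 'Cluster1');
my @col_header = ('Bill Clinton', 'Tony Blair');
my @matrix     = ([54, 48], [31, 16]);

# Hypothetical helper: all permutations of a list (brute force).
sub permutations {
    my @items = @_;
    return ([]) unless @items;
    my @result;
    for my $i (0 .. $#items) {
        my @rest = @items;
        my ($picked) = splice(@rest, $i, 1);
        push @result, [$picked, @$_] for permutations(@rest);
    }
    return @result;
}

# Find the column assignment with the largest "diagonal" total.
my ($best_sum, $best_perm) = (-1, undef);
for my $perm (permutations(0 .. $#col_header)) {
    my $sum = 0;
    $sum += $matrix[$_][ $perm->[$_] ] for 0 .. $#matrix;
    ($best_sum, $best_perm) = ($sum, $perm) if $sum > $best_sum;
}

# Sum of every score in the contingency table.
my $total = 0;
$total += $_ for map { @$_ } @matrix;

my $accuracy = 100 * $best_sum / $total;

printf "%s <--> %s\n", $row_header[$_], $col_header[ $best_perm->[$_] ]
    for 0 .. $#matrix;
printf "Accuracy of labels is %.2f%%\n", $accuracy;
```

The best assignment pairs Cluster0 with Tony Blair and Cluster1 with Bill Clinton, giving (48 + 31) / 149, which is the 53.02% reported above.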

Help

The LabelEvaluation module expects an options hash as its required argument.

The options hash has the following elements:

1. labelFile (hash key: senseClusterLabelFileName): Name of the file containing the labels from SenseClusters. The syntax of the file must match that of a label file from SenseClusters. This is a mandatory parameter.

2. labelComparisonMethod: Name of the method for comparing the labels with the GoldKey. It tells the program whether the key file provided by the user contains the mapping between the assigned labels and the expected topics of the clusters. Possible options are: A) 'DirectAssignment' and B) 'AutomateAssignment'. This is a mandatory parameter.

3. goldKeyFile (hash key: goldKeyFileName): Name of the file containing the actual topics (keys) and their data for the clusters. This is a mandatory parameter.

4. goldKeyLength: Length of the data to fetch from an external resource such as Wikipedia. The data is used as reference data. By default, the first section of the Wikipedia page is used.

5. goldKeyDataSource: Name of the external application or user-supplied file from which the keys' data is obtained. Currently supported options are: 1. 'Wikipedia' 2. 'User' 3. 'Wordnet' (will be supported in the future). This is a mandatory parameter.

6. weightRatio: Weight given to discriminating labels relative to descriptive labels. Default value is 10.

7. stopList (hash key: stopListFileLocation): Name of the file that contains the list of all stop words. This parameter is optional, and its formatting should match the requirements of Text::Similarity, i.e. a single stop word per line. For example:

        Content of stoplist.txt should look like:
                        the
                        of
                        in
                        :
                        :
                        to
        

8. isClean: Decides whether to keep or delete temporary files. Default value is 'true'.

9. verbose: Decides whether to show detailed results to the user. Default value is off (0); set it to 1 to turn it on.

10. help: Decides whether to display help to the user. Default value is 0.

        %inputOptions = (
                senseClusterLabelFileName => '<filelocation>/<SenseClusterLabelFileName>', 
                labelComparisonMethod => 'DirectAssignmentOrAutomateAssignment',
                goldKeyFileName => '<filelocation>/<ActualTopicName>',
                goldKeyLength => '<LengthOfDataFetchedFromExternalResource>',
                goldKeyDataSource => '<NameOfSourceFromWhichTopicDataBeFetched>',
                weightRatio => '<WeightageRatioOfDiscriminatingToDescriptiveLabel>',
                stopListFileLocation => '<filelocation>/<StopListFileLocation>',
                isClean => 1,
                verbose => 0,
                help => 0
        );

Examples

1. With minimum parameters:

                %inputOptions = (
                        senseClusterLabelFileName => 'labelFile.txt', 
                        labelComparisonMethod => 'DirectAssignment',
                        goldKeyFileName => 'goldKeyFile.txt',
                        goldKeyDataSource => 'UserData'
                );
        

The four parameters above are mandatory.
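
As an illustration, a caller could check for the mandatory keys before constructing the driver. This is a minimal caller-side sketch; the helper missing_options is hypothetical and not part of the module, which reports problems through its own errorCode field.

```perl
use strict;
use warnings;

# The four mandatory option names, as used in the hash examples above.
my @MANDATORY = qw(
    senseClusterLabelFileName
    labelComparisonMethod
    goldKeyFileName
    goldKeyDataSource
);

# Hypothetical helper: names of mandatory options that are missing or empty.
sub missing_options {
    my ($options) = @_;
    return grep { !defined $options->{$_} || $options->{$_} eq '' } @MANDATORY;
}

# goldKeyDataSource is deliberately left out here.
my %inputOptions = (
    senseClusterLabelFileName => 'labelFile.txt',
    labelComparisonMethod     => 'DirectAssignment',
    goldKeyFileName           => 'goldKeyFile.txt',
);

my @missing = missing_options(\%inputOptions);
print "Missing mandatory options: @missing\n" if @missing;
```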

2. For Help:

                %inputOptions = (
                        help => 1
                );
                

3. With all parameters:

        %inputOptions = (
                senseClusterLabelFileName => 'labelFile.txt', 
                labelComparisonMethod => 'AutomateAssignment',
                goldKeyFileName => 'goldKeyFile.txt',
                goldKeyLength => 2000,
                goldKeyDataSource => 'Wikipedia',
                weightRatio => 10,
                stopListFileLocation => 'stoplist.txt',
                isClean => 1,
                verbose => 1,
                help => 0
        );
        

Constructor: new()

This is the constructor, which creates an object of this class. Reference: http://perldoc.perl.org/perlobj.html

The constructor takes a reference to the options hash and initializes the object from it.

                %inputOptions = (
                        senseClusterLabelFileName => 'value1', 
                        labelComparisonMethod => 'value2',
                        goldKeyFileName => 'value3',
                        goldKeyLength => value4,
                        goldKeyDataSource => 'value5',
                        weightRatio => value6,
                        stopListFileLocation => 'value7',
                        isClean => value8,
                        verbose => value9,
                        help => value10
                );
                

Please refer to the "Help" section for a detailed discussion of this hash.

Function: evaluateLabels

This function is responsible for evaluating the labels of the clusters. It calls the other modules to complete the process.

@argument : $driverObject : Object of the current class.

@return : $accuracy : DataType(Float) Indicates the overall accuracy of the assignments.

@description :

                The overall algorithm for calculating the accuracy of the label assignment with the help of gold
                standard keys is:
                
                Step 1: Read the clusters and their labels information from the ClusterLabel file.
                        
                Case A: User has provided the mapping information about the cluster and gold standard key.
                                Step 2:Read Clusters-Topics mapping information.
                                
                                Subcase1: User provides data for gold standard keys.
                                                        
                                                        Step 3:Read the gold standard keys and their data from the file provided by user.
                                                        Step 4: Nothing to fetch; continue with step 5.
                                                        
                                Subcase2: User provides the gold standard keys. We will fetch data from Wikipedia.
                                                   The user provides the mapping and the keys, but no data about the topics.
                                                   
                                                        Step 3:Read gold standard keys from the file provided by user.
                                                        Step 4:Read data about the gold standard keys from the Wikipedia.
                                                        
                                Subcase3: User provides the gold standard keys. We will fetch data from Wordnet.
                                
                                                        Step 3:Read gold standard keys from the file provided by user.
                                                        Step 4:Read data about the gold standard keys from the Wordnet.
                                
                                Step 5: Create the contingency matrix with similarity-scores of each cluster's labels against each
                                                 gold standard key's data (obtained from steps 3 and 4).
                                Step 6: Use the mapping provided by the user (step 2) to calculate the diagonal score for the
                                                 contingency matrix.
                                Step 7: Calculate the overall accuracy of the current cluster label assignment as:

                                                           Sum (Diagonal Scores)
                                        Accuracy = -----------------------------------------
                                                   Sum (All the Scores of contingency table)
                                                                                        
                Case B: User has not provided the mapping information about the cluster and gold standard key.
                                 We will use the Hungarian algorithm to compute the mapping.
                                        
                                Subcase1: User provides data for gold standard keys.
                                                        
                                                        Step 2: Read the gold standard keys and their data from the file provided by user.
                                                        Step 3: Nothing to fetch; continue with step 4.
                                                        
                                Subcase2: User provides the gold standard keys. We will fetch data from Wikipedia.
                                                   The user provides only the keys: no data about the topics and no mapping.
                                                        
                                                        Step 2: Read gold standard keys from the file provided by user.
                                                        Step 3: Read data about the gold standard keys from the Wikipedia.
                                                        
                                Subcase3: User provides the gold standard keys. We will fetch data from Wordnet.
                                
                                                        Step 2: Read gold standard keys from the file provided by user.
                                                        Step 3: Read data about the gold standard keys from the Wordnet.

                                Step 4: Create the contingency matrix with similarity-scores of each cluster's labels against each
                                                 gold standard key's data (obtained from steps 2 and 3).
                                Step 5: Use the Hungarian algorithm to determine the mapping of clusters to gold standard keys.
                                Step 6: Use the above mapping to calculate the total diagonal score for the new contingency matrix.
                                Step 7: Calculate the overall accuracy of the current cluster label assignment as:

                                                           Sum (Diagonal Scores)
                                        Accuracy = -----------------------------------------
                                                   Sum (All the Scores of contingency table)

Function: makeContigencyMatrix

This method is responsible for making the contingency matrix containing the similarity-scores of the labels with the data of the gold standard keys.

@argument : $labelSenseClustersHashRef (Hash containing the labels generated by SenseClusters)

@argument : $topicDataHashRef (Hash containing the data of the gold standard keys)

@argument : $weightageRatio (Weight to be given to discriminating labels over descriptive labels of SenseClusters)

@return :

        1. @matrixScore - Contingency matrix containing the similarity-scores.
        2. @colHeader - Array containing the column header for the contingency matrix.
        3. @rowHeader - Array containing the row header for the contingency matrix.
        4. $totalMatrixScore - Total similarity score of the contingency matrix.

@description :

        1). It iterates through the hash (%labelSenseClustersHash) and extracts the descriptive and discriminating labels for each cluster.
        2). It reads the data about each gold standard key from the hash (%topicDataHash).
        3). It then uses the module Text::SenseClusters::LabelEvaluation::SimilarityScore to get the various similarity scores.
        4). Finally, it uses the raw-lesk scores to prepare the contingency matrix.
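
The shape of this computation can be sketched as follows. Two things here are assumptions made purely for illustration: overlap_score is a simple word-overlap stand-in for the raw Lesk score that the module obtains from its SimilarityScore class, and the way the weight ratio is applied (descriptive score plus ratio times discriminating score) is a guess, not the documented formula. The sample labels and texts are made up.

```perl
use strict;
use warnings;

# Stand-in similarity (assumption): number of distinct label words that
# also occur in the text. The module itself uses raw Lesk scores.
sub overlap_score {
    my ($labels, $text) = @_;
    my %in_text = map { (lc($_) => 1) } ($text =~ /\w+/g);
    my %label_words;
    for my $label (@$labels) {
        $label_words{ lc $_ } = 1 for $label =~ /\w+/g;
    }
    return scalar grep { $in_text{$_} } keys %label_words;
}

# Hypothetical sketch of building the matrix; returns
# (\@matrix, \@col_header, \@row_header, $total).
sub make_matrix {
    my ($labels_ref, $topic_data_ref, $weight_ratio) = @_;
    my @row_header = sort keys %$labels_ref;
    my @col_header = sort keys %$topic_data_ref;
    my @matrix;
    my $total = 0;
    for my $r (0 .. $#row_header) {
        my $cluster = $labels_ref->{ $row_header[$r] };
        for my $c (0 .. $#col_header) {
            my $text = $topic_data_ref->{ $col_header[$c] };
            # Assumed weighting: discriminating labels count $weight_ratio times.
            my $score = overlap_score($cluster->{Descriptive} || [], $text)
                + $weight_ratio
                * overlap_score($cluster->{Discriminating} || [], $text);
            $matrix[$r][$c] = $score;
            $total += $score;
        }
    }
    return (\@matrix, \@col_header, \@row_header, $total);
}

my %labels = (
    Cluster0 => {
        Descriptive    => ['Prime Minister', 'White House'],
        Discriminating => ['Labour Party'],
    },
    Cluster1 => {
        Descriptive    => ['Prime Minister'],
        Discriminating => ['President Russia'],
    },
);
my %topics = (
    'Tony Blair'     => 'British Labour Party politician who served as Prime Minister',
    'Vladimir Putin' => 'Russian politician, President of Russia and former Prime Minister',
);

my ($matrix, $cols, $rows, $total) = make_matrix(\%labels, \%topics, 10);
for my $r (0 .. $#$rows) {
    print join("\t", $rows->[$r], @{ $matrix->[$r] }), "\n";
}
print "Total: $total\n";
```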
        

Function: calculateAccuracy

This method calculates the accuracy score for the labels generated by SenseClusters or other sources.

@argument1 : $mappingHashRef (Reference to the hash which contains the mapping information about the clusters and gold standard keys)

@argument2 : $matrixScoreRef (2-D array/matrix which contains the similarity-scores of each label)

@argument3 : $colHeaderRef (Reference to the array which contains the column header)

@argument4 : $rowHeaderRef (Reference to the array which contains the row header)

@argument5 : $totalMatrixScore (Total similarity score of the labels with the gold standard)

@return : The overall accuracy of the labels assigned by SenseClusters.

@description :

                1). With the help of ($mappingHashRef, $matrixScoreRef, $colHeaderRef, $rowHeaderRef),
                    this function calculates the sum of all diagonal elements.
                2). It then calculates the accuracy of the assignment as

                                           Sum (Diagonal Scores)
                                Accuracy = ---------------------
                                           Sum (All the Scores)
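
A minimal sketch of this calculation, using the argument list described above. The function name accuracy_from_mapping and its body are illustrative assumptions rather than the module's implementation; the scores are taken from the contingency matrix in the RESULT section, while the mapping is a hypothetical user-supplied one.

```perl
use strict;
use warnings;

# Hypothetical sketch: accuracy (as a percentage) for a given
# cluster-to-gold-key mapping.
sub accuracy_from_mapping {
    my ($mapping, $matrix, $col_header, $row_header, $total) = @_;

    # Look up each gold key's column position by name.
    my %col_index;
    @col_index{@$col_header} = 0 .. $#$col_header;

    # Add up the score of each cluster against its mapped gold key.
    my $diagonal = 0;
    for my $r (0 .. $#$row_header) {
        my $key = $mapping->{ $row_header->[$r] };
        next unless defined $key && exists $col_index{$key};
        $diagonal += $matrix->[$r][ $col_index{$key} ];
    }
    return $total ? 100 * $diagonal / $total : 0;
}

my %mapping    = (Cluster0 => 'Bill Clinton', Cluster1 => 'Tony Blair');
my @matrix     = ([54, 48], [31, 16]);
my @col_header = ('Bill Clinton', 'Tony Blair');
my @row_header = ('Cluster0', 'Cluster1');

my $accuracy = accuracy_from_mapping(\%mapping, \@matrix, \@col_header, \@row_header, 149);
printf "Accuracy of labels is %.2f%%\n", $accuracy;
```

With this mapping the diagonal total is 54 + 16 = 70 out of 149, lower than the 79 obtained from the assignment chosen by the Hungarian algorithm.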
                                                                

BUGS

  • Comparison against WordNet gold standards is not currently supported.

SEE ALSO

http://senseclusters.cvs.sourceforge.net/viewvc/senseclusters/LabelEvaluation/

Last modified by : $Id: Driver.pm,v 1.3 2013/03/07 23:22:22 jhaxx030 Exp $

AUTHORS

        Anand Jha, University of Minnesota, Duluth
        jhaxx030 at d.umn.edu

        Ted Pedersen, University of Minnesota, Duluth
        tpederse at d.umn.edu

COPYRIGHT AND LICENSE

Copyright (C) 2012-2013 Ted Pedersen, Anand Jha

See http://dev.perl.org/licenses/ for more information.

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to:

        The Free Software Foundation, Inc., 59 Temple Place, Suite 330, 
        Boston, MA  02111-1307  USA
        
        
