NAME

report.pl - Summarize SenseClusters results with precision, recall, and confusion matrix

SYNOPSIS

 report.pl [OPTIONS] LABEL PRELABEL

Type report.pl --help for a quick summary of the options.

DESCRIPTION

Reports the performance of sense discrimination as precision, recall, and a confusion table.

INPUT

Required Arguments:

LABEL

The output of label.pl, showing the sense labels attached to the discovered clusters.

Sample LABEL files:

1. At a minimum, report.pl expects LABEL in this format:
 C0 -> fine%5:00:00:elegant:00
 C1 -> fine%3:00:00::
 C2 -> fine%5:00:00:superior:02
 C3 -> fine%5:00:00:satisfactory:00
 C4 -> fine%5:00:00:thin:01

report.pl reads only those lines of the LABEL file that contain a right arrow (->); all other lines are ignored.

Each line containing '->' should show the cluster id to the left of the arrow and a sense tag to the right.

2.
 ClusterID -> SenseID
 0 -> fine%5:00:00:elegant:00
 1 -> fine%3:00:00::
 2 -> fine%5:00:00:superior:02
 3 -> fine%5:00:00:satisfactory:00
 4 -> fine%5:00:00:thin:01
 Score = 60.00

This is the actual output of label.pl, which contains a descriptive header on the first line and the score of the mapping scheme on the last line.
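
The line-filtering rule described above can be sketched in a few lines of Python. This is only an illustration, not report.pl's own code: parse_label is a hypothetical helper, and skipping the "ClusterID -> SenseID" header is an assumption made here because that header also contains an arrow.

```python
def parse_label(text):
    """Return {cluster id: sense tag} from the text of a LABEL file.

    Hypothetical helper for illustration; not part of SenseClusters.
    """
    mapping = {}
    for line in text.splitlines():
        if "->" not in line:
            continue  # lines without an arrow are ignored, e.g. "Score = 60.00"
        left, right = (part.strip() for part in line.split("->", 1))
        if left == "ClusterID":
            continue  # assumption: the descriptive header is not a mapping
        mapping[left] = right
    return mapping


sample = """ClusterID -> SenseID
0 -> fine%5:00:00:elegant:00
1 -> fine%3:00:00::
Score = 60.00
"""
labels = parse_label(sample)
print(labels)
```

The same function accepts the minimal format, in which case the keys are C0, C1, and so on.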

PRELABEL

The output of cluto2label.pl, showing how the instances of each sense class are distributed over the clusters.

This distribution should be given as a cluster by sense matrix in which the rows represent the clusters and the columns represent the senses. The cell entry at CS[i][j] is the number of instances in cluster Ci that have the true sense tag Sj.

e.g.

 0
 //phone        cord    txt     div     form
 0              2       1       0       0
 0              1       2       4       0
 5              2       2       25      4
 0              1       9       0       0
 0              1       0       1       0

Note that:

1. The 1st line shows the number of unclustered instances.
2. The 2nd line starts with // and shows the sense labels of the corresponding columns.
3. The 3rd and following lines show the cluster by sense distribution matrix.
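
The three-part layout described above can be read with a short Python sketch. It is an illustration under stated assumptions rather than the parser report.pl uses: parse_prelabel is a hypothetical name, fields are assumed to be whitespace-separated, and blank lines are assumed to carry no meaning.

```python
def parse_prelabel(text):
    """Return (unclustered, senses, matrix) from the text of a PRELABEL file.

    Hypothetical helper for illustration; not part of SenseClusters.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    unclustered = int(lines[0])  # 1st line: number of unclustered instances
    if not lines[1].startswith("//"):
        raise ValueError("2nd line must start with // and list the sense labels")
    senses = lines[1][2:].split()  # 2nd line: column labels
    matrix = [[int(cell) for cell in ln.split()] for ln in lines[2:]]
    for row in matrix:
        if len(row) != len(senses):
            raise ValueError("each row needs one count per sense label")
    return unclustered, senses, matrix


sample = """
 0
 //phone        cord    txt     div     form
 0              2       1       0       0
 0              1       2       4       0
 5              2       2       25      4
 0              1       9       0       0
 0              1       0       1       0
"""
unclustered, senses, matrix = parse_prelabel(sample)
print(unclustered, senses, len(matrix))
```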

Optional Arguments:

Other Options:

--help

Displays a quick summary of the options.

--version

Displays the version information.

OUTPUT

The output is a confusion table whose rows represent the discovered clusters and whose columns represent the actual sense classes. The cell value at (i,j) is the number of instances in the cluster of the ith row whose true sense id is the label at the top of the jth column. Columns are reordered so that the sense in the rth column is the one that most accurately represents the cluster in the rth row. The diagonal value at (r,r) is therefore the number of instances in that cluster that belong to their correct sense class.

When #clusters > #senses, clusters that are not assigned a sense tag are marked with a star (*). When #senses > #clusters, senses that are not assigned to any cluster are marked with a hash (#).
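
The arrangement described above can be sketched as follows. This is a hedged illustration, not the code of report.pl: arrange is a hypothetical function, the cluster-to-sense mapping is assumed to be one-to-one and given by index, and the row order (labelled clusters first, in ascending order) is only inferred from the sample output in this document.

```python
def arrange(matrix, mapping):
    """Reorder a cluster by sense matrix so mapped pairs lie on the diagonal.

    matrix  -- list of rows; matrix[c][s] = instances of sense s in cluster c
    mapping -- {cluster index: sense index}, assumed one-to-one
    """
    labelled = sorted(mapping)
    # Labelled clusters come first; clusters without a sense follow.
    row_order = labelled + [c for c in range(len(matrix)) if c not in mapping]
    used = set(mapping.values())
    # Column r holds the sense mapped to row r; unassigned senses follow.
    col_order = [mapping[c] for c in labelled]
    col_order += [s for s in range(len(matrix[0])) if s not in used]
    row_names = [f"C{c}:" + ("" if c in mapping else "*") for c in row_order]
    col_names = [f"S{s}" + ("" if s in used else "#") for s in col_order]
    table = [[matrix[c][s] for s in col_order] for c in row_order]
    return row_names, col_names, table


# Three clusters, two senses: cluster 2 is left without a sense.
row_names, col_names, table = arrange([[1, 9], [8, 2], [3, 3]], {0: 1, 1: 0})
hits = sum(table[r][r] for r in range(2))  # diagonal over the labelled rows
print(row_names, col_names, table, hits)
```

Placing the labelled clusters first keeps every mapped pair on the diagonal, so the hits can be read off as a plain diagonal sum.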

The sum of the diagonal entries is the total number of instances that are correctly discriminated (#hits). From this number, report.pl computes precision and recall, where

 precision = #hits / #clustered

#clustered = number of instances clustered = total #instances - #instances that belong to the unlabelled clusters - #thrown, where #thrown is the number of unclustered instances given on the first line of the PRELABEL input file.

 recall = #hits / #total instances
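
As a check on the two formulas, the Python sketch below recomputes the figures of the sample output in this document. The function name and its parameters are hypothetical. It also assumes that the total number of instances is the matrix total plus #thrown, which is how the "4378+0" in the sample output reads. With #thrown = 0 that assumption does not affect the result.

```python
def precision_recall(hits, matrix_total, unlabelled, thrown=0):
    """Return (precision, recall) as percentages.

    hits         -- sum of the diagonal entries
    matrix_total -- sum of all entries in the confusion table
    unlabelled   -- instances in clusters that received no sense tag
    thrown       -- unclustered instances (1st line of PRELABEL)
    """
    total = matrix_total + thrown  # assumption: thrown instances are not in the table
    clustered = total - unlabelled - thrown
    return 100.0 * hits / clustered, 100.0 * hits / total


# Sample output: the starred rows C2, C3, C4, C5, C7 and C8 are unlabelled.
unlabelled = 138 + 148 + 182 + 164 + 396 + 203
precision, recall = precision_recall(hits=1162, matrix_total=4378,
                                     unlabelled=unlabelled, thrown=0)
print(f"Precision = {precision:.2f}")
print(f"Recall = {recall:.2f}")
```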

Sample Output:

        S1      S2      S0      S3      TOTAL
 C0:     221     11      3       15      250     (5.71)
 C1:     295     395     448     144     1282    (29.28)
 C6:     430     233     441     68      1172    (26.77)
 C9:     145     44      149     105     443     (10.12)
 C2:*    0       1       135     2       138     (3.15)
 C3:*    138     4       4       2       148     (3.38)
 C4:*    0       0       182     0       182     (4.16)
 C5:*    2       6       150     6       164     (3.75)
 C7:*    41      159     99      97      396     (9.05)
 C8:*    0       0       203     0       203     (4.64)
        1272    853     1814    439     4378
        (29.05) (19.48) (41.43) (10.03)
 Precision = 36.92(1162/3147)
 Recall = 26.54(1162/4378+0)

 Legend of Sense Tags
 S0 = SERVE10
 S1 = SERVE12
 S2 = SERVE2
 S3 = SERVE6

This output shows:

1. 10 clusters (C0-C9) and 4 senses (S0-S3).
2. Cluster C0 represents sense S1, which stands for the actual sense SERVE12;
 C1 represents S2 (stands for SERVE2),
 C6 represents S0 (stands for SERVE10), and
 C9 represents S3 (stands for SERVE6).
3. This maximal mapping gives a precision of 36.92% and a recall of 26.54%: 1162 of the 4378 instances are correctly discriminated, and 3147 instances belong to labelled clusters.
4. The last two columns show the total number and percentage of instances in each cluster (row marginal totals), while the last two rows show the total number and percentage of instances in each sense class (column marginal totals).
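
The marginal totals and percentages in the sample output can be reproduced with a few lines of Python. The variable names are illustrative only, and the counts are copied from the sample table.

```python
# Rows of the sample confusion table, in the order S1, S2, S0, S3.
rows = {
    "C0": [221, 11, 3, 15],
    "C1": [295, 395, 448, 144],
    "C6": [430, 233, 441, 68],
    "C9": [145, 44, 149, 105],
    "C2": [0, 1, 135, 2],
    "C3": [138, 4, 4, 2],
    "C4": [0, 0, 182, 0],
    "C5": [2, 6, 150, 6],
    "C7": [41, 159, 99, 97],
    "C8": [0, 0, 203, 0],
}

row_totals = {name: sum(cells) for name, cells in rows.items()}
col_totals = [sum(col) for col in zip(*rows.values())]
grand_total = sum(row_totals.values())

# Percentages are taken relative to the grand total.
row_pct = {name: round(100.0 * t / grand_total, 2) for name, t in row_totals.items()}
col_pct = [round(100.0 * t / grand_total, 2) for t in col_totals]

print(row_totals["C0"], row_pct["C0"])
print(col_totals, col_pct)
```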

AUTHORS

 Ted Pedersen, University of Minnesota, Duluth

 Amruta Purandare, University of Pittsburgh

 Anagha Kulkarni, Carnegie Mellon University

COPYRIGHT

Copyright (c) 2002-2008, Ted Pedersen, Amruta Purandare, Anagha Kulkarni

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to

 The Free Software Foundation, Inc.,
 59 Temple Place - Suite 330,
 Boston, MA  02111-1307, USA.