NAME

cluto2label.pl - Convert Cluto output to a confusion matrix

SYNOPSIS

 cluto2label.pl [OPTIONS] CLUTO KEY

DESCRIPTION

Converts Cluto's clustering solution file into a cluster-by-sense distribution matrix, which can then be given as input to the SenseClusters evaluation program label.pl.
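
The conversion described above amounts to counting, for each cluster, how often each true sense occurs among its instances. The following is a minimal Python sketch of that idea, not the script's actual implementation. The function name cluster_by_sense and the toy data are hypothetical, and counting every listed sense once for a multi-sense instance is an assumption; the exact counting rule and output layout of cluto2label.pl are not specified here.

```python
from collections import Counter, defaultdict


def cluster_by_sense(cluster_ids, sense_lists):
    """Count sense labels per cluster (hypothetical helper, not part of SenseClusters).

    cluster_ids: one cluster id per instance.
    sense_lists: one list of true sense ids per instance, in the same order.
    Assumption: an instance with several senses adds 1 to each of them.
    """
    if len(cluster_ids) != len(sense_lists):
        raise ValueError("CLUTO and KEY must describe the same number of instances")
    matrix = defaultdict(Counter)
    for cluster_id, senses in zip(cluster_ids, sense_lists):
        for sense in senses:
            matrix[cluster_id][sense] += 1
    return matrix


# Toy data, invented for illustration only.
toy_clusters = [0, 1, 1, 0]
toy_senses = [["s1"], ["s2"], ["s2"], ["s1", "s2"]]

matrix = cluster_by_sense(toy_clusters, toy_senses)
for cluster_id in sorted(matrix):
    print(cluster_id, dict(matrix[cluster_id]))
```

Rows of the resulting structure correspond to clusters and columns to senses, which is the shape an evaluation step would need in order to map clusters to senses.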

INPUT

Required Arguments:

CLUTO

The first argument should be a clustering solution file (described in section 3.4.1 on page 34 of Cluto's manual) as created by Cluto's scluster and vcluster programs.

For N instances, the CLUTO file will have exactly N lines; the ith line shows the number of the cluster (numbered from 0) to which the ith instance belongs.

e.g.

Cluto's clustering solution file =>

 0
 1
 1
 2
 0
 0
 1
 2

shows the cluster ids of each of the 8 instances clustered by Cluto's program.

 The 1st, 5th and 6th instances belong to the 1st cluster (Cluster No 0)

 The 2nd, 3rd and 7th instances belong to the 2nd cluster (Cluster No 1)

And

 The 4th and 8th instances belong to the 3rd cluster (Cluster No 2)

Note: a cluster id may be -1, which means that the corresponding instance is not assigned to any cluster.
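
The CLUTO file layout described above can be read with a few lines of code. This is a minimal Python sketch under the stated format (one integer cluster id per line, possibly -1); the function name read_cluto_solution is hypothetical and is not part of Cluto or SenseClusters.

```python
from collections import defaultdict


def read_cluto_solution(lines):
    """Map each cluster id to the 1-based positions of its instances.

    Each line must hold exactly one integer; -1 marks an unclustered instance.
    """
    clusters = defaultdict(list)
    for position, line in enumerate(lines, start=1):
        clusters[int(line.strip())].append(position)
    return dict(clusters)


# The 8-line solution file shown in the example above.
solution = ["0", "1", "1", "2", "0", "0", "1", "2"]
clusters = read_cluto_solution(solution)
for cluster_id in sorted(clusters):
    print(cluster_id, clusters[cluster_id])
```

For a real file, the same function can be called on an open file handle, since iterating over a handle yields one line at a time.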

KEY

The second argument should be a KEY file (in SenseClusters format) showing the true sense class labels of the instances listed in CLUTO.

For N lines in the CLUTO file, KEY should have exactly N lines. The ith line in KEY should, at a minimum, show a space-separated list of the true sense labels of the ith instance in the following format -

        <sense id="S"/>+

e.g.

 <sense id="art2"/> <sense id="art4"/>
 <sense id="art1"/>
 <sense id="art3"/><sense id="art4"/>
 <sense id="art3"/>
 <sense id="art4"/> <sense id="art1"/>
 <sense id="art1"/>
 <sense id="art5"/> <sense id="art2"/> <sense id="art3"/>
 <sense id="art2"/> <sense id="art4"/>

shows the true sense ids of the 8 instances in the example CLUTO file described above.

If KEY is an actual KEY file created by SenseClusters programs, it will also show the instance id of the corresponding instance at the beginning of each line.

e.g.

 <instance id="line-n.w7_098:6515:"/> <sense id="art2"/> <sense id="art4"/>

 <instance id="line-n.w8_083:14771:"/> <sense id="art1"/>

 <instance id="line-n.art} aphb 02700649:"/> <sense id="art3"/><sense id="art4"/>

 <instance id="line-n.art} aphb 53900889:"/> <sense id="art3"/>

 <instance id="line-n.w7_066:11025:"/> <sense id="art4"/> <sense id="art1"/>

 <instance id="line-n.art} aphb 42100373:"/> <sense id="art1"/>

 <instance id="line-n.w8_109:8774:"/> <sense id="art5"/> <sense id="art2"/> <sense id="art3"/>

 <instance id="line-n.w7_004:10784:"/> <sense id="art2"/> <sense id="art4"/>
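
Both KEY line variants shown above (with and without a leading instance tag) can be handled by the same pattern matching. The sketch below is a minimal Python illustration, assuming the tags appear exactly in the form shown in the examples; parse_key_line and the two regular expressions are hypothetical names, not part of SenseClusters.

```python
import re

# Patterns for the two tag types shown in the KEY examples.
SENSE_RE = re.compile(r'<sense\s+id="([^"]*)"\s*/>')
INSTANCE_RE = re.compile(r'<instance\s+id="([^"]*)"\s*/>')


def parse_key_line(line):
    """Return (instance id or None, list of sense ids) for one KEY line."""
    instance = INSTANCE_RE.search(line)
    senses = SENSE_RE.findall(line)
    return (instance.group(1) if instance else None, senses)


print(parse_key_line('<sense id="art3"/><sense id="art4"/>'))
print(parse_key_line('<instance id="line-n.w8_083:14771:"/> <sense id="art1"/>'))
```

Matching on the tags rather than splitting on spaces keeps the parsing correct for lines where sense tags are not separated by whitespace and for instance ids that themselves contain spaces.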

Optional Arguments:

--numthrow N

Ignores clusters containing fewer than N instances.

--perthrow P

Ignores clusters containing fewer than P percent of the instances.

Instances contained in the ignored (thrown) clusters are counted as unclustered instances.

--help

Displays this message.

--version

Displays the version information.
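
The effect of the --numthrow and --perthrow options described above can be sketched as a filter over cluster sizes. This is a minimal Python illustration, not the script's implementation. The function name discard_small_clusters is hypothetical, and computing the percentage against the total number of instances passed in is an assumption.

```python
def discard_small_clusters(sizes, numthrow=None, perthrow=None):
    """Split clusters into kept ones and a count of unclustered instances.

    sizes: dict mapping cluster id to number of instances.
    numthrow: drop clusters with fewer than this many instances.
    perthrow: drop clusters with fewer than this percent of all instances
              (assumption: percent is taken over the sum of all sizes).
    """
    total = sum(sizes.values())
    kept = {}
    unclustered = 0
    for cluster_id, size in sizes.items():
        below_count = numthrow is not None and size < numthrow
        below_percent = perthrow is not None and size * 100 < perthrow * total
        if below_count or below_percent:
            unclustered += size  # thrown instances count as unclustered
        else:
            kept[cluster_id] = size
    return kept, unclustered


# Cluster sizes from the 8-instance example: clusters 0 and 1 hold 3, cluster 2 holds 2.
sizes = {0: 3, 1: 3, 2: 2}
print(discard_small_clusters(sizes, numthrow=3))
print(discard_small_clusters(sizes, perthrow=25))
```

With these sizes, a count threshold of 3 drops cluster 2, while a threshold of 25 percent keeps it, because 2 of 8 instances is exactly 25 percent and the comparison is strictly "fewer than".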

OUTPUT

This will show

SYSTEM REQUIREMENTS

Cluto - http://www-users.cs.umn.edu/~karypis/cluto/

AUTHORS

 Amruta Purandare, University of Pittsburgh

 Ted Pedersen,  University of Minnesota, Duluth
 tpederse at d.umn.edu

COPYRIGHT

Copyright (c) 2002-2008, Amruta Purandare and Ted Pedersen

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to

 The Free Software Foundation, Inc.,
 59 Temple Place - Suite 330,
 Boston, MA  02111-1307, USA.