NAME

clusterlabeling.pl - Label discovered clusters based on their content

SYNOPSIS

 clusterlabeling.pl [OPTIONS] INPUTFILE

DESCRIPTION

Assigns labels to each cluster using the significant word pairs found in the cluster's contexts. Also writes each cluster to a separate file, which is particularly useful for the web interface.

Two types of labels are assigned to each cluster: Descriptive and Discriminating. Descriptive labels are the top n significant word pairs for the cluster. Discriminating labels are those of the cluster's top n significant word pairs that are unique to that cluster.
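The relationship between the two label types can be sketched in a few lines of Python. This is only an illustration, not the script's own code: the function name label_clusters is hypothetical, and it assumes that "unique to the cluster" means "absent from every other cluster's top n list". The sample pairs are taken from the OUTPUT section of this page.

```python
def label_clusters(top_pairs):
    """Map each cluster id to its descriptive and discriminating labels.

    top_pairs: dict of cluster id -> list of that cluster's top-n word pairs.
    Assumption: a pair is discriminating if no other cluster lists it.
    """
    labels = {}
    for cid, pairs in top_pairs.items():
        # Collect every pair that appears in some other cluster's list.
        others = set()
        for other_id, other_pairs in top_pairs.items():
            if other_id != cid:
                others.update(other_pairs)
        labels[cid] = {
            "descriptive": list(pairs),
            "discriminating": [p for p in pairs if p not in others],
        }
    return labels


top_pairs = {
    0: ["Bill Clinton", "Mariana Islands", "Northern Mariana",
        "Pacific island", "World Cup", "per hour"],
    2: ["Bill Clinton", "Erik wrote", "Inc Within", "Jersey And",
        "Lyle Menendez"],
}
labels = label_clusters(top_pairs)
for cid in sorted(labels):
    print(f"Cluster {cid} (Descriptive): {', '.join(labels[cid]['descriptive'])}")
    print(f"Cluster {cid} (Discriminating): {', '.join(labels[cid]['discriminating'])}")
```

"Bill Clinton" appears in both lists, so under this reading it stays a descriptive label for both clusters but is discriminating for neither.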

Required Arguments:

INPUTFILE

File created by Toolkit/evaluate/format_clusters.pl with the --context option.

Optional Arguments:

--token TOKEN

A file containing one or more Perl regexes that define the tokenization scheme for INPUTFILE.

If --token is not specified, the program looks for the default token regex file, token.regex, in the current directory.

--prefix PRE

Specifies a prefix for the names of the cluster files. For example, if PRE is the specified prefix, the cluster with id=0 is written to the file PRE.cluster.0.

If no prefix is specified, one is generated by appending a timestamp to the string "expr".
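As a rough illustration of the naming scheme, the following Python sketch builds a cluster file name from a prefix and a cluster id. The helper names are hypothetical, and the timestamp format (whole seconds since the epoch) is an assumption; this page does not say which format the script uses.

```python
import time


def default_prefix():
    # Assumption: the timestamp is seconds since the epoch; the real
    # format used by the script is not documented here.
    return "expr" + str(int(time.time()))


def cluster_file_name(prefix, cluster_id):
    # Documented pattern: PRE.cluster.ID
    return f"{prefix}.cluster.{cluster_id}"


name_with_prefix = cluster_file_name("PRE", 0)
name_with_default = cluster_file_name(default_prefix(), 0)
print(name_with_prefix)  # PRE.cluster.0
```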

--stop STOPFILE

A file of Perl regexes that define the stop list of words to be excluded from the features.

STOPFILE can be specified in one of two modes:

AND mode ignores word pairs in which both words are stop words.

OR mode ignores word pairs in which either word is a stop word.
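The difference between the two modes can be shown with a short Python sketch. It is a sketch under stated assumptions, not the script's implementation: the function is_ignored and the three stop patterns are hypothetical, and Python's re module stands in for Perl regexes.

```python
import re

# Hypothetical stop list; a real STOPFILE holds Perl regexes.
stop_regexes = [re.compile(p) for p in (r"^the$", r"^of$", r"^and$")]


def is_ignored(pair, stop_regexes, mode="AND"):
    """Return True if the word pair should be dropped under the given mode."""
    hits = [any(rx.search(word) for rx in stop_regexes) for word in pair]
    if mode == "AND":
        return all(hits)   # both words are stop words
    if mode == "OR":
        return any(hits)   # at least one word is a stop word
    raise ValueError(f"unknown stop mode: {mode}")


both_stop = ("of", "the")
one_stop = ("the", "president")
no_stop = ("White", "House")
for pair in (both_stop, one_stop, no_stop):
    print(pair, is_ignored(pair, stop_regexes, "AND"),
          is_ignored(pair, stop_regexes, "OR"))
```

OR mode is the stricter filter: it drops every pair that AND mode drops, plus any pair containing a single stop word.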

--remove N

Removes bigrams that occur fewer than N times.

The default value for this option is 5.
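A minimal Python sketch of this frequency cutoff follows. The function name remove_rare and the sample counts are invented for illustration; only the rule itself (drop bigrams seen fewer than N times, default 5) comes from the text above.

```python
from collections import Counter


def remove_rare(bigram_counts, n=5):
    """Keep only bigrams that occur at least n times."""
    return {bigram: count for bigram, count in bigram_counts.items() if count >= n}


# Hypothetical counts, for illustration only.
counts = Counter({("World", "Cup"): 7, ("per", "hour"): 5, ("Jersey", "And"): 4})
kept = remove_rare(counts)
print(sorted(kept))
```

A bigram that occurs exactly N times is kept, since only counts below N are removed.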

--window W

Specifies the window size for bigrams. Pairs of words that co-occur within the specified window of each other (a window of W allows at most W-2 intervening words) form the bigram features.

The default window size is 2, which allows only consecutive word pairs.
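The effect of the window size can be illustrated with a small Python sketch. The function window_pairs is hypothetical and works on an already tokenized list; it simply pairs each word with the words that follow it inside the window.

```python
def window_pairs(tokens, window=2):
    """Pair each token with the tokens up to window-1 positions after it,
    so at most window-2 words lie between the two members of a pair."""
    pairs = []
    for i, first in enumerate(tokens):
        for j in range(i + 1, min(i + window, len(tokens))):
            pairs.append((first, tokens[j]))
    return pairs


tokens = ["Northern", "Mariana", "Islands"]
consecutive = window_pairs(tokens, window=2)
wider = window_pairs(tokens, window=3)
print(consecutive)  # [('Northern', 'Mariana'), ('Mariana', 'Islands')]
print(wider)        # adds ('Northern', 'Islands'), one intervening word
```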

--stat STAT

Specifies the statistical measure of association used to score word pairs. The following are available:

                ll              -       Log Likelihood Ratio [default]
                pmi             -       Point-Wise Mutual Information
                tmi             -       True Mutual Information
                x2              -       Chi-Squared Test
                phi             -       Phi Coefficient
                tscore          -       T-Score
                dice            -       Dice Coefficient
                odds            -       Odds Ratio
                leftFisher      -       Left Fisher's Test
                rightFisher     -       Right Fisher's Test
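To give a sense of what such a score measures, here is a Python sketch of the textbook log-likelihood ratio (G squared) for a 2x2 contingency table of bigram counts. It shows the standard formula only; whether the script computes its ll score in exactly this way is an assumption, and the function name and argument names are chosen for this example.

```python
import math


def log_likelihood(n11, n1p, np1, npp):
    """G^2 for a bigram from its 2x2 contingency table.

    n11: count of the bigram (word1 followed by word2)
    n1p: count of bigrams with word1 in first position
    np1: count of bigrams with word2 in second position
    npp: total number of bigrams
    """
    observed = [n11, n1p - n11, np1 - n11, npp - n1p - np1 + n11]
    rows = [n1p, n1p, npp - n1p, npp - n1p]
    cols = [np1, npp - np1, np1, npp - np1]
    score = 0.0
    for obs, row, col in zip(observed, rows, cols):
        expected = row * col / npp
        if obs > 0:  # a zero cell contributes nothing to the sum
            score += obs * math.log(obs / expected)
    return 2.0 * score


independent = log_likelihood(n11=10, n1p=20, np1=50, npp=100)
associated = log_likelihood(n11=10, n1p=10, np1=10, npp=100)
print(independent, associated)
```

In the first call the observed counts equal the counts expected under independence, so the score is zero; in the second the two words only ever occur together, so the score is large.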

--rank R

Word pairs ranking below R when arranged in descending order of their test scores are ignored.

The default value for this option is 10.
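The rank cutoff can be sketched in Python as a sort followed by a slice. The function top_ranked and the scores below are made up for illustration, and the sketch assumes that rank is simply position in the sorted list; how the script ranks tied scores is not described here.

```python
def top_ranked(scored_pairs, r=10):
    """Keep the r highest-scoring word pairs, best first.

    Assumption: rank is position after sorting by descending score,
    so tied scores are not given a shared rank.
    """
    ordered = sorted(scored_pairs.items(), key=lambda item: item[1], reverse=True)
    return [pair for pair, _score in ordered[:r]]


# Hypothetical scores, for illustration only.
scores = {"World Cup": 41.2, "per hour": 17.5, "Bill Clinton": 63.0, "Jersey And": 3.1}
best_two = top_ranked(scores, r=2)
print(best_two)  # ['Bill Clinton', 'World Cup']
```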

--newLine

If turned on, the word pair selection process will not span newlines.

By default this option is turned off; that is, word pair selection spans lines.
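The following Python sketch shows the practical difference for consecutive word pairs. It is illustrative only: the function consecutive_pairs is hypothetical, and plain whitespace splitting stands in for the tokenization that the --token regex file would define.

```python
def consecutive_pairs(text, new_line=False):
    """Collect consecutive word pairs, optionally stopping at line ends."""
    # With new_line set, each line is processed on its own, so no pair
    # joins the last word of one line to the first word of the next.
    segments = text.splitlines() if new_line else [text]
    pairs = []
    for segment in segments:
        tokens = segment.split()  # assumption: whitespace tokenization
        pairs.extend(zip(tokens, tokens[1:]))
    return pairs


text = "New York\nTimes reported"
spanning = consecutive_pairs(text)
per_line = consecutive_pairs(text, new_line=True)
print(spanning)  # includes ('York', 'Times'), which crosses the newline
print(per_line)  # [('New', 'York'), ('Times', 'reported')]
```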

Other Options:

--help

Displays a quick summary of program options.

--version

Displays the version information.

--verbose

Prints the current program status to STDERR.

OUTPUT

1. Cluster IDs, followed by their assigned labels, are written to STDOUT:
 Cluster 0 (Descriptive): Bill Clinton, Mariana Islands, Northern Mariana, Pacific island, World Cup, per hour

 Cluster 0 (Discriminating): Mariana Islands, Northern Mariana, Pacific island, World Cup, per hour

 Cluster 2 (Descriptive): Bill Clinton, Erik wrote, Inc Within, Jersey And, Lyle Menendez

 Cluster 2 (Discriminating): Erik wrote, Inc Within, Jersey And, Lyle Menendez

 Cluster 1: 

 Cluster 3:

 Cluster -1 (Descriptive): York Times, Undated _
 
 Cluster -1 (Discriminating): York Times, Undated _

2. Cluster files, named with the specified prefix or the generated prefix.

SYSTEM REQUIREMENTS

Input to this program should be created by format_clusters.pl.

BUGS

AUTHOR

 Anagha Kulkarni, Carnegie-Mellon University

 Ted Pedersen, University of Minnesota, Duluth
 tpederse at d.umn.edu

COPYRIGHT

Copyright (c) 2004-2008, Anagha Kulkarni and Ted Pedersen

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to

 The Free Software Foundation, Inc.,
 59 Temple Place - Suite 330,
 Boston, MA  02111-1307, USA.