#!/usr/local/bin/perl -w
=head1 NAME
discriminate.pl - Wrapper program to run SenseClusters in a single command
=head1 SYNOPSIS
Discriminates among the given text instances based on their contextual
similarities.
=head1 USAGE
discriminate.pl [OPTIONS] TEST
=head1 INPUT
=head2 Required Arguments:
=head3 TEST
Senseval-2 formatted TEST instance file that contains the instances
to be clustered.
=head2 Optional Arguments:
=head3 DATA OPTIONS :
=head4 --training TRAIN
Training file in plain text format that can be used to select features.
If this is not specified, features are selected from the given TEST file.
=head4 --split N
Splits the given TEST file into two portions, N% for use as the TRAIN
data and (100-N)% as the TEST data. The value for N is a percentage and
should be an integer between 1 and 99 (inclusive). The instances from the
original TEST file are not picked or split in any particular order but are
randomly split into the two portions of TRAIN and TEST data while maintaining
the ratio of N/(100-N).
Note: This option cannot be used when --training option is also used.
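
The random N/(100-N) split described above can be sketched as follows. This is only an illustration: the helper name split_instances and the rounding rule are assumptions, and the splitting code actually used by discriminate.pl may differ.

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Hypothetical helper (not part of SenseClusters): randomly split a list
# of instances into N% TRAIN and (100-N)% TEST.
# Assumption: the TRAIN size is rounded to the nearest integer.
sub split_instances {
    my ($n, @instances) = @_;
    die "N must be an integer between 1 and 99\n"
        unless $n =~ /^\d+$/ && $n >= 1 && $n <= 99;
    my @shuffled   = shuffle(@instances);
    my $train_size = int(@shuffled * $n / 100 + 0.5);
    my @train = @shuffled[0 .. $train_size - 1];
    my @test  = @shuffled[$train_size .. $#shuffled];
    return (\@train, \@test);
}

my ($train, $test) = split_instances(80, map { "inst$_" } 1 .. 10);
printf "%d train, %d test\n", scalar @$train, scalar @$test;
```

Shuffling before slicing is what keeps the two portions independent of the original instance order.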
=head4 --token TOKEN
A file containing Perl regex/s that define the tokenization scheme in TRAIN
and TEST files. If --token is not specified, the default token regex file
token.regex is looked for in the current directory.
=head4 --target TARGET
A file containing Perl regex/s for identifying the target word. A sample
target.regex file containing regex:
/<head>\w+<\/head>/
is provided with this distribution. If --target is not specified, the default
target regex file target.regex is looked for in the current directory.
If this file doesn't exist, target.regex is automatically created by finding
all instances of <head> tags from the TEST data. If there are no instances
of <head> tags in TEST, the given data is assumed to be global and target
word is not searched in either TRAIN or TEST.
Note: --target cannot be specified with headless input data
i.e. test file without head/target word(s).
=head4 --prefix PRE
Specify a prefix to be used in all output file names. e.g. context vector
file will have name 'PRE.vectors', features file will have name 'PRE.features'
and so on ... By default, a random prefix is created using the time stamp.
=head4 --format f16.XX
The default format for floating point numbers is f16.06. This means that
there is room for 6 digits to the right of the decimal, and 9 to the
left. You may change XX to any value between 0 and 15, however, the
format must remain 16 spaces long due to formatting requirements of SVDPACKC.
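
To make the width arithmetic concrete, the sketch below prints a value in a 16-character field with 6 decimals using Perl's sprintf. Mapping the f16.XX notation onto a printf-style format in this way is an assumption made for illustration, not necessarily how discriminate.pl or SVDPACKC applies it.

```perl
use strict;
use warnings;

# Assumption: f16.06 corresponds to a printf-style "%16.6f" field,
# i.e. 9 integer positions + 1 decimal point + 6 fractional digits.
my $prec = 6;
my $cell = sprintf("%16.${prec}f", 3.14159265);
print "[$cell]\n";
print length($cell), "\n";
```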
=head4 --wordclust
Discriminates and clusters each word based upon its direct and indirect
co-occurrence with other words (when used without the --lsa switch) or
clusters words or features based upon their occurrences in different contexts
(when used with the --lsa switch).
Note: 1. Separate (--training) TRAIN data should not be used with word
clustering.
2. Starting with Version 0.93, word clustering is no longer
restricted to using only headless data. However, options
specific to headed data such as --scope_test and target
co-occurrence features (see below) cannot be used.
=head4 --lsa
Uses Latent Semantic Analysis (LSA) style representation for clustering
features or contexts. LSA representation is the transpose of
the context-by-feature matrix created using the native SenseClusters
order1 context representation.
This option can be used only in the following two combinations of
the --context and the --wordclust options:
=over
=item 1. --context o1 --wordclust --lsa
Performs feature clustering by grouping together features based on the
contexts that they occur in. Features can be unigrams, bigrams or
co-occurrences. Feature vectors are the rows of the transposed
context-by-feature representation created by order1vec.pl.
=item 2. --context o2 --lsa
Performs context clustering by creating context vectors by averaging the
feature vectors from the transposed context-by-feature representation of
order1vec.pl.
=back
=head3 FEATURE OPTIONS :
=head4 --feature TYPE
Specify the feature type to be used for representing contexts.
Possible options for feature type with first order context representation:
bi - bigrams [default]
tco - target co-occurrences
co - co-occurrences
uni - unigrams
Possible options for feature type with second order context representation:
bi - bigrams [default]
co - co-occurrences
tco - target co-occurrences
Note: The tco (target co-occurrences) feature type cannot be used with headless
data i.e. test/train file without head/target word(s).
=head4 --scope_train S1
Limits the scope of the training contexts to S1 words around (on both
sides of) the TARGET word. Thus, it allows selection of local features.
If --scope_train is used, each training instance is expected to include
the target word as specified by the --target option or default target.regex.
Note: --scope_train cannot be used with headless data i.e. train files
without head/target word(s).
=head4 --scope_test S2
Limits the scope of the test contexts to S2 words around (on both sides of)
the TARGET word. Thus, it allows local features to be matched and used in the
context vectors.
Note: --scope_test cannot be used with headless data i.e. test files
without head/target word(s).
=head4 --stop STOPFILE
A file of Perl regexes that define the stop list of words to be excluded from
the features.
STOPFILE can be specified in one of two modes -
AND mode - declared by including '@stop.mode=AND' on the first line of the
STOPFILE.
- ignores word pairs in which both words are stop words.
OR mode - declared by including '@stop.mode=OR' on the first line of the
STOPFILE.
- ignores word pairs in which either word is a stop word.
Both modes exclude stop words from unigram features.
Default is OR mode.
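
The difference between the two stop modes can be sketched as follows. The helper name pair_is_stopped is hypothetical, and the stop list is simplified to a hash of literal words, whereas a real STOPFILE holds Perl regexes.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of SenseClusters): decide whether a word
# pair is ignored under the given stop mode.
# Assumption: stop words are literal strings stored as hash keys.
sub pair_is_stopped {
    my ($mode, $stop, $w1, $w2) = @_;
    my $s1 = exists $stop->{$w1};
    my $s2 = exists $stop->{$w2};
    return $mode eq 'AND' ? ($s1 && $s2) : ($s1 || $s2);
}

my %stop = map { $_ => 1 } qw(the of a);
for my $pair ([qw(the of)], [qw(the bank)], [qw(river bank)]) {
    for my $mode (qw(AND OR)) {
        my $verdict = pair_is_stopped($mode, \%stop, @$pair) ? 'ignored' : 'kept';
        print "$mode: @$pair -> $verdict\n";
    }
}
```

In this sketch the pair "the bank" is kept in AND mode but ignored in OR mode, which is the only case where the two modes disagree.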
=head4 --remove F
Removes features that occur fewer than F times in the training corpus.
=head4 --window W
Specifies the window size for bigram/co-occurrence features. Pairs of words
that co-occur within the specified window from each other (window W allows at
most W-2 intervening words) will form the bigram/co-occurrence features.
Default window size is 2 which allows only consecutive word pairs.
Not applicable to unigram features.
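
As an illustration of the window rule, the sketch below lists the ordered word pairs that fall within a window of W tokens, so that at most W-2 words intervene. The helper name window_pairs and the pair separator are assumptions; the feature selection programs called by discriminate.pl may count pairs differently.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of SenseClusters): ordered word pairs
# with at most W-2 intervening words between the two members.
sub window_pairs {
    my ($w, @tokens) = @_;
    my @pairs;
    for my $i (0 .. $#tokens) {
        for my $j ($i + 1 .. $i + $w - 1) {
            last if $j > $#tokens;
            push @pairs, "$tokens[$i]<>$tokens[$j]";
        }
    }
    return @pairs;
}

# Window 2 yields only consecutive pairs; window 3 also allows one
# intervening word.
print join(' ', window_pairs(2, qw(a b c d))), "\n";
print join(' ', window_pairs(3, qw(a b c d))), "\n";
```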
=head4 --stat STAT
Bigrams and co-occurrences can be selected based on their statistical scores
of association as specified by this option. If --context = o2 and
--stat is used, the word association matrix will use the scores computed by the
specified statistical test instead of simple joint frequency counts of the
word pairs.
Available tests of association are :
dice - Dice Coefficient
ll - Log Likelihood Ratio
odds - Odds Ratio
phi - Phi Coefficient
pmi - Point-Wise Mutual Information
tmi - True Mutual Information
x2 - Chi-Squared Test
tscore - T-Score
leftFisher - Left Fisher's Test
rightFisher - Right Fisher's Test
By default, features are selected and represented using their frequency
counts.
=head4 --stat_rank N
Word pairs ranking below N when arranged in descending order of their test
scores are ignored.
--stat_rank has no effect unless --stat is specified.
=head4 --stat_score S
Selects word pairs with scores greater than S after performing the selected
test of association. Score could be any real number that will give reasonable
number of features for the requested test.
--stat_score has no effect unless --stat is specified.
=head3 VECTOR OPTIONS :
=head4 --context ORD
Specifies the context representation to be used. Set ORD to 'o1' to use
1st order context vectors, and to 'o2' to select 2nd order context vectors.
Default context representation is o2.
=head4 --binary
Creates binary feature and context vectors. By default, feature vectors
show the joint frequency scores of the associated word pairs while the
context vectors show the average of the feature vectors of words that occur
in the context. With --binary turned ON, feature vectors show mere presence or
absence of the particular word pair (co-occurrence/bigram) in TRAIN,
while the context vectors will represent a binary 'OR' operation on the
corresponding vectors of contextual features.
=head3 SVD OPTIONS :
=head4 --svd
Reduces the feature space dimensions by performing Singular Value Decomposition
(SVD). By default, all feature dimensions are retained.
=head4 --k K
Reduces the dimensions of the feature space to K. Default K = 300
=head4 --rf RF
Specifies the scaling factor for reducing feature space dimensions such that
feature space with N dimensions is reduced down to N/RF. Default RF = 4.
RF should be an integer greater than 1.
If both --k and --rf are specified, dimensions are reduced to min(k,N/RF).
Note: If the reduced dimensions ( min(k,N/RF) ) turn-out to be less than
or equal to 10 then svd is not performed.
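
The interaction of --k and --rf can be sketched as follows. The helper name reduced_dims is hypothetical, and truncating N/RF to an integer is an assumption; the sketch only mirrors the min(k,N/RF) rule and the cut-off of 10 stated above.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Hypothetical helper (not part of SenseClusters): number of dimensions
# kept after SVD for a feature space of N dimensions.
# Returns undef when the result is <= 10, i.e. when SVD would be skipped.
# Assumption: N/RF is truncated to an integer.
sub reduced_dims {
    my ($n, $k, $rf) = @_;
    my $dims = min($k, int($n / $rf));
    return $dims > 10 ? $dims : undef;
}

for my $n (2000, 400, 40) {
    my $dims = reduced_dims($n, 300, 4);
    print "N=$n: ", defined $dims ? "reduce to $dims" : "SVD not performed", "\n";
}
```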
=head4 --iter I
Specifies the number of iterations of SVD. Recommended value is 3 times
the desired K.
=head3 CLUSTER-STOPPING OPTIONS:
=head4 --cluststop CS
Specifies the cluster stopping measure to be used to predict
the number of clusters.
The possible option values:
pk1 - Use PK1 measure [ PK1[m] = (crfun[m] - mean(crfun[1...deltaM]))/std(crfun[1...deltaM]) ]
pk2 - Use PK2 measure [ PK2[m] = (crfun[m]/crfun[m-1]) ]
pk3 - Use PK3 measure [ PK3[m] = ((2 * crfun[m])/(crfun[m-1] + crfun[m+1])) ]
gap - Use Adapted Gap Statistic.
pk - Use all the PK measures.
all - Use all the four cluster stopping measures.
More about these measures can be found in the documentation of
Toolkit/clusterstop/clusterstopping.pl
NOTE: Options --cluststop and --clusters (described under Clustering options) cannot be used together.
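
The three PK formulas listed above can be transcribed directly, as in the sketch below. The function names and the sample crfun values are made up for illustration, and the use of the population standard deviation in PK1 is an assumption; how the scores are turned into a predicted k is described in the clusterstopping.pl documentation and is not shown here.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Each function takes a reference to an array of crfun values indexed
# by m (index 0 is unused).
sub pk2 { my ($cr, $m) = @_; return $cr->[$m] / $cr->[$m - 1]; }

sub pk3 {
    my ($cr, $m) = @_;
    return 2 * $cr->[$m] / ($cr->[$m - 1] + $cr->[$m + 1]);
}

sub pk1 {
    my ($cr, $m, $delta_m) = @_;
    my @vals = @{$cr}[1 .. $delta_m];
    my $mean = sum(@vals) / @vals;
    # Assumption: population (not sample) standard deviation.
    my $std  = sqrt(sum(map { ($_ - $mean) ** 2 } @vals) / @vals);
    return ($cr->[$m] - $mean) / $std;
}

# Made-up crfun values for m = 1..5.
my @crfun = (undef, 10, 16, 19, 20, 20.5);
printf "PK1[2]=%.4f PK2[2]=%.4f PK3[2]=%.4f\n",
    pk1(\@crfun, 2, 5), pk2(\@crfun, 2), pk3(\@crfun, 2);
```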
=head4 --delta INT
NOTE: Delta value can only be a non-negative integer value.
Specify 0 to stop the iterative clustering process when two consecutive crfun values
are exactly equal. This is the default setting when the crfun values are integer/whole numbers.
Specify a non-zero positive integer to stop the iterative clustering process when the difference
between two consecutive crfun values is less than or equal to this value. However, note that the
integer value specified is internally shifted to capture the difference in the least significant
digit of the crfun values when these crfun values are fractional.
For example:
For crfun = 1.23e-02 & delta = 1 will be transformed to 0.0001
For crfun = 2.45e-01 & delta = 5 will be transformed to 0.005
The default delta value when the crfun values are fractional is 1.
However if the crfun values are integer/whole numbers (exponent >= 2) then the specified delta
value is internally shifted only until the least significant digit in the scientific notation.
For example:
For crfun = 1.23e+04 & delta = 2 will be transformed to 200
For crfun = 2.45e+02 & delta = 5 will be transformed to 5
For crfun = 1.44e+03 & delta = 1 will be transformed to 10
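
All five examples above are consistent with multiplying delta by 10 raised to (exponent - 2), where the exponent comes from writing crfun with a three-digit mantissa. The sketch below encodes that reading; the helper name scaled_delta is hypothetical and the formula is inferred from the examples, not taken from the cluster stopping code.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of SenseClusters): shift an integer delta
# to the least significant digit of a crfun value written as d.dde+/-xx.
# Assumption: crfun values carry three significant digits.
sub scaled_delta {
    my ($crfun, $delta) = @_;
    my ($exp) = sprintf('%.2e', $crfun) =~ /e([+-]\d+)$/;
    return $delta * 10 ** ($exp - 2);
}

for my $case ([1.23e-02, 1], [2.45e-01, 5], [1.23e+04, 2],
              [2.45e+02, 5], [1.44e+03, 1]) {
    my ($crfun, $delta) = @$case;
    printf "crfun = %.2e, delta = %d -> %g\n",
        $crfun, $delta, scaled_delta($crfun, $delta);
}
```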
=head4 --threspk1 NUM
Specifies the threshold value that should be used by the PK1 measure to predict the k value.
Default = -0.7
NOTE: This option should be used only when --cluststop option is also used
with option value of "all" or "pk1".
=head3 CLUSTER-STOPPING: ADAPTED GAP STATISTIC OPTIONS:
=head4 --B NUM
The number of replicates/references to be generated.
Default: 1
=head4 --typeref TYP
Specifies whether to generate B replicates from a reference or to generate
B references.
The possible option values:
rep - replicates [Default]
ref - references
=head4 --percentage NUM
Specifies the percentage confidence to be reported in the log file.
Since Gap Statistic uses parametric bootstrap method for reference distribution
generation, it is critical to understand the interval around the sample mean that
could contain the population ("true") mean and with what certainty.
Default: 90
=head4 --seed NUM
The seed to be used with the random number generator.
Default: No seed is set.
=head3 CLUSTERING OPTIONS :
=head4 --clusters N
Specifies number of clusters to be created. Default is set to 2.
=head4 --space SPACE
Specifies whether clustering is to be performed in vector or similarity space.
Set the value of SPACE to 'vector' to perform clustering in vector space i.e.
to cluster the context vectors directly. To cluster in similarity space
by explicitly finding the pair-wise similarities among the contexts,
set SPACE to 'similarity'.
By default, clustering is performed in vector space.
=head4 --clmethod CL
Specifies the clustering method.
Possible option values are :
rb - Repeated Bisections [Default]
rbr - Repeated Bisections followed by k-way refinement
direct - Direct k-way clustering
agglo - Agglomerative clustering
graph - Graph partitioning-based clustering
bagglo - Partitional biased Agglomerative clustering
For large amounts of data, 'rb', 'rbr' or 'direct' are recommended.
=head4 --crfun CR
Selects the criteria function for Clustering. The meanings of these criteria
functions are explained in Cluto's manual.
The possible values are:
i1 - I1 Criterion function
i2 - I2 Criterion function [default for partitional]
e1 - E1 Criterion function
g1 - G1 Criterion function
g1p - G1' Criterion function
h1 - H1 Criterion function
h2 - H2 Criterion function
slink - Single link merging scheme
wslink - Single link merging scheme weighted w.r.t. cluster sim
clink - Complete link merging scheme
wclink - Complete link merging scheme weighted w.r.t. cluster sim
upgma - Group average merging scheme [default for agglomerative]
Note that only the i1, i2, e1, h1 and h2 criterion functions can be used
for cluster stopping. If any other crfun is selected, cluster
stopping uses the default crfun (i2) while the final clustering of contexts
is performed using the crfun specified.
=head4 --sim SIM
Specifies the similarity measure to be used for either vector or similarity
space clustering.
When --space = vector (or default), possible values of SIM are :
cos - Cosine [default]
corr - Correlation Coefficient
dist - Euclidean distance
jacc - Extended Jaccard Coefficient
When --space = similarity and --binary is ON, possible values of SIM are -
cos - Cosine [default]
mat - Match
jac - Jaccard
ovr - Overlap
dic - Dice
Otherwise, only cosine measure is available and is default.
The following table summarizes availability of similarity measures
for 2 clustering approaches - vector(vcl) and similarity(scl) and
on 2 different types of context vectors - binary Vs frequency
       vcl+bin   vcl+freq  scl+bin   scl+freq
 cos   Y         Y         Y         Y
 mat   N         N         Y         N
 jacc  Y         Y         Y         N
 dice  N         N         Y         N
 ovr   N         N         Y         N
 dist  Y         Y         N         N
 corr  Y         Y         N         N
The reasons are purely implementation issues and in future, we plan to support
more consistent measures across these combinations.
=head4 --rowmodel RMOD
The option is used to specify the model to be used to scale every
column of each row. (For further details please refer to the Cluto manual)
The possible values for RMOD -
none - no scaling is performed (default setting)
maxtf - post scaling the values are between 0.5 and 1.0
sqrt - square-root of actual values
log - log of actual values
=head4 --colmodel CMOD
The option is used to specify the model to be used to (globally) scale each
column across all rows. (For further details please refer to the Cluto manual)
The possible values for CMOD -
none - no scaling is performed (default setting)
idf - scaling according to inverse-document-frequency
=head3 LABELING OPTIONS :
Note: Labeling options cannot be used with word-clustering (--wordclust).
=head4 --label_stop LABEL_STOPFILE
A file of Perl regexes that define the stop list of words to be
excluded from the features.
LABEL_STOPFILE could be specified with two modes -
AND mode - declared by including '@stop.mode=AND' on the first line of the
LABEL_STOPFILE
- ignores word pairs in which both words are stop words.
OR mode - declared by including '@stop.mode=OR' on the first line of the
LABEL_STOPFILE
- ignores word pairs in which either word is a stop word.
Default is OR.
=head4 --label_ngram LABEL_NGRAM
Specifies the value of n in 'n-gram' for the feature selection.
The supported values for n are 2, 3 and 4.
Default value is 2 i.e. bigram.
=head4 --label_remove LABEL_N
Removes ngrams that occur less than LABEL_N times.
=head4 --label_window LABEL_W
Specifies the window size for bigrams. Pairs of words that co-occur
within the specified window from each other (window LABEL_W allows at most
LABEL_W-2 intervening words) will form the bigram features.
Default window size is 2 which allows only consecutive word pairs.
=head4 --label_stat LABEL_STAT
Specifies the statistical scores of association.
Available tests of association are :
dice - Dice Coefficient
ll - Log Likelihood Ratio
odds - Odds Ratio
phi - Phi Coefficient
pmi - Point-Wise Mutual Information
tmi - True Mutual Information
x2 - Chi-Squared Test
tscore - T-Score
leftFisher - Left Fisher's Test
rightFisher - Right Fisher's Test
=head4 --label_rank LABEL_R
Word pairs ranking below LABEL_R when arranged in descending order of
their test scores are ignored.
=head3 Other Options :
=head4 --eval
Evaluates clustering performance by computing precision and recall for maximally
accurate assignment of sense tags to clusters. Maximal Assignment is when
clusters are given sense labels such that maximum number of instances will be
attached with their true sense tags.
TEST instances tagged with multiple senses are automatically attached with the
single sense-tag that is the most frequent among the attached tags.
Note: This option can be used only if the answer tags are provided in the TEST file.
=head4 --rank_filter R
Removes low frequency senses during evaluation. This will
remove the senses that rank below R when senses in TEST are arranged
in the descending order of their frequencies. In other words, it
selects top R most frequent senses. An instance will be removed if
it has all sense tags below rank R.
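
The rank-based filter can be sketched as follows. The helper name top_senses is hypothetical, and breaking frequency ties alphabetically is an assumption; the filtering program called by discriminate.pl may resolve ties differently.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of SenseClusters): return the R most
# frequent sense tags from a list of tags, one tag per instance.
# Assumption: ties in frequency are broken alphabetically.
sub top_senses {
    my ($r, @tags) = @_;
    my %freq;
    $freq{$_}++ for @tags;
    my @ranked = sort { $freq{$b} <=> $freq{$a} || $a cmp $b } keys %freq;
    $r = @ranked if $r > @ranked;
    return @ranked[0 .. $r - 1];
}

my @kept = top_senses(2, qw(bank1 bank1 bank1 bank2 bank2 bank3));
print "kept senses: @kept\n";
```

Instances whose tags all fall outside the returned list would then be dropped before evaluation.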
=head4 --percent_filter P
Removes low frequency senses based on their percentage
frequencies. This will remove senses whose frequency is below P%
in the TEST data.
If rank or percent filters are specified, they are applied after removing
the multiple sense tags.
=head4 --help
Displays a quick summary of program options.
=head4 --version
Displays the version information.
=head4 --verbose
Displays to STDERR the current program status.
=head4 --showargs
Displays to STDOUT values of compulsory and required parameters.
[NOT SUPPORTED IN THIS VERSION]
=head1 OUTPUT
discriminate.pl creates several output files. The discrimination of contexts
performed by discriminate.pl, (i.e., a cluster assigned to each context) is given
by the file $PREFIX.clusters if the number of clusters was set manually, otherwise
by the file $PREFIX.clusters.$CLUSTSTOP where the $CLUSTSTOP specifies the cluster
stopping measure that was used to predict the number of clusters.
In addition, discriminate.pl also creates the following files:
NOTE: If a cluster stopping measure was used then it is indicated in the names of
several output files by appending the cluster stopping measure name with the
file name. Represented below as filename[.$CLUSTSTOP]
=over
=item * $PREFIX.clusters_context[.$CLUSTSTOP] - File containing all the input instances grouped by the cluster-id assigned to them.
=item * $PREFIX[.$CLUSTSTOP].cluster.CLUSTERID - All the identified clusters and their instances are separated into different files. The filenames end with the cluster-id. e.g.: File containing instances of cluster 0 will be named as $PREFIX.cluster.0
=item * $PREFIX.report[.$CLUSTSTOP] - Confusion table if --eval is ON
=item * $PREFIX.cluster_labels[.$CLUSTSTOP] - List of labels (word-pairs) assigned to each cluster.
=item * $PREFIX[.$CLUSTSTOP].dendogram.ps - Dendrograms + some information.
=item * $PREFIX.features - Features file
=item * $PREFIX.regex - File containing regular expressions for identifying
the features listed in $PREFIX.features file.
=item * $PREFIX.testregex - File containing only those regular expressions from
the $PREFIX.regex file above, which match at least once in the test contexts,
only created in second order context clustering mode (SC native as well as LSA)
and LSA feature clustering mode
=item * $PREFIX.wordvec - Word Vectors if --context = o2
=item * $PREFIX.vectors - Context Vectors
=item * $PREFIX.rlabel - Row Labels of $PREFIX.vectors
=item * $PREFIX.clabel - Column Labels of $PREFIX.vectors
=item * $PREFIX.rclass - Class Ids of $PREFIX.vectors if --eval is ON
=item * $PREFIX.cluster_solution[.$CLUSTSTOP] - Cluster ids of $PREFIX.vectors
=item * $PREFIX.cluster_output[.$CLUSTSTOP] - Clustering program output
=back
=head3 Cluster Stopping related output files:
=over
=item * $PREFIX.pk1 - crfun[k] values, delta values, PK1[k] values and predicted k value
=item * $PREFIX.pk2 - crfun[k] values, delta values, PK2[k] values and predicted k value
=item * $PREFIX.pk3 - crfun[k] values, delta values, PK3[k] values and predicted k value
=item * $PREFIX.gap - crfun[k] values, delta values and predicted k value
=item * $PREFIX.gap.log - Gap(k), Obs(crfun(k)), Exp(crfun(k)) values etc.
=back
=head3 The following files are created to facilitate creation of plots, if needed:
=over
=item * $PREFIX.cr.dat - value-pairs :- k-value crfun-value
=item * $PREFIX.pk1.dat - value-pairs :- k-value PK1[k] value
=item * $PREFIX.pk2.dat - value-pairs :- k-value PK2[k] value
=item * $PREFIX.pk3.dat - value-pairs :- k-value PK3[k] value
=item * $PREFIX.gap.dat - value-pairs :- k-value Gap[k] value
=item * $PREFIX.exp.dat - value-pairs :- k-value Exp(crfun[k]) value
=back
=head1 AUTHORS
Ted Pedersen, University of Minnesota, Duluth
tpederse at d.umn.edu
Amruta Purandare, University of Pittsburgh
Anagha Kulkarni, Carnegie-Mellon University
Mahesh Joshi, Carnegie-Mellon University
=head1 COPYRIGHT
Copyright (c) 2002-2008, Ted Pedersen, Amruta Purandare, Anagha Kulkarni, Mahesh Joshi
This program is free software; you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free Software
Foundation; either version 2 of the License, or (at your option) any later
version.
This program is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with
this program; if not, write to
The Free Software Foundation, Inc.,
59 Temple Place - Suite 330,
Boston, MA 02111-1307, USA.
=cut
###############################################################################
# THE CODE STARTS HERE
#$0 contains the program name along with
#the complete path. Extract just the program
#name and use in error messages
$0=~s/.*\/(.+)/$1/;
###############################################################################
# ================================
# COMMAND LINE OPTIONS AND USAGE
# ================================
use Math::SparseMatrix;
# use the following perl module for command line options parsing
# Do not allow abbreviations of options i.e. options have to be spelled out completely.
use Getopt::Long qw(:config no_auto_abbrev);
# command line options
# catch, abort and print the message for unknown options specified
eval(GetOptions ("help","version","training=s","token=s","target=s","stop=s","feature=s","remove=i","window=i","scope_train=i","scope_test=i","stat=s","stat_rank=i","stat_score=f","context=s","binary","svd","k=i","rf=i","iter=i","clusters=i","space=s","clmethod=s","crfun=s","sim=s","eval","verbose","showargs","prefix=s","format=s","rank_filter=i","percent_filter=f","label_ngram=i","label_window=i","label_stop=s","label_remove=i","label_stat=s","label_rank=i","wordclust","split=i","rowmodel=s","colmodel=s","cluststop=s","threspk1=f","delta=i","B=i","typeref=s","percentage=i","seed=i", "lsa")) or die("Please check the above mentioned option(s).\n");
# show help option
if(defined $opt_help)
{
$opt_help=1;
&showhelp();
exit;
}
# show version information
if(defined $opt_version)
{
$opt_version=1;
&showversion();
exit;
}
# show minimal usage message if no arguments
if($#ARGV<0)
{
&showminimal();
exit 1;
}
#############################################################################
# ================================
# INITIALIZATION AND INPUT
# ================================
# Note on ERROR message conventions - error and warning messages from
# discriminate.pl should go to STDERR, and should be indented 1 tab.
# Error messages from Toolkit programs should be indented 2 tabs.
# TDP August, 2006
# ----------
# Testfile
# ----------
if(!defined $ARGV[0])
{
print STDERR "ERROR($0):
Please specify the TEST file name...\n";
exit 1;
}
$testfile=$ARGV[0];
if(!-e $testfile)
{
print STDERR "ERROR($0):
Could not locate the TEST file <$testfile>\n";
exit 1;
}
# ---------------
# Tokenfile
# ---------------
if(defined $opt_token)
{
$token=$opt_token;
}
else
{
$token="token.regex";
}
if(!-e $token)
{
print STDERR "ERROR($0):
Could not locate the TOKEN file <$token>\n";
exit 1;
}
elsif(-z $token)
{
print STDERR "ERROR($0):
TOKEN file <$token> is empty.\n";
exit 1;
}
# ---------------
# Targetfile
# ---------------
my $target = "";
if(defined $opt_target)
{
$target=$opt_target;
if(!-e $target)
{
print STDERR "ERROR($0):
Could not locate the TARGET file <$target>\n";
exit 1;
}
}
else
{
$target="target.regex";
# this will automatically create the target.regex file
# in the current dir.
if(!-e $target)
{
$status=system("maketarget.pl -head $testfile");
die "Error while running maketarget.pl on <$testfile>\n" unless $status==0;
}
}
# --------------
# Prefix
# --------------
if(defined $opt_prefix)
{
$prefix=$opt_prefix;
}
else
{
$prefix="expr" . time();
}
# --------------
# Format
# --------------
if(defined $opt_format)
{
if ($opt_format !~ /^f16\.(\d\d)$/ || $1 >= 16)
{
print STDERR "ERROR($0):
--format must be of the form f16.XX, where 0 <= XX < 16,
not $opt_format\n";
exit 1;
}
else
{
$format=$opt_format; ## format is defined, has valid form
$format =~ /^f16.(\d\d)/;
$prec = $1; # precision
}
}
else
{
$format = "f16.06"; ## format is not defined, use default
$prec = 6;
}
# --------------
# SVD options
# --------------
if(!defined $opt_k)
{
$opt_k=300;
}
if(!defined $opt_rf)
{
$opt_rf=4;
}
# default feature
if(!defined $opt_feature)
{
$opt_feature = "bi";
}
# initialize the variable for default number of clusters
$default_clusters = 2;
# --------------
# Error checks
# --------------
if(defined $opt_space)
{
if($opt_space !~/^(vector|simil)/)
{
print STDERR "ERROR($0):
--space should be either 'vector' or 'similarity'.\n";
exit 1;
}
}
if($opt_feature !~/^(bi(gram)?|co(occur|c)?|uni(gram)?|tco(occur|c)?)/)
{
print STDERR "ERROR($0):
Specified Feature type --$opt_feature is not supported.\n";
exit 1;
}
if($opt_feature=~/^uni(gram)?/ && !defined $opt_lsa && (!defined $opt_context || $opt_context =~ /o2|order2/))
{
print STDERR "ERROR($0):
--feature cannot be $opt_feature when --context is o2,
unless --lsa is also specified\n";
exit 1;
}
if(defined $opt_split && ($opt_split >=100 || $opt_split <= 0))
{
print STDERR "ERROR($0):
The N value for the --split option should be between 1 to 99\n";
exit 1;
}
# Option validations for Word Clustering and headless input data.
# Find the type (headed/headless) of the Test and Train data
# and then handle the following cases:
# Case 1: train headless / test headed
# Case 2: train headless / test headless
# Case 3: train headed / test headed
# Case 4: train headed / test headless
my $TestType = 0; # By default headed
my $TrainType = 0; # By default headed
# check the Test data for <head> tag
open (INP,$testfile) || die "Error($0):
Error(code=$!) in opening <$testfile> file.\n";
# read the complete file in single instruction instead of reading line by line.
my $temp_delimiter = $/;
$/ = undef;
my $inp_str = <INP>;
$/ = $temp_delimiter;
close INP;
# If the --eval option specified Then check if answer tags present
if(defined $opt_eval)
{
if($inp_str !~ m/<answer/)
{
print STDERR "ERROR($0):
The --eval option cannot be used with unlabeled data.
In other words, the experiment cannot be evaluated if
the input Senseval2 file does not contain the answer tags.\n";
exit 1;
}
}
if($inp_str !~ m/<head>.+<\/head>/i)
{
$TestType = 1; # headless
}
# when separate training data specified
if(defined $opt_training)
{
# training data cannot be provided with word clustering.
if(defined $opt_wordclust)
{
if (defined $opt_lsa)
{
print STDERR "ERROR($0):
--training option cannot be used with feature-clustering.\n";
}
else
{
print STDERR "ERROR($0):
--training option cannot be used with word-clustering.\n";
}
exit 1;
}
# check if the training file exists
if(!-e $opt_training)
{
print STDERR "ERROR($0):
Could not locate the TRAIN file <$opt_training>\n";
exit 1;
}
# check if the training file is a text file.
if(!-T $opt_training)
{
print STDERR "ERROR($0):
Training file has to be a plain text file.
The provided file is not a text file. \n";
exit 1;
}
open (INP,$opt_training) || die "Error($0):
Error(code=$!) in opening <$opt_training> file.\n";
# read the complete file in single instruction instead of reading line by line.
my $temp_delimiter = $/;
$/ = undef;
my $inp_str = <INP>;
$/ = $temp_delimiter;
close INP;
# check if the training file is senseval2 formatted file - if yes quit.
if($inp_str =~ m/<corpus/i && $inp_str =~ m/<lexelt/i && $inp_str =~ m/<instance/i && $inp_str =~ m/<context/i)
{
print STDERR "ERROR($0):
Training file has to be a plain text file.
The provided file is a Senseval2 formatted file. \n";
exit 1;
}
# check if the training file is html formatted file - if yes quit.
if($inp_str =~ m/<html/i)
{
print STDERR "ERROR($0):
Training file has to be a plain text file.
The provided file is a html formatted file. \n";
exit 1;
}
# check if the training file is xml formatted file - if yes quit.
if($inp_str =~ m/<\?xml/i)
{
print STDERR "ERROR($0):
Training file has to be a plain text file.
The provided file is an xml formatted file. \n";
exit 1;
}
# checks the Train data for <head> tag
if($inp_str !~ m/<head>.+<\/head>/i)
{
$TrainType = 1;
}
}
else # Test data to be used as Train data thus $TrainType = $TestType
{
$TrainType = $TestType;
}
# scope cannot be used with headless training data
if (defined $opt_scope_train && $TrainType == 1) {
print STDERR "ERROR($0):
--scope_train option cannot be used
when the Train data is headless.\n";
exit 1;
}
# scope cannot be used with headless test data, or when word clustering
# is requested
if (defined $opt_scope_test && ($TestType == 1 || defined $opt_wordclust)) {
print STDERR "ERROR($0):
--scope_test option cannot be used when the Test data
is headless or when word clustering is requested.\n";
exit 1;
}
# word-clustering is treated as headless type of clustering thus
# 1. check for target co-occurrence
# 2. target option
if(defined $opt_wordclust)
{
# we do not allow tco as the feature type, headed data is allowed but
# <head>...</head> is simply a normal token in this case
if(defined $opt_feature && $opt_feature eq "tco")
{
print STDERR "ERROR($0):
target co-occurrences (tco) cannot be used as the feature
type with word-clustering.\n";
exit 1;
}
# headless case which cannot allow target file option
if(defined $opt_target)
{
print STDERR "ERROR($0):
--target option cannot be used with word-clustering.\n";
exit 1;
}
}
# --lsa cannot be used in o1 context representation, unless word clustering is specified
if (defined $opt_lsa) {
if(defined $opt_context && $opt_context =~ /o1|order1/ && !defined $opt_wordclust) {
print STDERR "ERROR($0):
--lsa option cannot be used with --context o1
without specifying the --wordclust option\n";
exit 1;
}
if((!defined $opt_context || $opt_context =~ /o2|order2/) && defined $opt_wordclust) {
print STDERR "ERROR($0):
--lsa option can be used either with \"--context o2\"
(the default) or with \"--context o1 --wordclust\" options,
but not with \"--context o2 --wordclust\".\n";
exit 1;
}
}
# Case 1: train headless / test headed
if($TrainType == 1 && $TestType == 0)
{
# headless training data cannot allow tco as the feature type
if(defined $opt_feature && $opt_feature eq "tco")
{
print STDERR "ERROR($0):
target co-occurrences (tco) cannot be used as the feature
type when the Test/Train data is headless.\n";
exit 1;
}
}
# Case 2: train headless / test headless And
# Case 4: train headed / test headless
if(($TrainType == 1 && $TestType == 1) || ($TrainType == 0 && $TestType == 1))
{
# headless case which cannot allow target file option
if(defined $opt_target)
{
print STDERR "ERROR($0):
--target option cannot be used with headless clustering.\n";
exit 1;
}
# headless case which cannot allow tco as the feature type
if(defined $opt_feature && $opt_feature eq "tco")
{
print STDERR "ERROR($0):
target co-occurrences (tco) cannot be used as the feature type
when the Test/Train data is headless.\n";
exit 1;
}
}
# Case 3: train headed / test headed
# No Special error checks required
# Check if Test and Train specified by user and --split option is also used
if(defined $opt_training && defined $opt_split)
{
print STDERR "ERROR($0):
Cannot use --split option to split the input data
into Test and Train portions if separate Training data
(--training) is already specified.\n";
exit 1;
}
# the dist (Euclidean distance) and jacc (Jaccard) similarity measures
# are valid only in vector space with the graph clustering method.
if((!defined $opt_space || $opt_space eq "vector") && (!defined $opt_clmethod || $opt_clmethod ne "graph") && defined $opt_sim && ($opt_sim eq "dist" || $opt_sim eq "jacc"))
{
print STDERR "ERROR($0):
Similarity Measures (--sim) Euclidean distance and Jaccard can
only be used if Clustering Method(--clmethod graph) is Graph
and Clustering Space (--space vector) is Vector.\n";
exit 1;
}
if(defined $opt_space && $opt_space eq "similarity" && !defined $opt_binary && defined $opt_sim && $opt_sim ne "cos")
{
print STDERR "ERROR($0):
Only Cosine Similarity Measure (--sim cos) is a valid option
if Clustering space is similarity (--space similarity)
and --binary option is not ON.\n";
exit 1;
}
if(defined $opt_space && $opt_space eq "similarity" && defined $opt_clmethod && $opt_clmethod eq "bagglo")
{
print STDERR "ERROR($0):
Partitional biased Agglomerative clustering (--clmethod bagglo)
available only for vector space.\n";
exit 1;
}
if(defined $opt_clmethod && $opt_clmethod ne "agglo" && defined $opt_crfun && ($opt_crfun eq "slink" || $opt_crfun eq "wslink" || $opt_crfun eq "clink" || $opt_crfun eq "wclink" || $opt_crfun eq "upgma"))
{
print STDERR "ERROR($0):
$opt_crfun Criterion Function (--crfun $opt_crfun) valid only if
Clustering Method is agglomerative (--clmethod agglo). \n";
exit 1;
}
# Error Checks for the rowmodel and colmodel options of Cluto
if(defined $opt_rowmodel && $opt_rowmodel !~/^(none|maxtf|sqrt|log)$/)
{
print STDERR "ERROR($0):
Specified rowmodel value: $opt_rowmodel is not supported.\n";
exit 1;
}
if(defined $opt_space && $opt_space eq "similarity" && defined $opt_rowmodel)
{
print STDERR "ERROR($0):
--rowmodel option can be used only in vector space. \n";
exit 1;
}
if(defined $opt_colmodel && $opt_colmodel !~/^(none|idf)$/)
{
print STDERR "ERROR($0):
Specified colmodel value: $opt_colmodel is not supported.\n";
exit 1;
}
if(defined $opt_space && $opt_space eq "similarity" && defined $opt_colmodel)
{
print STDERR "ERROR($0):
--colmodel option can be used only in vector space. \n";
exit 1;
}
# cluster stopping related initializations and error checks
# if neither #clusters nor cluster-stopping measure specified
if(!defined $opt_clusters && !defined $opt_cluststop)
{
$opt_clusters = $default_clusters;
}
if(defined $opt_clusters && defined $opt_cluststop)
{
print STDERR "ERROR($0):
--clusters and --cluststop options cannot be used together. \n";
exit 1;
}
if(defined $opt_cluststop && $opt_cluststop !~ /^(all|pk|pk1|pk2|pk3|gap)$/i)
{
print STDERR "ERROR($0):
$opt_cluststop not a valid option value for --cluststop. \n";
exit 1;
}
if(!defined $opt_cluststop && defined $opt_threspk1)
{
print STDERR "ERROR($0):
--threspk1 option can be used only when using --cluststop option. \n";
exit 1;
}
if(!defined $opt_cluststop && defined $opt_delta)
{
print STDERR "ERROR($0):
--delta option can be used only when using --cluststop option. \n";
exit 1;
}
if(defined $opt_typeref && $opt_typeref !~ /^(rep|ref)$/i)
{
print STDERR "ERROR($0):
$opt_typeref not a valid option value for --typeref. \n";
exit 1;
}
if(defined $opt_percentage && ($opt_percentage < 0 || $opt_percentage > 100))
{
print STDERR "ERROR($0):
The value for --percentage must be in the range [0,100] (inclusive).\n";
exit 1;
}
##############################################################################
# -------------------------
# Preprocessing
# -------------------------
if(defined $opt_verbose)
{
print STDERR "Preprocessing the input data ...\n";
}
# if TEST contains actual sense tags,
# filter TEST to remove multiple
# senses / instance
if(defined $opt_eval)
{
# removing multiple senses of TEST instances
$test_report="$prefix.test_report";
$status=system("frequency.pl $testfile > $test_report");
die "Error while running frequency.pl on <$testfile>\n" unless $status==0;
$status=system("filter.pl --percent 0 --nomulti $testfile $test_report > $testfile.nomulti");
die "Error while running filter.pl on <$testfile>\n" unless $status==0;
# applying filters now
if(defined $opt_rank_filter || defined $opt_percent_filter)
{
if(defined $opt_verbose)
{
print STDERR "Removing Low Frequency Senses from TEST ...\n";
}
if(defined $opt_rank_filter)
{
$filter_string="--rank $opt_rank_filter ";
}
else
{
$filter_string="--percent $opt_percent_filter ";
}
$status=system("filter.pl $filter_string $testfile.nomulti $test_report > $testfile.filtered");
die "Error while running filter.pl on <$testfile.nomulti>\n" unless $status==0;
$testfile="$testfile.filtered";
}
else
{
$testfile="$testfile.nomulti";
}
}
if(defined $opt_training)
{
$train_plain=$opt_training;
$tmp_testfile = "$testfile.pro";
$status = system("preprocess.pl --token $token --removeNotToken --xml $tmp_testfile --nocount $testfile");
die "Error in running preprocess.pl on <$testfile>\n" unless $status==0;
$testfile = $tmp_testfile;
}
else
{
if(defined $opt_split)
{
# convert test in sval2 to plain, process the test file and also split the data
$train_plain="$prefix.train_plain";
$tmp_testfile = "$testfile.pro";
$status = system("preprocess.pl --token $token --removeNotToken --xml $tmp_testfile --count $train_plain --split $opt_split $testfile");
die "Error in running preprocess.pl on <$testfile>\n" unless $status==0;
# delete the unnecessary files that get created by preprocess.pl when used with the split option
unlink "$tmp_testfile-training","$train_plain-test";
# use the appropriate test and train file henceforth
$testfile = "$tmp_testfile-test";
$train_plain = "$train_plain-training";
$train_sval2=$testfile;
}
else
{
# convert test in sval2 to plain and also clean the test file
$train_plain="$prefix.train_plain";
$tmp_testfile = "$testfile.pro";
$status = system("preprocess.pl --token $token --removeNotToken --xml $tmp_testfile --count $train_plain $testfile");
die "Error in running preprocess.pl on <$testfile>\n" unless $status==0;
# use the clean test file henceforth
$testfile = $tmp_testfile;
$train_sval2=$testfile;
}
}
############################################
# Localizing the Context Scope in Training
############################################
if(defined $opt_scope_train)
{
if(defined $opt_verbose)
{
print STDERR "Localizing the Context Scope in TRAIN ...\n";
}
if(!defined $train_sval2)
{
# converting training data to sval2 format
$train_sval2="$prefix.train_sval2";
$status=system("text2sval.pl $train_plain > $train_sval2");
die "Could not run text2sval.pl on <$train_plain>\n" unless $status==0;
}
# running windower
$train_context="$prefix.train_context";
if(defined $opt_target)
{
$status=system("windower.pl --plain --target $target --token $token $train_sval2 $opt_scope_train > $train_context");
die "Error while running windower.pl on <$train_sval2>\n" unless $status ==0;
}
else
{
$status=system("windower.pl --plain --token $token $train_sval2 $opt_scope_train > $train_context");
die "Error while running windower.pl on <$train_sval2>\n" unless $status ==0;
}
$train=$train_context;
}
else
{
$train=$train_plain;
}
######################
# Selecting Features
######################
if($opt_feature =~ /^uni(gram)?/)
{
if(defined $opt_verbose)
{
print STDERR "Computing Unigram Counts ...\n";
}
$unigrams="$prefix.unigrams";
$count_string="";
if(defined $opt_remove)
{
$count_string="--remove $opt_remove ";
}
if(defined $opt_stop)
{
$count_string.="--stop $opt_stop ";
}
$status=system("count.pl --ngram 1 --newLine --token $token $count_string $unigrams $train");
die "Error while running count.pl with --ngram 1 on <$train>\n" unless $status==0;
}
else
{
###########################
# Computing Bigram Counts
###########################
if(defined $opt_verbose)
{
print STDERR "Computing Bigram Counts ...\n";
}
$bigrams="$prefix.bigrams";
$count_string="";
if(defined $opt_remove)
{
$count_string="--remove $opt_remove ";
}
if(defined $opt_window)
{
$count_string.="--window $opt_window ";
}
if(defined $opt_stop)
{
$count_string.="--stop $opt_stop ";
}
$status=system("count.pl --extended --newLine --token $token $count_string $bigrams $train");
die "Error while running count.pl on <$train>\n" unless $status==0;
###################
# Combining Counts
###################
if($opt_feature =~/^(co(occur|c)?|tco(occur|c)?)/)
{
if(defined $opt_verbose)
{
print STDERR "Combining Bigrams into Co-occurrence pairs ...\n";
}
# check the number of bigram features present
open (INP,"<$bigrams") || die "Error($0):
Error(code=$!) in opening <$bigrams> file\n";
my $feat_cnt = 0;
while(<INP>)
{
# skip the header
if(/^@/)
{
next;
}
# capture the count
if(/^(\d+)/)
{
$feat_cnt = $1;
last;
}
}
close INP;
if(!$feat_cnt)
{
if($opt_feature =~/^tco(occur|c)?/)
{
print STDERR "ERROR($0):
0 FEATURES found in the <$bigrams> file.
This will lead to 0 co-occurrence features and 0 target
co-occurrence features. Therefore aborting the experiment.\n";
}
else
{
print STDERR "ERROR($0):
0 FEATURES found in the <$bigrams> file.
This will lead to 0 co-occurrence features. Therefore aborting
the experiment.\n";
}
exit 1;
}
$pairs="$prefix.cocs";
$status=system("combig.pl $bigrams > $pairs");
die "Error while running combig.pl on <$bigrams>\n" unless $status==0;
if($opt_feature =~ /^tco(occur|c)?/) # target co-occurrences
{
if(defined $opt_verbose)
{
print STDERR "Finding Target Co-occurrences ...\n";
}
# select the target co-occurrences from the *.cocs file
$target_pairs = "$prefix.target_cocs";
open (INP,"<$pairs") || die "Error($0):
Error(code=$!) in opening <$pairs> file\n";
open (OUT,">$target_pairs") || die "Error($0):
Error(code=$!) in opening <$target_pairs> file.\n";
# select the word pairs with target word and write to a temp file
# keep the count of number of such target word-pairs selected
# extract the total number of features from the cocs file
# usually the first number in the file.
$total_feat = 0;
# stop at end of file so that a malformed file cannot loop forever
while(defined($sent = <INP>))
{
if($sent =~ m/^\s*(\d+)\s*$/)
{
$total_feat = $1;
}
last if($total_feat != 0);
}
# write the total number of features on the first line of the output file
print OUT "$total_feat\n";
while(<INP>)
{
# find and write out the target co-occurrences to the output file
if(m/<head>.+<\/head>/i)
{
print OUT;
}
}
close INP;
close OUT;
$pairs=$target_pairs;
}
}
else
{
$pairs=$bigrams;
}
######################
# Running Statistic
######################
if(defined $opt_stat)
{
if(defined $opt_verbose)
{
print STDERR "Performing Statistics on Word Pairs ...\n";
}
$statistic="$prefix.statistic";
$stat_string="";
if(defined $opt_stat_rank)
{
$stat_string.="--rank $opt_stat_rank ";
}
if(defined $opt_stat_score)
{
$stat_string.="--score $opt_stat_score ";
}
# included statistic.pl's --precision option if format option specified
$stat_string .= " --precision $prec ";
$stat_string.="$opt_stat ";
$status=system("statistic.pl $stat_string $statistic $pairs");
die "Error while running statistic.pl on <$pairs>\n" unless $status ==0;
$scores=$statistic;
}
else
{
$scores=$pairs;
}
}
#############################
# Creating Context Vectors
#############################
$vectors="$prefix.vectors";
# -------------------------
# defining context scope
# -------------------------
if(defined $opt_scope_test)
{
if(defined $opt_verbose)
{
print STDERR "Localizing the Context Scope in TEST ...\n";
}
$test_context="$prefix.test_context";
if(defined $opt_target)
{
$status=system("windower.pl --token $token --target $target $testfile $opt_scope_test > $test_context");
die "Error while running windower.pl on <$testfile>\n" unless $status==0;
}
else
{
$status=system("windower.pl --token $token $testfile $opt_scope_test > $test_context");
die "Error while running windower.pl on <$testfile>\n" unless $status==0;
}
}
else
{
$test_context=$testfile;
}
$rlabel="$prefix.rlabel";
if(defined $opt_eval)
{
$rclass="$prefix.rclass";
$rclass_string="--rclass $rclass";
}
else
{
$rclass_string="";
}
$clabel="$prefix.clabel";
# turned ON if svd defined and actually applied
my $svd_flag = 0;
# default context representation is order2
if(!defined $opt_context || $opt_context =~/o2|order2/)
{
# do not rename any feature file to .features file yet, since
# wordvec.pl produces a new .features file
# just decide for now which is the features file
if ($opt_feature =~ /^uni(gram)?/) {
$featuresfile = $unigrams;
} else {
$featuresfile = $scores;
}
# check if at least 10 features are present in the features file.
open(FEAT,$featuresfile) || die "Error($0):
Error(code=$!) while opening the feature file <$featuresfile>\n";
# read the complete file in single instruction instead of reading line by line.
my $temp_delimiter = $/;
$/ = undef;
my $inp_str = <FEAT>;
$/ = $temp_delimiter;
close FEAT;
my $feat_cnt = 0;
while($inp_str =~ m/<>.*\n/g && $feat_cnt < 10)
{
$feat_cnt++;
}
if($feat_cnt < 10)
{
print STDERR "ERROR($0):
Only $feat_cnt FEATURES found in the <$featuresfile> file.
At least 10 FEATURES required to proceed with context
representation.\n";
exit 1;
}
if (defined $opt_lsa) {
# we will get feature vectors from a feature-by-context matrix
# the extension is maintained to be .wordvec to be
# consistent with the web interface interpretation as of now
$featvec="$prefix.wordvec";
$features = "$prefix.features";
# move the appropriate feature output file as the .features file
if ($opt_feature =~ /^uni(gram)?/) {
$status = system("mv $unigrams $features");
die "Error while moving <$unigrams> file to <$features>\n" unless $status==0;
} else {
$status = system("mv $scores $features");
die "Error while moving <$scores> file to <$features>\n" unless $status==0;
}
# -----------------------
# finding feature regexs
# -----------------------
if(defined $opt_verbose)
{
print STDERR "Finding Feature Regex/s ...\n";
}
$feature_regex="$prefix.regex";
$status=system("nsp2regex.pl $features > $feature_regex");
die "Error while running nsp2regex.pl on <$features>\n" unless $status==0;
if(defined $opt_verbose)
{
print STDERR "Building First Order Vectors for LSA...\n";
}
# we are doing context clustering in lsa fashion
# binary requested
if(defined $opt_binary)
{
$binary="--binary";
}
else
{
$binary="";
}
$o1_presvd="$prefix.o1_presvd";
# do not generate the .rclass file and the .rlabel / .clabel files
# generate the .testregex file which corresponds to the features
# identified in the test data, this needs to be passed to
# order2vec.pl later
# Also specify --transpose option, for getting a feature-by-context
# representation
$testregex = "$prefix.testregex";
$status=system("order1vec.pl --transpose --testregex $testregex $binary $test_context $feature_regex > $o1_presvd");
die "Error while running order1vec.pl on <$test_context>\n" unless $status==0;
# the keyfile produced by order1vec.pl should be removed, since later
# order2vec.pl will create another one
unlink <keyfile*.key>;
# set input file for svd
$svdinput = $o1_presvd;
# set an output file name for svd
$postsvdvectors = $featvec;
} else {
# we are doing either context clustering or word clustering in SC fashion
if(defined $opt_verbose)
{
print STDERR "Building Word Vectors ...\n";
}
$wordvec="$prefix.wordvec";
# creating word vectors from scores file
$wordvec_presvd="$prefix.wordvec_presvd";
$features = "$prefix.features";
$dims="$prefix.dims";
$wordvec_string="--feats $features --dims $dims ";
if($opt_feature=~/^(co(occur|c)?|tco(occur|c)?)/)
{
$wordvec_string.="--wordorder nocare ";
}
else
{
$wordvec_string.="--wordorder follow ";
}
if(defined $opt_binary)
{
$wordvec_string.="--binary ";
}
$status=system("wordvec.pl --format $format $wordvec_string $scores > $wordvec_presvd");
die "ERROR($0): Error while running wordvec.pl\n" unless $status==0;
# set input file for svd
$svdinput = $wordvec_presvd;
# set an output file name for svd
$postsvdvectors = $wordvec;
}
# SVD
if(defined $opt_svd)
{
# Check whether svd would reduce the number of features (i.e. the number
# of columns) to fewer than 10; if so, do not perform svd
open(INSVD,$svdinput) || die "Error($0):
Error(code=$!) in opening Matrix file <$svdinput>\n";
# line1 in Matrix file should either show the
# <keyfile> tag or #rows #cols #nnz
$line1=<INSVD>;
if($line1=~/keyfile/)
{
$line1=<INSVD>;
}
if($line1=~/^\s*(\d+)\s+(\d+)\s+(\d+)\s*$/)
{
$rows=$1;
$cols=$2;
$nnz1=$3;
}
else
{
print STDERR "ERROR($0):
Line $line1 in Matrix file <$svdinput> should show #rows #cols #nnz\n";
exit 1;
}
close INSVD;
$flag_svd = 0;
$maxprs=$opt_k > ($cols/$opt_rf) ? int($cols/$opt_rf) : $opt_k;
if($maxprs >= 10)
{
if(defined $opt_verbose)
{
print STDERR "Performing SVD ...\n";
}
$svd_flag = 1;
# calling svd(input,output)
svd($svdinput, $postsvdvectors);
$flag_svd = 1;
}
else
{
print STDERR "WARNING($0):
SVD could not be performed on SVDINPUT <$svdinput>
because svd with reduction factor = $opt_k and scaling
factor = $opt_rf would reduce the resultant number of
features to = $maxprs, computed via (min($opt_k, $cols/$opt_rf)).
The minimum number of features required for representing
the contexts is 10\n";
$status=system("mv $svdinput $postsvdvectors");
die "Error while creating <$postsvdvectors> file.\n" unless $status==0;
}
}
else
{
$status=system("mv $svdinput $postsvdvectors");
die "Error while creating <$postsvdvectors> file.\n" unless $status==0;
}
# If word clustering (synonym finding) do not create context vectors but
# instead pass the word vectors to the clustering stage.
if(defined $opt_wordclust)
{
$status=system("mv $wordvec $vectors");
die "Error while creating <$vectors> file.\n" unless $status==0;
$status=system("mv $features $rlabel");
die "Error while creating <$rlabel> file.\n" unless $status==0;
$status=system("mv $dims $clabel");
die "Error while creating <$clabel>\n" unless $status==0;
}
else
{
# --------------------------
# Creating Context Vectors
# --------------------------
if (!defined $opt_lsa) {
# only in native SC order2 context clustering mode, generate
# a regex file from the output of wordvec.pl. we don't do
# this immediately after calling wordvec.pl above because the
# file would be created unnecessarily in SC word clustering mode.
# generate a .testregex file from the $features file created by
# wordvec.pl
$testregex = "$prefix.testregex";
$status=system("nsp2regex.pl $features > $testregex");
die "Error while running nsp2regex.pl on <$features>\n" unless $status==0;
}
if(defined $opt_verbose)
{
print STDERR "Building 2nd Order Context Vectors ...\n";
}
$context_string="--rlabel $rlabel ";
if(defined $opt_svd && $flag_svd == 1)
{
$context_string.="--dense ";
}
if(defined $opt_binary)
{
$context_string.="--binary ";
}
$status=system("order2vec.pl --format $format $context_string $rclass_string $test_context $postsvdvectors $testregex > $vectors");
die "Error while running order2vec.pl on <$test_context>\n" unless $status==0;
}
}
# requested context type is order1
else
{
$features="$prefix.features";
if($opt_feature=~/^uni(gram)?/)
{
$status=system("mv $unigrams $features");
die "Error while creating Unigram Feature file <$features>\n" unless $status==0;
}
else
{
$status=system("mv $scores $features");
die "Error while creating Bigram Feature file <$features>\n" unless $status==0;
}
# else # target co-occurrences
# {
# if(defined $opt_verbose)
# {
# print STDERR "Finding Target Co-occurrences ...\n";
# }
# # run kocos to find co-occurrences from scores file
# $status=system("kocos.pl --order 1 --regex $target $scores > $features");
# die "Error while running kocos.pl on $scores.\n" unless $status==0;
# }
# check if at least 10 features are present in the features file.
open(FEAT,$features) || die "Error($0):
Error(code=$!) while opening the feature file <$features>\n";
# read the complete file in single instruction instead of reading line by line.
my $temp_delimiter = $/;
$/ = undef;
my $inp_str = <FEAT>;
$/ = $temp_delimiter;
close FEAT;
my $feat_cnt = 0;
while($inp_str =~ m/<>.*\n/g && $feat_cnt < 10)
{
$feat_cnt++;
}
if($feat_cnt < 10)
{
print STDERR "ERROR($0):
Only $feat_cnt FEATURES found in the <$features> file.
At least 10 FEATURES required to proceed with context
representation.\n";
exit 1;
}
# -----------------------
# finding feature regexs
# -----------------------
if(defined $opt_verbose)
{
print STDERR "Finding Feature Regex/s ...\n";
}
$feature_regex="$prefix.regex";
$status=system("nsp2regex.pl $features > $feature_regex");
die "Error while running nsp2regex.pl on <$features>\n" unless $status==0;
# -------------------------
# creating context vectors
# -------------------------
if(defined $opt_verbose)
{
print STDERR "Building 1st Order Context Vectors ...\n";
}
# binary requested
if(defined $opt_binary)
{
$binary="--binary";
}
else
{
$binary="";
}
$o1_presvd="$prefix.o1_presvd";
if (defined $opt_lsa) {
# do not create .rclass file and .clabel file in word / feature
# clustering
# create the .rlabel file and specify --transpose option to get
# feature-by-context output
# MJ - 06/30/2006
# we also need to specify --testregex option with --transpose,
# although we don't use it in LSA feature clustering.
$testregex = "$prefix.testregex";
$status=system("order1vec.pl --transpose --testregex $testregex --rlabel $rlabel $binary $test_context $feature_regex > $o1_presvd");
} else {
# print STDERR "order1vec.pl $binary --rlabel $rlabel $rclass_string --clabel $clabel $test_context $feature_regex > $o1_presvd\n";
$status=system("order1vec.pl $binary --rlabel $rlabel $rclass_string --clabel $clabel $test_context $feature_regex > $o1_presvd");
}
die "ERROR ($0):
Error (code=$!) while running order1vec.pl on <$test_context>\n" unless $status==0;
$svdinput = $o1_presvd;
# SVD
if(defined $opt_svd)
{
# Check whether svd would reduce the number of features (i.e. the number
# of columns) to fewer than 10; if so, do not perform svd
open(INSVD,$svdinput) || die "Error($0):
Error(code=$!) in opening Matrix file <$svdinput>\n";
# line1 in Matrix file should either show the
# <keyfile> tag or #rows #cols #nnz
$line1=<INSVD>;
if($line1=~/keyfile/)
{
$line1=<INSVD>;
}
if($line1=~/^\s*(\d+)\s+(\d+)\s+(\d+)\s*$/)
{
$rows=$1;
$cols=$2;
$nnz1=$3;
}
else
{
print STDERR "ERROR($0):
Line $line1 in Matrix file <$svdinput> should show #rows #cols #nnz\n";
exit 1;
}
close INSVD;
$maxprs=$opt_k > ($cols/$opt_rf) ? int($cols/$opt_rf) : $opt_k;
if($maxprs >= 10)
{
if(defined $opt_verbose)
{
print STDERR "Performing SVD ...\n";
}
$svd_flag = 1;
# calling svd function
svd($svdinput,$vectors);
}
else
{
print STDERR "WARNING($0):
SVD could not be performed on SVDINPUT <$svdinput>
because svd with reduction factor = $opt_k and scaling
factor = $opt_rf would reduce the resultant number of
features to = $maxprs, computed via (min($opt_k, $cols/$opt_rf)).
The minimum number of features required for representing
the contexts is 10\n";
$status=system("mv $svdinput $vectors");
die "Error while creating file <$vectors>\n" unless $status==0;
}
}
else
{
$status=system("mv $svdinput $vectors");
die "Error while creating file <$vectors>\n" unless $status==0;
}
}
##############
# Clustering
##############
# cluster stopping param string
$cluststop_str = "";
# params common to both vcluster and scluster
$cluster_str ="--rlabelfile $rlabel ";
if(defined $opt_clmethod)
{
$cluster_str .="--clmethod $opt_clmethod ";
if($opt_clmethod =~ /^(rb|rbr|direct|agglo|bagglo)$/i)
{
$cluststop_str .="--clmethod $opt_clmethod ";
}
else
{
$cluststop_str .="--clmethod rb ";
}
}
if(defined $opt_crfun)
{
$cluster_str .="--crfun $opt_crfun ";
if($opt_crfun =~ /^(i1|i2|h1|h2|e1)$/i)
{
$cluststop_str .="--crfun $opt_crfun ";
}
else
{
$cluststop_str .="--crfun i2 ";
}
}
# cluster in vector space
if(!defined $opt_space || $opt_space =~/^vector$/)
{
if(defined $opt_verbose)
{
print STDERR "Clustering in Vector Space ...\n";
}
# build the string of params for vcluster
$vclus_str = $cluster_str;
if(defined $opt_sim)
{
$vclus_str .= "--sim $opt_sim ";
if($opt_sim =~ /^(cos|corr)$/i)
{
$cluststop_str .= "--sim $opt_sim ";
}
else
{
$cluststop_str .= "--sim cos ";
}
if($opt_sim =~ /^co/)
{
$vclus_str .="--showfeatures ";
}
}
$clabel_str = "";
if (-f $clabel) {
$clabel_str = "--clabel $clabel";
}
$vclus_str .="--nfeatures 10 $clabel_str ";
# row scaling option
if(defined $opt_rowmodel)
{
$vclus_str .= "--rowmodel $opt_rowmodel ";
$cluststop_str .= "--rowmodel $opt_rowmodel ";
}
else
{
$vclus_str .= "--rowmodel none ";
$cluststop_str .= "--rowmodel none ";
}
# column scaling option
if(defined $opt_colmodel)
{
$vclus_str .= "--colmodel $opt_colmodel ";
$cluststop_str .= "--colmodel $opt_colmodel ";
}
else
{
$vclus_str .= "--colmodel none ";
$cluststop_str .= "--colmodel none ";
}
# cluster stopping
if(defined $opt_cluststop)
{
$cluststop = $opt_cluststop;
if(defined $opt_verbose)
{
print STDERR "Finding Number of Clusters with Cluster Stopping...\n";
}
if(defined $opt_threspk1)
{
$cluststop_str .= "--threspk1 $opt_threspk1 ";
}
if(defined $opt_delta)
{
$cluststop_str .= "--delta $opt_delta ";
}
if(defined $opt_B)
{
$cluststop_str .= "--B $opt_B ";
}
if(defined $opt_typeref)
{
$cluststop_str .= "--typeref $opt_typeref ";
}
if(defined $opt_percentage)
{
$cluststop_str .= "--percentage $opt_percentage ";
}
if(defined $opt_seed)
{
$cluststop_str .= "--seed $opt_seed ";
}
$cluststop_str .= "--space vector --measure $opt_cluststop --precision $prec ";
$status = system("clusterstopping.pl --prefix $prefix $cluststop_str $vectors > $prefix.predictions 2>&1");
# error handling for clusterstopping.pl
if ($status != 0)
{
my $tmp = uc $opt_cluststop;
# if predictions file not created fall-back to using the default #clusters
if(!-e "$prefix.predictions")
{
print STDERR "WARNING($0):
Could not locate the PREDICTIONS <$prefix.predictions>
file which indicates that the cluster-stopping measure
$tmp failed to predict the optimal number of clusters
for the VECTORS <$vectors> file.
Proceeding with the default number of clusters of $default_clusters\n\n";
# default number of clusters
$opt_clusters = $default_clusters;
}
else
{
# if predictions file exists then print out the error message present in the file
# and fall-back to using the default #clusters
open (TFP,"$prefix.predictions") || die "Error($0):
Error(code=$!) in opening <$prefix.predictions> file.\n";
$errstr = "";
while(<TFP>)
{
$errstr .= $_;
}
close TFP;
print STDERR "WARNING($0):
$errstr
The cluster-stopping measure $tmp failed to predict the
optimal number of clusters for <$vectors>
Proceeding with the default number of clusters of $default_clusters\n\n";
# default #clusters
$opt_clusters = $default_clusters;
}
# undefine cluster-stopping option to indicate that the #clusters being used is not
# predicted by the measures but is set manually to the default value.
$opt_cluststop = undef;
# proceed with the default #clusters
$num_k = 0;
$predict[$num_k] = $opt_clusters;
$cluster_solution ="$prefix.cluster_solution";
$cluster_output ="$prefix.cluster_output";
$vclus_str .="--clustfile $cluster_solution ";
# running vcluster
# use the --showtree option only if the #clusters is greater than 1
if($opt_clusters > 1)
{
my $tmp_fig_str = "--showtree --plotclusters $prefix.dendogram.ps --plotformat ps ";
system("vcluster $vclus_str $rclass_string $tmp_fig_str $vectors $opt_clusters > $cluster_output");
}
else
{
system("vcluster $vclus_str $rclass_string $vectors $opt_clusters > $cluster_output");
}
}
else # If clusterstopping.pl ran successfully.
{
open (TFP,"$prefix.predictions") || die "Error($0):
Error(code=$!) in opening <$prefix.predictions> file.\n";
$num_k = 0;
while(<TFP>)
{
chomp;
$predict[$num_k++] = $_;
}
$num_k--;
close TFP;
$i = 0;
while($i <= $num_k)
{
$opt_clusters = $predict[$i];
if($cluststop ne "all" && $cluststop ne "pk")
{
$cluster_solution ="$prefix.cluster_solution.$cluststop";
$cluster_output ="$prefix.cluster_output.$cluststop";
$dendo_file = "$prefix.$cluststop.dendogram.ps";
}
else
{
if($i == 0)
{
$cluster_solution ="$prefix.cluster_solution.pk1";
$cluster_output ="$prefix.cluster_output.pk1";
$dendo_file = "$prefix.pk1.dendogram.ps";
}
elsif($i == 1)
{
$cluster_solution ="$prefix.cluster_solution.pk2";
$cluster_output ="$prefix.cluster_output.pk2";
$dendo_file = "$prefix.pk2.dendogram.ps";
}
elsif($i == 2)
{
$cluster_solution ="$prefix.cluster_solution.pk3";
$cluster_output ="$prefix.cluster_output.pk3";
$dendo_file = "$prefix.pk3.dendogram.ps";
}
elsif($i == 3)
{
$cluster_solution ="$prefix.cluster_solution.gap";
$cluster_output ="$prefix.cluster_output.gap";
$dendo_file = "$prefix.gap.dendogram.ps";
}
}
$update_str ="--clustfile $cluster_solution ";
# running vcluster
# use the --showtree option only if the #clusters is greater than 1
if($opt_clusters > 1)
{
my $tmp_fig_str = "--showtree --plotclusters $dendo_file --plotformat ps ";
system("vcluster $vclus_str $update_str $rclass_string $tmp_fig_str $vectors $opt_clusters > $cluster_output");
}
else
{
system("vcluster $vclus_str $update_str $rclass_string $vectors $opt_clusters > $cluster_output");
}
$i++;
}
}
}
else # if not using cluster stopping measures
{
$num_k = 0;
$predict[$num_k] = $opt_clusters;
$cluster_solution ="$prefix.cluster_solution";
$cluster_output ="$prefix.cluster_output";
$vclus_str .="--clustfile $cluster_solution ";
# running vcluster
# use the --showtree option only if the #clusters is greater than 1
if($opt_clusters > 1)
{
my $tmp_fig_str = "--showtree --plotclusters $prefix.dendogram.ps --plotformat ps ";
system("vcluster $vclus_str $rclass_string $tmp_fig_str $vectors $opt_clusters > $cluster_output");
}
else
{
system("vcluster $vclus_str $rclass_string $vectors $opt_clusters > $cluster_output");
}
}
}
else # cluster in similarity space
{
if(defined $opt_verbose)
{
print STDERR "Building Similarity Matrix ...\n";
}
# creating similarity matrix
$simat="$prefix.simat";
my $simat_string = " ";
if(defined $opt_svd && $svd_flag == 1)
{
$simat_string ="--dense ";
}
if(defined $opt_binary)
{
if(defined $opt_sim)
{
$simat_string .="--measure $opt_sim ";
}
$sim_program ="bitsimat.pl";
}
else
{
$sim_program ="simat.pl";
}
$status=system("$sim_program --format $format $simat_string $vectors > $simat");
die "Error while running $sim_program\n" unless $status==0;
if(defined $opt_verbose)
{
print STDERR "Clustering in Similarity Space ...\n";
}
# build the string of params for scluster
$sclus_str = $cluster_str;
# cluster stopping
if(defined $opt_cluststop)
{
$cluststop = $opt_cluststop;
if(defined $opt_verbose)
{
print STDERR "Finding Number of Clusters with Cluster Stopping...\n";
}
if(defined $opt_threspk1)
{
$cluststop_str .= "--threspk1 $opt_threspk1 ";
}
if(defined $opt_delta)
{
$cluststop_str .= "--delta $opt_delta ";
}
if(defined $opt_B)
{
$cluststop_str .= "--B $opt_B ";
}
if(defined $opt_typeref)
{
$cluststop_str .= "--typeref $opt_typeref ";
}
if(defined $opt_percentage)
{
$cluststop_str .= "--percentage $opt_percentage ";
}
if(defined $opt_seed)
{
$cluststop_str .= "--seed $opt_seed ";
}
$cluststop_str .= "--space similarity --measure $opt_cluststop --precision $prec ";
$status = system("clusterstopping.pl --prefix $prefix $cluststop_str $simat > $prefix.predictions 2>&1");
# error handling for clusterstopping.pl
# If clusterstopping.pl returned an error code
if ($status != 0)
{
my $tmp = uc $opt_cluststop;
# if predictions file not created fall-back to using the default #clusters
if(!-e "$prefix.predictions")
{
print STDERR "WARNING($0):
Could not locate the PREDICTIONS <$prefix.predictions>
file which indicates that the cluster-stopping measure
$tmp failed to predict the optimal number of clusters
for the VECTORS <$vectors> file.
Proceeding with the default number of clusters of $default_clusters\n\n";
# default #clusters
$opt_clusters = $default_clusters;
}
else
{
# if predictions file exists then print out the error message present in the file
# and fall-back to using the default #clusters
open (TFP,"$prefix.predictions") || die "Error($0):
Error(code=$!) in opening <$prefix.predictions> file.\n";
$errstr = "";
while(<TFP>)
{
$errstr .= $_;
}
close TFP;
print STDERR "WARNING($0):
$errstr
The cluster-stopping measure $tmp failed to predict the
optimal number of clusters for the given data.
Proceeding with the default number of clusters of $default_clusters\n\n";
# default #clusters
$opt_clusters = $default_clusters;
}
# undefine cluster-stopping option to indicate that the #clusters being used is not
# predicted by the measures but is set manually to the default value.
$opt_cluststop = undef;
# proceed with the default #clusters
$num_k = 0;
$predict[$num_k] = $opt_clusters;
$cluster_solution ="$prefix.cluster_solution";
$cluster_output ="$prefix.cluster_output";
$sclus_str .="--clustfile $cluster_solution ";
# running scluster
# use the -showtree option only if the #clusters is greater than 1
if($opt_clusters > 1)
{
my $tmp_fig_str = "--showtree --plotsclusters $prefix.dendogram.ps --plotformat ps ";
system("scluster $sclus_str $rclass_string $tmp_fig_str $simat $opt_clusters > $cluster_output");
}
else
{
system("scluster $sclus_str $rclass_string $simat $opt_clusters > $cluster_output");
}
}
else # If clusterstopping.pl ran successfully.
{
open (TFP,"$prefix.predictions") || die "Error($0):
Error(code=$!) in opening <$prefix.predictions> file.\n";
$num_k = 0;
while(<TFP>)
{
chomp;
$predict[$num_k++] = $_;
}
$num_k--;
close TFP;
$i = 0;
while($i <= $num_k)
{
$opt_clusters = $predict[$i];
if($cluststop ne "all" && $cluststop ne "pk")
{
$cluster_solution ="$prefix.cluster_solution.$cluststop";
$cluster_output ="$prefix.cluster_output.$cluststop";
$dendo_file = "$prefix.$cluststop.dendogram.ps";
}
else
{
if($i == 0)
{
$cluster_solution ="$prefix.cluster_solution.pk1";
$cluster_output ="$prefix.cluster_output.pk1";
$dendo_file = "$prefix.pk1.dendogram.ps";
}
elsif($i == 1)
{
$cluster_solution ="$prefix.cluster_solution.pk2";
$cluster_output ="$prefix.cluster_output.pk2";
$dendo_file = "$prefix.pk2.dendogram.ps";
}
elsif($i == 2)
{
$cluster_solution ="$prefix.cluster_solution.pk3";
$cluster_output ="$prefix.cluster_output.pk3";
$dendo_file = "$prefix.pk3.dendogram.ps";
}
elsif($i == 3)
{
$cluster_solution ="$prefix.cluster_solution.gap";
$cluster_output ="$prefix.cluster_output.gap";
$dendo_file = "$prefix.gap.dendogram.ps";
}
}
$update_str ="--clustfile $cluster_solution ";
# running scluster
# use the -showtree option only if the #clusters is greater than 1
if($opt_clusters > 1)
{
my $tmp_fig_str = "--showtree --plotsclusters $dendo_file --plotformat ps ";
system("scluster $sclus_str $update_str $rclass_string $tmp_fig_str $simat $opt_clusters > $cluster_output");
}
else
{
system("scluster $sclus_str $update_str $rclass_string $simat $opt_clusters > $cluster_output");
}
$i++;
}
}
}
else # if not using cluster stopping measures
{
$num_k = 0;
$predict[$num_k] = $opt_clusters;
$cluster_solution ="$prefix.cluster_solution";
$cluster_output ="$prefix.cluster_output";
$sclus_str .="--clustfile $cluster_solution ";
# running scluster
# use the -showtree option only if the #clusters is greater than 1
if($opt_clusters > 1)
{
my $tmp_fig_str = "--showtree --plotsclusters $prefix.dendogram.ps --plotformat ps ";
system("scluster $sclus_str $rclass_string $tmp_fig_str $simat $opt_clusters > $cluster_output");
}
else
{
system("scluster $sclus_str $rclass_string $simat $opt_clusters > $cluster_output");
}
}
}
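=pod

The vcluster and scluster calls above discard the value returned by
system(), while most other steps in this script test it with
"unless $status==0". A minimal sketch of that check as a reusable helper
follows. The name run_or_die is hypothetical and not part of SenseClusters;
the sketch relies only on the documented behaviour of Perl's system(), which
returns -1 when the command cannot be started and otherwise the wait status
(exit code in the high byte, signal number in the low 7 bits).

```perl
use strict;
use warnings;

# Hypothetical helper (not part of SenseClusters): run a shell command and
# return 1 only if it exited with status 0; die with a reason otherwise.
sub run_or_die
{
    my ($cmd) = @_;
    my $status = system($cmd);
    if ($status == -1)
    {
        # the command (or the shell) could not be started at all
        die "Could not start <$cmd>: $!\n";
    }
    if ($status & 127)
    {
        # low 7 bits hold the number of the signal that ended the command
        die sprintf("<%s> was killed by signal %d\n", $cmd, $status & 127);
    }
    if ($status != 0)
    {
        # high byte holds the exit code
        die sprintf("<%s> exited with code %d\n", $cmd, $status >> 8);
    }
    return 1;
}
```

Whether vcluster and scluster exit with 0 on success would need to be
confirmed before wrapping those calls this way.

=cut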
#*********************
# formatting clustering solution, show instances in each cluster
$i = 0;
while($i <= $num_k)
{
if(defined $opt_cluststop)
{
if($cluststop ne "all" && $cluststop ne "pk")
{
$clusters="$prefix.clusters.$cluststop";
$cluster_solution = "$prefix.cluster_solution.$cluststop";
$clusters_context = "$prefix.clusters_context.$cluststop";
}
else
{
if($i == 0)
{
$clusters="$prefix.clusters.pk1";
$cluster_solution = "$prefix.cluster_solution.pk1";
$clusters_context = "$prefix.clusters_context.pk1";
}
elsif($i == 1)
{
$clusters="$prefix.clusters.pk2";
$cluster_solution = "$prefix.cluster_solution.pk2";
$clusters_context = "$prefix.clusters_context.pk2";
}
elsif($i == 2)
{
$clusters="$prefix.clusters.pk3";
$cluster_solution = "$prefix.cluster_solution.pk3";
$clusters_context = "$prefix.clusters_context.pk3";
}
elsif($i == 3)
{
$clusters="$prefix.clusters.gap";
$cluster_solution = "$prefix.cluster_solution.gap";
$clusters_context = "$prefix.clusters_context.gap";
}
}
}
else # No. of Clusters: Set Manually
{
$clusters="$prefix.clusters";
$cluster_solution = "$prefix.cluster_solution";
$clusters_context = "$prefix.clusters_context";
}
if(defined $opt_wordclust)
{
$status=system("format_clusters.pl $cluster_solution $rlabel > $clusters");
die "Error while formatting clusters.\n" unless $status==0;
}
else
{
$status=system("format_clusters.pl $cluster_solution $rlabel --senseval2 $testfile > $clusters");
die "Error while formatting clusters.\n" unless $status==0;
# execute the format_clusters.pl with --context option and use this file to label the clusters.
$status=system("format_clusters.pl $cluster_solution $rlabel --context $testfile > $clusters_context");
die "Error while running format_clusters.pl $cluster_solution $rlabel --context $testfile > $clusters_context\n" unless $status==0;
}
$i++;
}
if(!defined $opt_wordclust)
{
# create the parameter string for clusterlabeling.pl
if(defined $opt_verbose)
{
print STDERR "Creating Cluster Labels ...\n";
}
$cluslabel_str = " --token $token ";
if(defined $opt_label_window)
{
$cluslabel_str .= " --window $opt_label_window ";
}
if(defined $opt_label_ngram)
{
if($opt_label_ngram < 2 || $opt_label_ngram > 4)
{
print STDERR "\n ERROR($0):
Labeling mechanism only supports bigrams, trigrams and 4-grams for feature selection.\n";
exit 1;
}
$cluslabel_str .= " --ngram $opt_label_ngram ";
}
if(defined $opt_label_stop)
{
$cluslabel_str .= " --stop $opt_label_stop ";
}
if(defined $opt_label_remove)
{
$cluslabel_str .= " --remove $opt_label_remove ";
}
if(defined $opt_label_stat)
{
$cluslabel_str .= " --stat $opt_label_stat ";
}
if(defined $opt_label_rank)
{
$cluslabel_str .= " --rank $opt_label_rank ";
}
$i = 0;
while($i <= $num_k)
{
if(defined $opt_cluststop)
{
if($cluststop ne "all" && $cluststop ne "pk")
{
$clusters_context = "$prefix.clusters_context.$cluststop";
$cluster_labels = "$prefix.cluster_labels.$cluststop";
$param_str = $cluslabel_str . "--prefix $prefix.$cluststop ";
}
else
{
if($i == 0)
{
$clusters_context = "$prefix.clusters_context.pk1";
$cluster_labels = "$prefix.cluster_labels.pk1";
$param_str = $cluslabel_str . "--prefix $prefix.pk1 ";
}
elsif($i == 1)
{
$clusters_context = "$prefix.clusters_context.pk2";
$cluster_labels = "$prefix.cluster_labels.pk2";
$param_str = $cluslabel_str . "--prefix $prefix.pk2 ";
}
elsif($i == 2)
{
$clusters_context = "$prefix.clusters_context.pk3";
$cluster_labels = "$prefix.cluster_labels.pk3";
$param_str = $cluslabel_str . "--prefix $prefix.pk3 ";
}
elsif($i == 3)
{
$clusters_context = "$prefix.clusters_context.gap";
$cluster_labels = "$prefix.cluster_labels.gap";
$param_str = $cluslabel_str . "--prefix $prefix.gap ";
}
}
}
else # No. of Clusters: Set Manually
{
$clusters_context = "$prefix.clusters_context";
$cluster_labels = "$prefix.cluster_labels";
$param_str = $cluslabel_str . "--prefix $prefix ";
}
# execute the cluster labeling program
$status=system("clusterlabeling.pl $param_str $clusters_context > $cluster_labels");
die "Error while running clusterlabeling.pl $param_str $clusters_context > $cluster_labels\n" unless $status==0;
$i++;
}
}
################
# Evaluation
################
if(defined $opt_eval)
{
if(defined $opt_verbose)
{
print STDERR "Evaluating ...\n";
}
$i = 0;
while($i <= $num_k)
{
if(defined $opt_cluststop)
{
if($cluststop ne "all" && $cluststop ne "pk")
{
$prelabel="$prefix.prelabel.$cluststop";
$label="$prefix.label.$cluststop";
$report="$prefix.report.$cluststop";
$cluster_solution ="$prefix.cluster_solution.$cluststop";
}
else
{
if($i == 0)
{
$prelabel="$prefix.prelabel.pk1";
$label="$prefix.label.pk1";
$report="$prefix.report.pk1";
$cluster_solution ="$prefix.cluster_solution.pk1";
}
elsif($i == 1)
{
$prelabel="$prefix.prelabel.pk2";
$label="$prefix.label.pk2";
$report="$prefix.report.pk2";
$cluster_solution ="$prefix.cluster_solution.pk2";
}
elsif($i == 2)
{
$prelabel="$prefix.prelabel.pk3";
$label="$prefix.label.pk3";
$report="$prefix.report.pk3";
$cluster_solution ="$prefix.cluster_solution.pk3";
}
elsif($i == 3)
{
$prelabel="$prefix.prelabel.gap";
$label="$prefix.label.gap";
$report="$prefix.report.gap";
$cluster_solution ="$prefix.cluster_solution.gap";
}
}
}
else # No. of Clusters: Set Manually
{
$prelabel="$prefix.prelabel";
$label="$prefix.label";
$report="$prefix.report";
$cluster_solution ="$prefix.cluster_solution";
}
$status=system("cluto2label.pl $cluster_solution keyfile*.key > $prelabel");
die "Error while running cluto2label.pl\n" unless $status==0;
$status=system("label.pl $prelabel > $label");
die "Error while running label.pl\n" unless $status==0;
$status=system("report.pl $label $prelabel > $report");
die "Error while running report.pl\n" unless $status==0;
$i++;
}
$status=system("mv keyfile*.key $prefix.key");
die "Error while creating the KEY file.\n" unless $status==0;
}
##################
# Printing Output
##################
if(defined $opt_cluststop)
{
if($opt_cluststop eq "all")
{
$predict_measure[0] = "PK1 measure";
$predict_measure[1] = "PK2 measure";
$predict_measure[2] = "PK3 measure";
$predict_measure[3] = "Adapted Gap Statistic";
}
elsif($opt_cluststop eq "pk")
{
$predict_measure[0] = "PK1 measure";
$predict_measure[1] = "PK2 measure";
$predict_measure[2] = "PK3 measure";
}
else
{
$predict_measure[0] = uc $opt_cluststop;
$predict_measure[0] .= " measure";
}
}
else
{
$predict_measure[0] = "Set manually";
}
$i = 0;
while($i <= $num_k)
{
print "\n=================================================================\n";
print "Output when #clusters = $predict[$i] ($predict_measure[$i])\n";
print "=================================================================\n";
if(defined $opt_cluststop)
{
if($cluststop ne "all" && $cluststop ne "pk")
{
$cluster_output ="$prefix.cluster_output.$cluststop";
$status=system("cat $cluster_output");
die "Error while displaying the cluster results.\n" unless $status==0;
if(defined $opt_eval)
{
$report = "$prefix.report.$cluststop";
$status=system("cat $report");
die "Error while displaying the report file.\n" unless $status==0;
}
$clusters="$prefix.clusters.$cluststop";
print "\nClusters of given contexts can be found in file: <$clusters>\n\n";
}
else
{
if($i == 0)
{
$cluster_output ="$prefix.cluster_output.pk1";
$status=system("cat $cluster_output");
die "Error while displaying the cluster results.\n" unless $status==0;
if(defined $opt_eval)
{
$report = "$prefix.report.pk1";
$status=system("cat $report");
die "Error while displaying the report file.\n" unless $status==0;
}
$clusters="$prefix.clusters.pk1";
print "\nClusters of given contexts can be found in file: $clusters\n\n";
}
elsif($i == 1)
{
$cluster_output ="$prefix.cluster_output.pk2";
$status=system("cat $cluster_output");
die "Error while displaying the cluster results.\n" unless $status==0;
if(defined $opt_eval)
{
$report = "$prefix.report.pk2";
$status=system("cat $report");
die "Error while displaying the report file.\n" unless $status==0;
}
$clusters="$prefix.clusters.pk2";
print "\nClusters of given contexts can be found in file: $clusters\n\n";
}
elsif($i == 2)
{
$cluster_output ="$prefix.cluster_output.pk3";
$status=system("cat $cluster_output");
die "Error while displaying the cluster results.\n" unless $status==0;
if(defined $opt_eval)
{
$report = "$prefix.report.pk3";
$status=system("cat $report");
die "Error while displaying the report file.\n" unless $status==0;
}
$clusters="$prefix.clusters.pk3";
print "\nClusters of given contexts can be found in file: $clusters\n\n";
}
elsif($i == 3)
{
$cluster_output ="$prefix.cluster_output.gap";
$status=system("cat $cluster_output");
die "Error while displaying the cluster results.\n" unless $status==0;
if(defined $opt_eval)
{
$report = "$prefix.report.gap";
$status=system("cat $report");
die "Error while displaying the report file.\n" unless $status==0;
}
$clusters="$prefix.clusters.gap";
print "\nClusters of given contexts can be found in file: $clusters\n\n";
}
}
}
else # No. of Clusters: Set Manually
{
$cluster_output ="$prefix.cluster_output";
$status=system("cat $cluster_output");
die "Error while displaying the cluster results.\n" unless $status==0;
if(defined $opt_eval)
{
$report = "$prefix.report";
$status=system("cat $report");
die "Error while displaying the report file.\n" unless $status==0;
}
$clusters="$prefix.clusters";
print "\nClusters of given contexts can be found in file: $clusters\n\n";
}
$i++;
}
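=pod

The loops above repeat the same mapping from a prediction index to a
filename suffix (pk1, pk2, pk3, gap). A table-driven sketch of that mapping
follows. The helper name measure_suffix and the array @MEASURE_ORDER are
hypothetical; the order of the measures is taken from the if/elsif chains
above.

```perl
use strict;
use warnings;

# Order in which the predictions are read from PREFIX.predictions when
# --cluststop is "all"; "pk" uses only the first three entries.
my @MEASURE_ORDER = ("pk1", "pk2", "pk3", "gap");

# Hypothetical helper: return the filename suffix for the i-th prediction.
#   $cluststop - the cluster stopping measure, or undef when the number of
#                clusters was set manually (or cluster stopping failed)
#   $i         - index into the list of predictions
sub measure_suffix
{
    my ($cluststop, $i) = @_;
    # no cluster stopping: plain PREFIX.* filenames
    return "" unless defined $cluststop;
    # a single measure: the suffix is the measure name itself
    if ($cluststop ne "all" && $cluststop ne "pk")
    {
        return ".$cluststop";
    }
    # several measures: pick the name by position
    die "No measure for prediction index $i\n" unless defined $MEASURE_ORDER[$i];
    return ".$MEASURE_ORDER[$i]";
}
```

With such a helper, a name like "$prefix.cluster_output" could be completed
by appending measure_suffix(defined $opt_cluststop ? $cluststop : undef, $i).
Unlike the chains above, this sketch dies on an index beyond the fourth
measure instead of silently reusing the previous filenames.

=cut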
##############################################################################
# ==========================
# SUBROUTINE SECTION
# ==========================
sub svd
{
($svdin,$svdout)=@_;
# converting input to harwell-boeing format
$svd_string="";
if(defined $opt_k)
{
$svd_string="--k $opt_k ";
}
if(defined $opt_rf)
{
$svd_string.="--rf $opt_rf ";
}
if(defined $opt_iter)
{
$svd_string.="--iter $opt_iter ";
}
$numform = "5$format"; ## numform is 5f16.XX
$status=system("mat2harbo.pl --numform $numform --param $svd_string $svdin > matrix");
die "Error while running mat2harbo.pl on <$svdin>\n" unless $status==0;
# run SVDPACKC's las2 on the file <matrix> written above
system("las2");
$harbomat="$prefix.harbomat";
$status=system("mv matrix $harbomat");
die "Error in creating <$harbomat>\n" unless $status==0;
# reconstruction
$status=system("svdpackout.pl --rowonly --format $format lav2 lao2 > $svdout");
die "Error while running svdpackout.pl\n" unless $status==0;
}
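=pod

The numform built in the svd subroutine is the --format value (f16.XX)
prefixed with 5, which reads as a Fortran-style description: fields of
width 16 with XX digits after the decimal point. A minimal sketch of the
same layout expressed as a sprintf format follows. The helper name
format_to_sprintf is hypothetical, and accepting a one-digit XX is an
assumption of this sketch rather than something the script is known to do.

```perl
use strict;
use warnings;

# Hypothetical helper: turn "f16.06" into the sprintf format "%16.6f".
# Dies unless the width is 16 and the precision is between 0 and 15.
sub format_to_sprintf
{
    my ($format) = @_;
    if ($format =~ /^f16\.(\d{1,2})$/ && $1 <= 15)
    {
        # $1 + 0 drops a leading zero, so "06" becomes 6
        return "%16." . ($1 + 0) . "f";
    }
    die "Invalid format <$format>: expected f16.XX with XX between 0 and 15\n";
}
```

For example, sprintf(format_to_sprintf("f16.06"), 3.14159) yields a
16-character field holding 3.141590 padded with spaces on the left.

=cut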
#-----------------------------------------------------------------------------
#show minimal usage message
sub showminimal()
{
print "Usage: discriminate.pl [OPTIONS] TEST";
print "\nTYPE discriminate.pl --help for help\n";
}
#-----------------------------------------------------------------------------
#show help
sub showhelp()
{
print "Usage: discriminate.pl [OPTIONS] TEST
Wrapper program for SenseClusters' Toolkit. Discriminates among the
given text instances based on their contextual similarities.
TEST
Senseval-2 formatted TEST instance file containing the instances
to be clustered.
OPTIONS:
--training TRAIN
Specify the training file in plain text format. Instances from this
file are used for selecting features. If --training is not specified,
features are selected from the same TEST file.
--split N
Splits the given TEST file into two portions, N% for the use as the
TRAIN data and (100-N)% as the TEST data. The value for N is a
percentage and should be an integer between 1 and 99 (inclusive).
The instances from the original TEST file are not picked or split
in any particular order but are randomly split into the two portions
of TRAIN and TEST data while maintaining the ratio of N/(100-N).
Note: This option cannot be used when --training option is also used.
--token TOKEN
Specify a file containing Perl regex/s that define the tokenization
scheme in TRAIN and TEST files. By default, token.regex is searched
in the current directory.
--target TARGET
Specify a file containing Perl regex/s that identify the target word/s
whose senses are to be discriminated.
If --target is not specified, target.regex file is searched in the
current directory. If this file doesn't exist, target.regex is
automatically created by searching the <head> tags in the TEST data.
If no <head> tags are found in TEST, TEST is assumed to be global.
Note: --target cannot be specified with headless input data
i.e. test file without head/target word(s).
--prefix PRE
Specify the prefix to be used for output filenames.
--format f16.XX
The default format for floating point numbers is f16.06. This means
that each number is 16 characters wide, with 6 digits to the right
of the decimal point. You may change XX to any value between 0
and 15; however, the format must remain 16 spaces long due to
formatting requirements of SVDPACKC.
--wordclust
Discriminates and clusters each word based upon its direct and indirect
co-occurrence with other words (when used without the --lsa switch) or
clusters words or features based upon their occurrences in different
contexts (when used with the --lsa switch).
Note: 1. Separate (--training) TRAIN data should not be used with word
clustering.
2. Starting with Version 0.93, word clustering is no longer
restricted to using only headless data. However, options
specific to headed data such as --scope_test and target
co-occurrence features (see below) cannot be used.
--lsa
Uses Latent Semantic Analysis (LSA) style representation for clustering
features or contexts. LSA representation is the transpose of
the context-by-feature matrix created using the native SenseClusters
order1 context representation.
This option can be used only in the following two combinations of
the --context and the --wordclust options:
1. --context o1 --wordclust --lsa
Performs feature clustering by grouping together features based on the
contexts that they occur in. Features can be unigrams, bigrams or
co-occurrences. Feature vectors are the rows of the transposed
context-by-feature representation created by order1vec.pl.
2. --context o2 --lsa
Performs context clustering by creating context vectors by averaging the
feature vectors from the transposed context-by-feature representation of
order1vec.pl.
Feature Options :
--feature TYPE
Specify the feature type to be used for representing contexts.
Possible options for feature type with first order context
representation:
bi - bigrams [default]
tco - target co-occurrences
co - co-occurrences
uni - unigrams
Possible options for feature type with second order context
representation:
bi - bigrams [default]
co - co-occurrences
tco - target co-occurrences
Note: tco (target co-occurrences) cannot be used with headless
data i.e. test/train file without head/target word(s).
--scope_train S1
Context in TRAIN instances is limited to include only S1 words on the
left and right of the TARGET word. Use --scope_train only if every
training instance contains the TARGET word.
Note: --scope_train cannot be used with headless data i.e. train file
without head/target word(s).
--scope_test S2
Context in TEST instances is limited to include only S2 words on the
left and right of the TARGET word. Use --scope_test only if every
test instance contains the TARGET word.
Note: --scope_test cannot be used with headless data i.e. test file
without head/target word(s).
--remove F
Features occurring fewer than F times are removed from the
feature set.
--window W
Sets the window size for bigram and co-occurrence features. Words
occurring within W positions from each other (i.e. at most W-2
intervening words) form bigrams/co-occurrences.
--stop STOPFILE
Specify a file of Perl regex/s that define a stop list of words to be
excluded from the features.
--stat STAT
Performs the specified statistical test of association on bigrams/
co-occurrences. The test scores can be used to filter insignificant
pairs or in the feature vector representations.
The possible values of STAT are -
dice - Dice Coefficient
ll - Log Likelihood Ratio
odds - Odds Ratio
phi - Phi Coefficient
pmi - Point-Wise Mutual Information
tmi - True Mutual Information
x2 - Chi-Squared Test
tscore - T-Score
leftFisher - Left Fisher's Test
rightFisher - Right Fisher's Test
--stat_rank R
Word pairs ranking below R when arranged in descending order of
their test scores are ignored.
--stat_rank will be ignored unless --stat option is specified.
--stat_score S
Specify the score cutoff value to select pairs with test scores
greater than S.
--stat_score will be ignored unless option --stat is specified.
Vector Options :
--context ORD
Specify the context representation to be used to represent the TEST
instances. Set ORD to 'o1' to use 1st order context vectors and to
'o2' to use 2nd order context vectors. Default context representation
is o2.
--binary
Creates binary feature and context vectors. By default, the frequency
scores are retained by these vectors.
SVD Options :
--svd
Performs Singular Value Decomposition to reduce the feature space
dimensions.
--k K
Reduces dimensions of the feature space to K. Default is 300.
--rf RF
Specifies the reduction factor such that feature space with N
dimensions is reduced down to N/RF (RF >= 1). Default RF=10.
--iter I
Specifies the number of SVD iterations. Recommended value is (3 x K).
Cluster-Stopping Options:
--cluststop CS
Specify the cluster stopping measure to be used to predict the
number of clusters.
The possible option values:
pk1 - Use PK1 measure
[PK1[m] = (crfun[m] - mean(crfun[1...deltaM]))/std(crfun[1...deltaM])]
pk2 - Use PK2 measure
[PK2[m] = (crfun[m]/crfun[m-1])]
pk3 - Use PK3 measure
[PK3[m] = ((2 * crfun[m])/(crfun[m-1] + crfun[m+1]))]
gap - Use Adapted Gap Statistic.
pk - Use all the PK measures.
all - Use all the four cluster stopping measures.
More about these measures can be found in the documentation of
Toolkit/clusterstop/clusterstopping.pl
NOTE: Options --clusters and --cluststop cannot be used together.
--delta INT
NOTE: Delta value can only be a non-negative integer.
Specify 0 to stop the iterative clustering process when two consecutive
crfun values are exactly equal. This is the default setting when the
crfun values are integer/whole numbers.
Specify a positive integer to stop the iterative clustering
process when the difference between two consecutive crfun values
is less than or equal to this value. However, note that the integer
value specified is internally shifted to capture the difference in
the least significant digit of the crfun values when these crfun
values are fractional.
For example:
For crfun = 1.23e-02, delta = 1 will be transformed to 0.0001
For crfun = 2.45e-01, delta = 5 will be transformed to 0.005
The default delta value when the crfun values are fractional is 1.
However if the crfun values are integer/whole numbers (exponent >= 2)
then the specified delta value is internally shifted only until the
least significant digit in the scientific notation.
For example:
For crfun = 1.23e+04, delta = 2 will be transformed to 200
For crfun = 2.45e+02, delta = 5 will be transformed to 5
For crfun = 1.44e+03, delta = 1 will be transformed to 10
--threspk1 NUM
The threshold value that should be used by the PK1 measure to predict
the k value.
Default = -0.7
NOTE: This option should be used only when --cluststop option is also
used with option value of \"all\" or \"pk1\".
Cluster-Stopping: Adapted Gap Statistic Options:
--B NUM
The number of replicates/references to be generated.
Default: 1
--typeref TYP
Specifies whether to generate B replicates from a reference or to
generate B references.
The possible option values:
rep - replicates [Default]
ref - references
--percentage NUM
Specifies the percentage confidence to be reported in the log file.
Since Gap Statistic uses parametric bootstrap method for reference
distribution generation, it is critical to understand the interval
around the sample mean that could contain the population (\"true\")
mean and with what certainty.
Default: 90
--seed NUM
The seed to be used with the random number generator.
Default: No seed is set.
Clustering Options :
--clusters C
Specify the number of clusters to be created. Default is 2.
--space SPACE
Specifies whether clustering is to be performed in vector or similarity
space. Set SPACE to 'vector' to cluster context vectors directly in
vector space OR to 'similarity' to compose a similarity matrix and
cluster instances in similarity space. Default SPACE is vector.
--clmethod CL
Specifies the clustering method.
Possible option values are :
rb - Repeated Bisections [Default]
rbr - Repeated Bisections followed by k-way refinement
direct - Direct k-way clustering
agglo - Agglomerative clustering
graph - Graph partitioning-based clustering
bagglo - Partitional biased Agglomerative clustering
--crfun CR
Selects the criterion function for clustering. The meanings of these
criterion functions are explained in Cluto's manual.
The possible values are :
i1 - I1 Criterion function
i2 - I2 Criterion function [default for partitional]
e1 - E1 Criterion function
g1 - G1 Criterion function
g1p - G1' Criterion function
h1 - H1 Criterion function
h2 - H2 Criterion function
slink - Single link merging scheme
wslink - Single link merging scheme weighted w.r.t. cluster sim
clink - Complete link merging scheme
wclink - Complete link merging scheme weighted w.r.t. cluster sim
upgma - Group average merging scheme [default for agglomerative]
Note that for cluster stopping, only the i1, i2, e1, h1 and h2 criterion
functions can be used. If a crfun other than these is selected
then cluster stopping uses the default crfun (i2) while the final
clustering of contexts is performed using the crfun specified.
--sim SIM
Specifies the similarity measure to be used during clustering.
When --space is vector, possible option values of SIM are :
cos - Cosine Coefficient [default]
corr - Correlation Coefficient
dist - Euclidean distance
jacc - Extended Jaccard Coefficient
When --space is similarity and --binary is ON,
possible values of SIM are :
cos - Cosine Coefficient [default]
mat - Match Coefficient
jac - Jaccard Coefficient
ovr - Overlap Coefficient
dic - Dice Coefficient
Otherwise, only cosine coefficient is available and is default.
--rowmodel RMOD
The option is used to specify the model to be used to scale every
column of each row. (For further details please refer to the Cluto manual.)
The possible values for RMOD -
none - no scaling is performed (default setting)
maxtf - post scaling the values are between 0.5 and 1.0
sqrt - square-root of actual values
log - log of actual values
--colmodel CMOD
The option is used to specify the model to be used to (globally)
scale each column across all rows. (For further details please refer
to the Cluto manual.)
The possible values for CMOD -
none - no scaling is performed (default setting)
idf - scaling according to inverse-document-frequency
Labeling Options :
Note: Labeling options cannot be used with word-clustering
(--wordclust).
--label_stop LABEL_STOPFILE
A file of Perl regexes that define the stop list of words to be
excluded from the labels.
--label_ngram LABEL_NGRAM
Specifies the value of n in 'n-gram' for the feature selection.
The supported values for n are 2, 3 and 4.
Default value is 2.
--label_remove LABEL_N
Removes ngrams that occur less than LABEL_N times.
--label_window LABEL_W
Specifies the window size for bigrams. Pairs of words that co-occur
within the specified window from each other (window LABEL_W allows at
most LABEL_W-2 intervening words) will form the bigram features.
Default window size is 2 which allows only consecutive word pairs.
--label_stat LABEL_STAT
Specifies the statistical scores of association.
Available tests of association are :
dice - Dice Coefficient
ll - Log Likelihood Ratio
odds - Odds Ratio
phi - Phi Coefficient
pmi - Point-Wise Mutual Information
tmi - True Mutual Information
x2 - Chi-Squared Test
tscore - T-Score
leftFisher - Left Fisher's Test
rightFisher - Right Fisher's Test
--label_rank LABEL_R
Features ranking below LABEL_R when arranged in descending order of
their test scores are ignored.
Other Options :
--eval
Evaluates clustering performance by comparing results against correct
answer keys.
Note: This option can be used only if the answer tags are provided
in the TEST file.
--rank_filter R
Removes low frequency senses during evaluation. This will
remove the senses that rank below R when senses in TEST are arranged
in the descending order of their frequencies. In other words, it
selects top R most frequent senses. An instance will be removed if
it has all sense tags below rank R.
--percent_filter P
Removes low frequency senses based on their percentage
frequencies. This will remove senses whose frequency is below P%
in the TEST data.
--showargs
Displays to STDOUT values of compulsory and optional arguments.
[NOT SUPPORTED IN THIS VERSION]
--verbose
Displays to STDERR the current program status.
--help
Displays this message.
--version
Displays the version information.
Type 'perldoc discriminate.pl' for more detailed information.\n";
}
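=pod

The --delta examples in the help text above are consistent with scaling the
integer delta to the least significant digit of the crfun value when that
value is written with two decimals in scientific notation. A minimal sketch
of that arithmetic follows. The helper name scaled_delta is hypothetical,
and the rule is inferred from the examples only; it is not taken from
clusterstopping.pl, which may implement the shift differently.

```perl
use strict;
use warnings;

# Hypothetical helper: scale an integer delta to the least significant
# digit of a crfun value formatted as d.dde[+-]xx.
sub scaled_delta
{
    my ($crfun, $delta) = @_;
    # extract the exponent from the two-decimal scientific notation
    my ($exponent) = sprintf("%.2e", $crfun) =~ /e([+-]?\d+)$/;
    # the last mantissa digit sits two places below the exponent
    return $delta * 10 ** ($exponent - 2);
}
```

Under this reading, crfun = 1.23e+04 with delta = 2 gives 2 * 10**(4-2),
that is 200, which matches the help text.

=cut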
#------------------------------------------------------------------------------
#version information
sub showversion()
{
print '$Id: discriminate.pl,v 1.108 2013/06/26 01:09:24 jhaxx030 Exp $';
print "\nDriver to Run SenseClusters\n";
## print "\nCopyright (c) 2002-2006, Ted Pedersen, Amruta Purandare, Anagha Kulkarni, & Mahesh Joshi\n";
## print "Date of Last Update: 07/30/2006\n";
}
#############################################################################
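=pod

The PK1, PK2 and PK3 formulas quoted in the --cluststop help text can be
sketched directly from their definitions. The function names pk1, pk2 and
pk3 are hypothetical, the crfun values are assumed to be stored in an array
indexed by the number of clusters m (index 0 unused), deltaM is treated as
the last index of the range, and the use of the population standard
deviation in pk1 is an assumption. The actual computation lives in
clusterstopping.pl and may differ.

```perl
use strict;
use warnings;

# PK2[m] = crfun[m] / crfun[m-1]
sub pk2
{
    my ($crfun, $m) = @_;
    return $crfun->[$m] / $crfun->[$m - 1];
}

# PK3[m] = (2 * crfun[m]) / (crfun[m-1] + crfun[m+1])
sub pk3
{
    my ($crfun, $m) = @_;
    return (2 * $crfun->[$m]) / ($crfun->[$m - 1] + $crfun->[$m + 1]);
}

# PK1[m] = (crfun[m] - mean(crfun[1..deltaM])) / std(crfun[1..deltaM])
sub pk1
{
    my ($crfun, $m, $delta_m) = @_;
    my @values = @{$crfun}[1 .. $delta_m];
    my $mean = 0;
    $mean += $_ for @values;
    $mean /= @values;
    my $variance = 0;
    $variance += ($_ - $mean) ** 2 for @values;
    $variance /= @values;    # population variance: an assumption
    return ($crfun->[$m] - $mean) / sqrt($variance);
}
```

For crfun values 2, 4, 6 and 8 at m = 1..4, pk2 at m = 2 is 4/2 and pk3 at
m = 2 is 8/(2+6).

=cut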