discriminate.pl Wrapper program to run SenseClusters in a single command
Discriminates among the given text instances based on their contextual similarities.
discriminate.pl [OPTIONS] TEST
Senseval-2 formatted TEST instance file that contains the instances to be clustered.
Training file in plain text format that can be used to select features. If this is not specified, features are selected from the given TEST file.
Splits the given TEST file into two portions, N% for use as the TRAIN data and (100-N)% as the TEST data. The value for N is a percentage and should be an integer between 1 and 99 (inclusive). The instances from the original TEST file are not picked or split in any particular order; they are randomly split into the TRAIN and TEST portions while maintaining the ratio of N/(100-N).
Note: This option cannot be used when --training option is also used.
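The random N/(100-N) split described above can be pictured with a short Python sketch. This is only an illustration, not the code used by discriminate.pl: the function name split_instances, the seed parameter, and the rounding of the TRAIN size to the nearest whole instance are all assumptions.

```python
import random


def split_instances(instances, n_percent, seed=None):
    """Randomly split instances into a TRAIN part (about N%) and a TEST part."""
    if not 1 <= n_percent <= 99:
        raise ValueError("N must be an integer between 1 and 99 (inclusive)")
    rng = random.Random(seed)
    shuffled = list(instances)
    rng.shuffle(shuffled)  # no particular order is preserved
    # Assumption: the TRAIN size is rounded to the nearest whole instance.
    n_train = round(len(shuffled) * n_percent / 100)
    return shuffled[:n_train], shuffled[n_train:]


# Hypothetical usage: 100 instance ids, 80% for TRAIN and 20% for TEST.
train, test = split_instances(range(100), 80, seed=1)
```

Every instance ends up in exactly one of the two portions, so the portions together always cover the original TEST file.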
A file containing Perl regex/s that define the tokenization scheme in TRAIN and TEST files. If --token is not specified, the default token regex file token.regex is looked for in the current directory.
A file containing Perl regex/s for identifying the target word. A sample target.regex file containing regex:
/<head>\w+<\/head>/
is provided with this distribution. If --target is not specified, the default target regex file target.regex is looked for in the current directory. If this file doesn't exist, target.regex is automatically created by finding all instances of <head> tags in the TEST data. If there are no instances of <head> tags in TEST, the given data is assumed to be global and the target word is not searched for in either TRAIN or TEST.
Note: --target cannot be specified with headless input data i.e. test file without head/target word(s).
Specifies a prefix to be used in all output file names. For example, the context vector file will be named 'PRE.vectors', the features file 'PRE.features', and so on. By default, a random prefix is created using the time stamp.
The default format for floating point numbers is f16.06. This means that there is room for 6 digits to the right of the decimal, and 9 to the left. You may change XX to any value between 0 and 15; however, the format must remain 16 spaces long due to formatting requirements of SVDPACKC.
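As a rough analogy only, a fixed-width format of this kind behaves like a Python format specification with a total width of 16 and XX digits after the decimal point. The helper name format_value is hypothetical, and this sketch does not claim to reproduce the exact output written by the program.

```python
def format_value(value, digits=6, width=16):
    """Render value right-aligned in a fixed-width field with `digits` decimals."""
    if not 0 <= digits <= 15:
        raise ValueError("digits must be between 0 and 15")
    # Assumption: the field is padded with spaces on the left.
    return f"{value:{width}.{digits}f}"


default_field = format_value(3.5)            # 6 digits after the decimal point
short_field = format_value(3.5, digits=2)    # still 16 characters wide
```

Changing the number of decimals moves the decimal point within the field but leaves the total width at 16 characters.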
Discriminates and clusters each word based upon its direct and indirect co-occurrence with other words (when used without the --lsa switch) or clusters words or features based upon their occurrences in different contexts (when used with the --lsa switch).
Note:
1. Separate (--training) TRAIN data should not be used with word clustering.
2. Starting with Version 0.93, word clustering is no longer restricted to using only headless data. However, options specific to headed data such as --scope_test and target co-occurrence features (see below) cannot be used.
Uses Latent Semantic Analysis (LSA) style representation for clustering features or contexts. LSA representation is the transpose of the context-by-feature matrix created using the native SenseClusters order1 context representation.
This option can be used only in the following two combinations of the --context and the --wordclust options:
Performs feature clustering by grouping together features based on the contexts that they occur in. Features can be unigrams, bigrams or co-occurrences. Feature vectors are the rows of the transposed context-by-feature representation created by order1vec.pl.
Performs context clustering by creating context vectors by averaging the feature vectors from the transposed context-by-feature representation of order1vec.pl.
Specify the feature type to be used for representing contexts. Possible options for feature type with first order context representation:
bi - bigrams [default]
tco - target co-occurrences
co - co-occurrences
uni - unigrams
Possible options for feature type with second order context representation:
bi - bigrams [default]
co - co-occurrences
tco - target co-occurrences
Note: --tco (target co-occurrences) cannot be used with headless data i.e. test/train file without head/target word(s).
Limits the scope of the training contexts to S1 words around (on both sides of) the TARGET word. Thus, it allows selection of local features. If --scope_train is used, each training instance is expected to include the target word as specified by the --target option or default target.regex.
Note: --scope_train cannot be used with headless data i.e. train files without head/target word(s).
Limits the scope of the test contexts to S2 words around (on both sides of) the TARGET word. Thus, it allows local features to be matched and used in the context vectors.
Note: --scope_test cannot be used with headless data i.e. test files without head/target word(s).
A file of Perl regexes that define the stop list of words to be excluded from the features.
STOPFILE can be specified in one of two modes:
AND mode - declared by including '@stop.mode=AND' on the first line of the STOPFILE; ignores word pairs in which both words are stop words.
OR mode - declared by including '@stop.mode=OR' on the first line of the STOPFILE; ignores word pairs in which either word is a stop word.
Both modes exclude stop words from unigram features.
Default is OR mode.
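The difference between the two stop modes can be sketched in a few lines of Python. This is a simplified illustration: the real STOPFILE holds Perl regexes, whereas this sketch assumes a plain set of stop words, and the function name pair_is_ignored is hypothetical.

```python
def pair_is_ignored(pair, stop_words, mode="OR"):
    """Return True if a word pair would be dropped under the given stop mode."""
    first, second = pair
    hits = (first in stop_words, second in stop_words)
    if mode == "AND":
        return all(hits)   # both words must be stop words
    if mode == "OR":
        return any(hits)   # one stop word is enough
    raise ValueError("mode must be 'AND' or 'OR'")


stop_words = {"the", "of"}                                  # toy stop list
mixed_or = pair_is_ignored(("the", "bank"), stop_words)     # OR is the default
mixed_and = pair_is_ignored(("the", "bank"), stop_words, mode="AND")
```

A pair with exactly one stop word is dropped in OR mode but kept in AND mode, so AND mode generally retains more features.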
Removes features that occur fewer than F times in the training corpus.
Specifies the window size for bigram/co-occurrence features. Pairs of words that co-occur within the specified window from each other (window W allows at most W-2 intervening words) will form the bigram/co-occurrence features.
Default window size is 2 which allows only consecutive word pairs.
Not applicable to unigram features.
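To make the window rule concrete, the following Python sketch lists the ordered word pairs that fall within a window of W tokens, so that at most W-2 words intervene. It is an illustration under assumptions, not the feature extractor used by the package: the function name window_pairs is hypothetical, and co-occurrence features may additionally ignore word order.

```python
def window_pairs(tokens, window=2):
    """List ordered pairs whose words are separated by at most window-2 tokens."""
    if window < 2:
        raise ValueError("window must be at least 2")
    pairs = []
    for i, left in enumerate(tokens):
        # The right-hand word may be up to window-1 positions after the left one.
        for j in range(i + 1, min(i + window, len(tokens))):
            pairs.append((left, tokens[j]))
    return pairs


tokens = ["a", "b", "c", "d"]
consecutive_only = window_pairs(tokens, window=2)
one_gap_allowed = window_pairs(tokens, window=3)
```

With the default window of 2 only adjacent words are paired; a window of 3 also pairs words with one word between them.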
Bigrams and co-occurrences can be selected based on their statistical scores of association as specified by this option. If --vector = o2 and --stat is used, word association matrix will use the scores computed by the specified statistical test instead of simple joint frequency counts of the word pairs.
Available tests of association are:
dice - Dice Coefficient
ll - Log Likelihood Ratio
odds - Odds Ratio
phi - Phi Coefficient
pmi - Point-Wise Mutual Information
tmi - True Mutual Information
x2 - Chi-Squared Test
tscore - T-Score
leftFisher - Left Fisher's Test
rightFisher - Right Fisher's Test
By default, features are selected and represented using their frequency counts.
Word pairs ranking below N when arranged in descending order of their test scores are ignored.
--stat_rank has no effect unless --stat is specified.
Selects word pairs with scores greater than S after performing the selected test of association. S can be any real number that yields a reasonable number of features for the requested test.
--stat_score has no effect unless --stat is specified.
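The two filters can be pictured as operations on a table of scored word pairs. The Python sketch below is illustrative only: the function name select_pairs and the toy scores are made up, and the handling of tied scores is an assumption.

```python
def select_pairs(scores, rank=None, min_score=None):
    """Keep pairs scoring above min_score and/or within the top `rank` pairs."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if min_score is not None:
        # Only scores strictly greater than the cutoff survive.
        ranked = [(pair, score) for pair, score in ranked if score > min_score]
    if rank is not None:
        # Assumption: ties are cut off by sort order; the real tool may keep ties.
        ranked = ranked[:rank]
    return [pair for pair, _ in ranked]


# Made-up association scores for three word pairs.
scores = {("new", "york"): 9.5, ("of", "the"): 1.2, ("fine", "wine"): 4.0}
top_two = select_pairs(scores, rank=2)
above_four = select_pairs(scores, min_score=4.0)
```

Both filters only make sense once a test of association has produced scores, which matches the note that they have no effect unless --stat is specified.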
Specifies the context representation to be used. Set ORD to 'o1' to use 1st order context vectors, and to 'o2' to select 2nd order context vectors. Default context representation is o2.
Creates binary feature and context vectors. By default, feature vectors show the joint frequency scores of the associated word pairs while the context vectors show the average of the feature vectors of words that occur in the context. With --binary turned ON, feature vectors show mere presence or absence of the particular word pair (co-occurrence/bigram) in TRAIN, while the context vectors will represent a binary 'OR' operation on the corresponding vectors of contextual features.
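The two ways of building a context vector described above, averaging versus a binary OR, can be sketched as follows. The function name context_vector and the toy vectors are hypothetical; the sketch assumes dense vectors of equal length, which may differ from the sparse representation used in practice.

```python
def context_vector(feature_vectors, binary=False):
    """Combine the feature vectors of the words found in one context."""
    columns = list(zip(*feature_vectors))
    if binary:
        # Binary OR: a dimension is 1 if any feature vector is non-zero there.
        return [1 if any(column) else 0 for column in columns]
    # Default: average the feature vectors dimension by dimension.
    return [sum(column) / len(column) for column in columns]


averaged = context_vector([[2, 0, 4], [0, 0, 2]])
ored = context_vector([[1, 0, 1], [0, 0, 1]], binary=True)
```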
Reduces the feature space dimensions by performing Singular Value Decomposition (SVD). By default, all feature dimensions are retained.
Reduces the dimensions of the feature space to K. Default K = 300
Specifies the scaling factor for reducing feature space dimensions such that feature space with N dimensions is reduced down to N/RF. Default RF = 4. RF should be an integer greater than 1.
If both --k and --rf are specified, dimensions are reduced to min(k,N/RF).
Note: If the reduced dimensions ( min(k,N/RF) ) turn out to be less than or equal to 10, then SVD is not performed.
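The interplay of K, RF and the threshold of 10 can be summarized in a small Python sketch. Several details are assumptions made for illustration: the function name reduced_dimensions, the use of integer division for N/RF, and applying both defaults (K = 300, RF = 4) when neither value is given.

```python
def reduced_dimensions(n, k=None, rf=None):
    """Return the target number of dimensions, or None when SVD is skipped."""
    if k is None and rf is None:
        k, rf = 300, 4  # assumption: both documented defaults apply together
    candidates = []
    if k is not None:
        candidates.append(k)
    if rf is not None:
        if rf <= 1:
            raise ValueError("RF should be an integer greater than 1")
        candidates.append(n // rf)  # assumption: N/RF is truncated to an integer
    target = min(candidates)
    # Reductions to 10 or fewer dimensions are not performed.
    return None if target <= 10 else target


large_space = reduced_dimensions(2000, k=300, rf=4)
small_space = reduced_dimensions(400, k=300, rf=4)
tiny_space = reduced_dimensions(40, k=300, rf=4)
```

For a 2000-dimensional space K is the binding limit, for 400 dimensions N/RF is, and for 40 dimensions the result falls to 10 and SVD is skipped.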
Specifies the number of iterations of SVD. Recommended value is 3 times the desired K.
Specifies the cluster stopping measure to be used to predict the number of clusters.
The possible option values:
pk1 - Use PK1 measure [ PK1[m] = (crfun[m] - mean(crfun[1...deltaM])) / std(crfun[1...deltaM]) ]
pk2 - Use PK2 measure [ PK2[m] = (crfun[m]/crfun[m-1]) ]
pk3 - Use PK3 measure [ PK3[m] = ((2 * crfun[m])/(crfun[m-1] + crfun[m+1])) ]
gap - Use Adapted Gap Statistic.
pk - Use all the PK measures.
all - Use all the four cluster stopping measures.
More about these measures can be found in the documentation of Toolkit/clusterstop/clusterstopping.pl
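The three PK formulas listed above translate directly into code. The Python sketch below computes them for a made-up series of crfun values; it is not taken from clusterstopping.pl. In particular, the use of the population standard deviation for std, and the reading of deltaM as the largest m for which crfun is available, are assumptions.

```python
from statistics import mean, pstdev


def pk1(crfun, m, delta_m):
    """PK1[m]: crfun[m] standardized over crfun[1..deltaM]."""
    values = [crfun[i] for i in range(1, delta_m + 1)]
    # Assumption: std is the population standard deviation.
    return (crfun[m] - mean(values)) / pstdev(values)


def pk2(crfun, m):
    """PK2[m]: ratio of consecutive crfun values."""
    return crfun[m] / crfun[m - 1]


def pk3(crfun, m):
    """PK3[m]: crfun[m] relative to the average of its two neighbours."""
    return (2 * crfun[m]) / (crfun[m - 1] + crfun[m + 1])


# Made-up criterion function values, keyed by the number of clusters m.
crfun = {1: 10.0, 2: 16.0, 3: 20.0, 4: 22.0}
pk1_at_4 = pk1(crfun, 4, delta_m=4)
pk2_at_2 = pk2(crfun, 2)
pk3_at_3 = pk3(crfun, 3)
```

PK2 and PK3 need crfun at the neighbouring values of m, so PK2 is undefined at m = 1 and PK3 at both ends of the series.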
NOTE: Options --cluststop and --clusters (described under Clustering options) cannot be used together.
NOTE: Delta value can only be a positive integer value.
Specify 0 to stop the iterating clustering process when two consecutive crfun values are exactly equal. This is the default setting when the crfun values are integer/whole numbers.
Specify a non-zero positive integer to stop the iterating clustering process when the difference between two consecutive crfun values is less than or equal to this value. Note, however, that when the crfun values are fractional, the specified integer is internally shifted to capture the difference in the least significant digit of the crfun values. For example:
For crfun = 1.23e-02, delta = 1 is transformed to 0.0001
For crfun = 2.45e-01, delta = 5 is transformed to 0.005
The default delta value when the crfun values are fractional is 1.
However, if the crfun values are integer/whole numbers (exponent >= 2), the specified delta value is internally shifted only as far as the least significant digit in the scientific notation. For example:
For crfun = 1.23e+04, delta = 2 is transformed to 200
For crfun = 2.45e+02, delta = 5 is transformed to 5
For crfun = 1.44e+03, delta = 1 is transformed to 10
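All five examples above are consistent with shifting delta by the exponent of the crfun value minus the two mantissa decimals shown. The Python sketch below encodes that reading; it is inferred from the examples rather than from the program source, the function name shifted_delta is hypothetical, and a three-digit mantissa (as in 1.23e-02) is assumed.

```python
from decimal import Decimal


def shifted_delta(crfun, delta):
    """Scale an integer delta to the least significant digit of crfun."""
    # Assumption: crfun is written with two decimals in the mantissa, e.g. 1.23e-02.
    exponent = int(f"{crfun:.2e}".split("e")[1])
    # Decimal keeps the result exact (0.0001 rather than a binary approximation).
    return Decimal(delta).scaleb(exponent - 2)


fractional_case = shifted_delta(0.0123, 1)   # crfun = 1.23e-02, delta = 1
whole_case = shifted_delta(12300, 2)         # crfun = 1.23e+04, delta = 2
```

The same rule covers both the fractional and the whole-number case, because only the exponent of the crfun value enters the calculation.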
Specifies the threshold value that should be used by the PK1 measure to predict the k value. Default = -0.7
NOTE: This option should be used only when --cluststop option is also used with option value of "all" or "pk1".
The number of replicates/references to be generated. Default: 1
Specifies whether to generate B replicates from a reference or to generate B references.
The possible option values:
rep - replicates [Default]
ref - references
Specifies the percentage confidence to be reported in the log file. Since Gap Statistic uses parametric bootstrap method for reference distribution generation, it is critical to understand the interval around the sample mean that could contain the population ("true") mean and with what certainty. Default: 90
The seed to be used with the random number generator. Default: No seed is set.
Specifies number of clusters to be created. Default is set to 2.
Specifies whether clustering is to be performed in vector or similarity space. Set the value of SPACE to 'vector' to perform clustering in vector space i.e. to cluster the context vectors directly. To cluster in similarity space by explicitly finding the pair-wise similarities among the contexts, set SPACE to 'similarity'.
By default, clustering is performed in vector space.
Specifies the clustering method.
Possible option values are:
rb - Repeated Bisections [Default]
rbr - Repeated Bisections followed by k-way refinement
direct - Direct k-way clustering
agglo - Agglomerative clustering
graph - Graph partitioning-based clustering
bagglo - Partitional biased Agglomerative clustering
For large amounts of data, 'rb', 'rbr' or 'direct' are recommended.
Selects the criteria function for Clustering. The meanings of these criteria functions are explained in Cluto's manual.
The possible values are:
i1 - I1 Criterion function
i2 - I2 Criterion function [default for partitional]
e1 - E1 Criterion function
g1 - G1 Criterion function
g1p - G1' Criterion function
h1 - H1 Criterion function
h2 - H2 Criterion function
slink - Single link merging scheme
wslink - Single link merging scheme weighted w.r.t. cluster sim
clink - Complete link merging scheme
wclink - Complete link merging scheme weighted w.r.t. cluster sim
upgma - Group average merging scheme [default for agglomerative]
Note that only the i1, i2, e1, h1 and h2 criterion functions can be used for cluster stopping. If a crfun other than these is selected, cluster stopping uses the default crfun (i2) while the final clustering of contexts is performed using the crfun specified.
Specifies the similarity measure to be used for either vector or similarity space clustering.
When --space = vector (or default), possible values of SIM are:
cos - Cosine [default]
corr - Correlation Coefficient
dist - Euclidean distance
jacc - Extended Jaccard Coefficient
When --space = similarity and --binary is ON, possible values of SIM are:
cos - Cosine [default]
mat - Match
jac - Jaccard
ovr - Overlap
dic - Dice
Otherwise, only cosine measure is available and is default.
The following table summarizes the availability of similarity measures for the 2 clustering approaches, vector (vcl) and similarity (scl), and for the 2 types of context vectors, binary (bin) and frequency (freq):
        vcl+bin   vcl+freq   scl+bin   scl+freq
cos        Y         Y          Y         Y
mat        N         N          Y         N
jacc       Y         Y          Y         N
dice       N         N          Y         N
ovr        N         N          Y         N
dist       Y         Y          N         N
corr       Y         Y          N         N
These differences are purely due to implementation issues; in the future, we plan to support a more consistent set of measures across these combinations.
This option specifies the model to be used to scale every column of each row. (For further details please refer to the Cluto manual.)
The possible values for RMOD are:
none - no scaling is performed (default setting)
maxtf - post scaling the values are between 0.5 and 1.0
sqrt - square-root of actual values
log - log of actual values
This option specifies the model to be used to (globally) scale each column across all rows. (For further details please refer to the Cluto manual.)
The possible values for CMOD are:
none - no scaling is performed (default setting)
idf - scaling according to inverse-document-frequency
Note: Labeling options cannot be used with word-clustering (--wordclust).
LABEL_STOPFILE can be specified in one of two modes:
AND mode - declared by including '@stop.mode=AND' on the first line of the LABEL_STOPFILE; ignores word pairs in which both words are stop words.
OR mode - declared by including '@stop.mode=OR' on the first line of the LABEL_STOPFILE; ignores word pairs in which either word is a stop word.
Default is OR mode.
Removes bigrams that occur fewer than LABEL_N times.
Specifies the window size for bigrams. Pairs of words that co-occur within the specified window from each other (window LABEL_W allows at most LABEL_W-2 intervening words) will form the bigram features. Default window size is 2 which allows only consecutive word pairs.
Specifies the statistical scores of association.
Word pairs ranking below LABEL_R when arranged in descending order of their test scores are ignored.
Evaluates clustering performance by computing precision and recall for the maximally accurate assignment of sense tags to clusters. The maximal assignment is the one in which clusters are given sense labels such that the maximum number of instances are attached to their true sense tags.
TEST instances tagged with multiple senses are automatically attached with the single sense-tag that is the most frequent among the attached tags.
Note: This option can be used only if the answer tags are provided in the TEST file.
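The idea of a maximal assignment can be illustrated with a brute-force Python sketch that tries every way of giving each cluster a distinct sense label and keeps the labeling with the most correctly tagged instances. This is only an illustration: the function name best_assignment is hypothetical, a one-to-one mapping over a square confusion table is assumed, and the actual program may solve the assignment differently.

```python
from itertools import permutations


def best_assignment(confusion):
    """Find the cluster-to-sense labeling that maximizes correct instances.

    confusion[c][s] is the number of instances in cluster c whose true
    sense is s. Assumption: as many clusters as senses (square table).
    """
    n = len(confusion)

    def correct(labeling):
        return sum(confusion[c][labeling[c]] for c in range(n))

    # Brute force over all one-to-one labelings; fine for a handful of senses.
    best = max(permutations(range(n)), key=correct)
    return best, correct(best)


# Made-up confusion table: 2 clusters (rows) by 2 senses (columns).
labeling, n_correct = best_assignment([[1, 9], [8, 2]])
```

Here cluster 0 is labeled with sense 1 and cluster 1 with sense 0, which tags 17 of the 20 instances correctly.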
Removes low frequency senses during evaluation. This removes the senses that rank below R when the senses in TEST are arranged in descending order of their frequencies. In other words, it selects the top R most frequent senses. An instance is removed if all of its sense tags rank below R.
Removes low frequency senses based on their percentage frequencies. This removes senses whose frequency is below P% in the TEST data.
If rank or percent filters are specified, they are applied after removing the multiple sense tags.
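The rank and percentage filters can be sketched on a list of sense tags, one per instance (that is, after multiple sense tags have been reduced to a single tag). The function name frequent_senses and the toy tags are hypothetical, and the way ties between equally frequent senses are ranked is an assumption.

```python
from collections import Counter


def frequent_senses(tags, rank=None, percent=None):
    """Return the set of senses that survive the rank and/or percent filter."""
    counts = Counter(tags)
    # Senses in descending order of frequency (ties keep first-seen order).
    kept = [sense for sense, _ in counts.most_common()]
    if rank is not None:
        kept = kept[:rank]  # keep only the top R most frequent senses
    if percent is not None:
        total = len(tags)
        kept = [s for s in kept if 100 * counts[s] / total >= percent]
    return set(kept)


# Made-up tags: sense "a" covers 60%, "b" 30% and "c" 10% of the instances.
tags = ["a"] * 6 + ["b"] * 3 + ["c"]
top_two_senses = frequent_senses(tags, rank=2)
at_least_half = frequent_senses(tags, percent=50)
```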
Displays the quick summary of program options.
Displays the version information.
Displays to STDERR the current program status.
Displays to STDOUT values of compulsory and required parameters. [NOT SUPPORTED IN THIS VERSION]
discriminate.pl creates several output files. The discrimination of contexts performed by discriminate.pl, (i.e., a cluster assigned to each context) is given by the file $PREFIX.clusters if the number of clusters was set manually, otherwise by the file $PREFIX.clusters.$CLUSTSTOP where the $CLUSTSTOP specifies the cluster stopping measure that was used to predict the number of clusters.
In addition, discriminate.pl also creates the following files:
NOTE: If a cluster stopping measure was used, this is indicated in the names of several output files by appending the cluster stopping measure name to the file name. This is represented below as filename[.$CLUSTSTOP]
$PREFIX.clusters_context[.$CLUSTSTOP] - File containing all the input instances grouped by the cluster-id assigned to them.
$PREFIX[.$CLUSTSTOP].cluster.CLUSTERID - All the identified clusters and their instances are separated into different files. The filenames end with the cluster-id. e.g.: File containing instances of cluster 0 will be named as $PREFIX.cluster.0
$PREFIX.report[.$CLUSTSTOP] - Confusion table if --eval is ON
$PREFIX.cluster_labels[.$CLUSTSTOP] - List of labels (word-pairs) assigned to each cluster.
$PREFIX[.$CLUSTSTOP].dendogram.ps - Dendrograms + some information.
$PREFIX.features - Features file
$PREFIX.regex - File containing regular expressions for identifying the features listed in $PREFIX.features file.
$PREFIX.testregex - File containing only those regular expressions from the $PREFIX.regex file above which match at least once in the test contexts. This file is created only in second order context clustering mode (SC native as well as LSA) and in LSA feature clustering mode.
$PREFIX.wordvec - Word Vectors if --context = o2
$PREFIX.vectors - Context Vectors
$PREFIX.rlabel - Row Labels of $PREFIX.vectors
$PREFIX.clabel - Column Labels of $PREFIX.vectors
$PREFIX.rclass - Class Ids of $PREFIX.vectors if --eval is ON
$PREFIX.cluster_solution[.$CLUSTSTOP] - Cluster ids of $PREFIX.vectors
$PREFIX.cluster_output[.$CLUSTSTOP] - Clustering program output
$PREFIX.pk1 - crfun[k] values, delta values, PK1[k] values and predicted k value
$PREFIX.pk2 - crfun[k] values, delta values, PK2[k] values and predicted k value
$PREFIX.pk3 - crfun[k] values, delta values, PK3[k] values and predicted k value
$PREFIX.gap - crfun[k] values, delta values and predicted k value
$PREFIX.gap.log - Gap(k), Obs(crfun(k)), Exp(crfun(k)) values etc.
$PREFIX.cr.dat - value-pairs :- k-value crfun-value
$PREFIX.pk1.dat - value-pairs :- k-value PK1[k] value
$PREFIX.pk2.dat - value-pairs :- k-value PK2[k] value
$PREFIX.pk3.dat - value-pairs :- k-value PK3[k] value
$PREFIX.gap.dat - value-pairs :- k-value Gap[k] value
$PREFIX.exp.dat - value-pairs :- k-value Exp(crfun[k]) value
Ted Pedersen, University of Minnesota, Duluth
tpederse at d.umn.edu
Amruta Purandare, University of Pittsburgh
Anagha Kulkarni, Carnegie-Mellon University
Mahesh Joshi, Carnegie-Mellon University
Copyright (c) 2002-2008, Ted Pedersen, Amruta Purandare, Anagha Kulkarni, Mahesh Joshi
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program; if not, write to
The Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.