README.Toolkit
Description of SenseClusters Toolkit directory structure
This document describes the structure of the Toolkit directory
and briefly summarizes what each program does.
Directories are indicated with a / at the end of their name (preprocess/) while programs end with the .pl suffix.
All of this is contained in the Toolkit/ directory.
Note that these are organized roughly in the order in which they will be used by SenseClusters.
Please review the flowcharts found in doc/Flowcharts for additional information.
- plain/ (processes input in plain text format)
- text2sval.pl - Convert simple plain text into Senseval-2 format
- sval2/ (processes input in Senseval-2 format)
- balance.pl - Balances sense distribution in a Senseval-2 input file by removing some instances
- filter.pl - Removes instances associated with low frequency sense tags from Senseval-2 input
- frequency.pl - Displays frequency distribution of senses
- keyconvert.pl - Convert KEY file from Senseval-2 format to SenseClusters format
- maketarget.pl - Create a Perl regex for the target word by spotting all <head> tags in the given file
- prepare_sval2.pl - Prepare Senseval-2 data for experiments
- preprocess.pl - Tokenize and optionally split Senseval-2 input into training and test portions
- sval2plain.pl - Convert a Senseval-2 input file to plain text format
- windower.pl - Cut a context window of W words around a target word in a given Senseval-2 input file
- reduce-count.pl - Reduce the size of the Text-NSP output created with huge training data
- bitsimat.pl - Create a similarity matrix for given bit vectors
- simat.pl - Create a similarity matrix for given non-binary (integer or real) vectors
- nsp2regex.pl - Creates regular expressions from Text-NSP output to represent features
- order1vec.pl - Creates first order context vectors
- order2vec.pl - Creates second order context vectors
- wordvec.pl - Creates word vectors from Text-NSP output
- mat2harbo.pl - Convert matrices from SenseClusters format to Harwell-Boeing format
- svdpackout.pl - Reconstruct a matrix from its singular vectors as found by SVDPACKC
- clusterstopping.pl - Predicts the number of clusters into which a given data set should be divided.
Provides three such cluster stopping measures.
- cluto2label.pl - Convert clustering output of Cluto to a cluster by sense confusion matrix for evaluation
- format_clusters.pl - Display the clustered contexts with their assigned sense id, or display them in Senseval-2 format with the assigned sense id
- label.pl - Assign sense tags to the discovered clusters for evaluation
- report.pl - Report performance in terms of precision, recall, and F-Measure, and show a confusion matrix
- clusterlabeling.pl - Selects significant word pairs from the contents/instances of the clusters and assigns them as labels to the clusters.
Also creates a separate file for each cluster.
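To make the evaluation step listed above (label.pl and report.pl) more concrete, here is a minimal sketch in Python of how clusters might be mapped to senses and scored from a cluster-by-sense confusion matrix. It is an illustration only, not the toolkit's Perl implementation. The function names best_mapping and evaluate are hypothetical, and the sketch assumes that precision is computed over the instances that were actually clustered, that recall is computed over all instances, and that there are no more clusters than senses.

```python
from itertools import permutations


def best_mapping(confusion):
    """Assign each cluster (row) a distinct sense (column) so that the
    number of agreeing instances is as large as possible.

    Brute force over all assignments; only suitable for small matrices.
    Hypothetical helper, not part of SenseClusters.
    """
    n_clusters, n_senses = len(confusion), len(confusion[0])
    if n_clusters > n_senses:
        raise ValueError("this sketch assumes no more clusters than senses")
    best = max(
        permutations(range(n_senses), n_clusters),
        key=lambda senses: sum(confusion[c][s] for c, s in enumerate(senses)),
    )
    return list(best)


def evaluate(confusion, total_instances):
    """Return (mapping, precision, recall, f_measure).

    Assumed definitions for this sketch:
      precision = correct / instances that were clustered
      recall    = correct / all instances (clustered or not)
    """
    mapping = best_mapping(confusion)
    correct = sum(confusion[c][s] for c, s in enumerate(mapping))
    clustered = sum(sum(row) for row in confusion)
    precision = correct / clustered if clustered else 0.0
    recall = correct / total_instances if total_instances else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return mapping, precision, recall, f_measure


# Two clusters (rows) by two senses (columns); 5 of 25 instances unclustered.
confusion = [[8, 2],
             [1, 9]]
mapping, precision, recall, f_measure = evaluate(confusion, total_instances=25)
print(mapping, round(precision, 2), round(recall, 2), round(f_measure, 4))
```

In this made-up example, cluster 0 is labeled with sense 0 and cluster 1 with sense 1, so 17 of the 20 clustered instances agree with their assigned sense.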
This work has been partially supported by a National Science Foundation Faculty Early CAREER Development award (#0092784).
Copyright 2003-2008, Ted Pedersen
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.
Note: a copy of the GNU Free Documentation License is available on the web at http://www.gnu.org/copyleft/fdl.html and is included in this distribution as FDL.txt.