Ted Pedersen > Text-SenseClusters-1.05 >



NAME

Preprocess Senseval-2 data for sample experiments


A Perl script that preprocesses and prepares DATA for experiments with SenseClusters.

USAGE

[Options] DATA

Run the script with --help for a quick summary of the options.


Required Arguments:


SenseClusters requires input DATA in Senseval-2 format. DATA in any other format must first be converted to this format. This distribution provides a pre-processing program Toolkit/preprocess/plain/ that converts plain text data (with the context of a single instance on each line) into Senseval-2 format.

SenseClusters uses an unsupervised clustering approach and hence doesn't require DATA to be sense tagged at all. However, if the true sense classes of the DATA instances are available, those could be used for evaluation.

Optional Arguments:

Data Options (--training | --split)?

Data Options specify the input DATA to be processed. There are three different possibilities, which can be denoted by the following regex: DATA (--training TRAIN | --split P)?

1. If neither --training nor --split options are used, DATA will be clustered using the features extracted from the same DATA file.

2. If a separate training file is provided via the --training TRAIN option, DATA will be clustered using features extracted from the given TRAIN file. The TRAIN file is expected to be in Senseval-2 format.

3. If the DATA file is provided with the --split P option, (100-P)% of DATA will be clustered using features extracted from the remaining P% of DATA.

Thus, DATA can be provided as a single DATA file, with the --split P option, or with a separate training file via the --training TRAIN option. The --split and --training options cannot be used together.
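The three cases above can be sketched in a few lines of Python. This is only an illustration, not the script's own code: the function names are hypothetical, and because the text does not say how the P% portion is chosen, the sketch simply assumes the first P% of the instances go to training.

```python
import argparse


def parse_data_options(argv):
    """Parse DATA plus at most one of --training TRAIN or --split P."""
    parser = argparse.ArgumentParser()
    parser.add_argument("data")
    # argparse rejects command lines that use both options together.
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--training", metavar="TRAIN")
    group.add_argument("--split", metavar="P", type=float)
    return parser.parse_args(argv)


def split_instances(instances, p):
    """Return (training, test): P% for features, (100-P)% to be clustered.

    Assumption: the first P% of the instances become training data; the
    real script may select them differently (for example, at random).
    """
    if not 0 < p < 100:
        raise ValueError("P must be strictly between 0 and 100")
    cut = round(len(instances) * p / 100)
    return instances[:cut], instances[cut:]
```

With neither option given, both `training` and `split` are `None`, which corresponds to case 1: features are extracted from DATA itself.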

Sense Tag Options :

If the correct answer tags of the DATA instances are known, these can be used for performing some special tasks like evaluating results or filtering low frequency senses.

--key KEY

Specifies the true sense tags of the DATA instances.

Each line in the KEY file should show

   <instance id=\"I\"\/>  [<sense id=\"S\"\/>]+

where an Instance-Id is followed by its true sense tag/s.
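As an illustration of this format, the following Python sketch parses one KEY line into an instance id and its sense ids. It assumes the backslashes in the pattern above are regex escapes, so that an actual KEY line looks like `<instance id="I"/> <sense id="S"/>`; the function name and the sample ids in the usage note are hypothetical.

```python
import re

INSTANCE_RE = re.compile(r'<instance id="([^"]+)"\s*/>')
SENSE_RE = re.compile(r'<sense id="([^"]+)"\s*/>')


def parse_key_line(line):
    """Return (instance_id, [sense_ids]) for a single KEY line."""
    line = line.strip()
    inst = INSTANCE_RE.match(line)
    if inst is None:
        raise ValueError(f"line does not start with an instance tag: {line!r}")
    # Every sense tag after the instance tag is a true sense of that instance.
    senses = SENSE_RE.findall(line, inst.end())
    if not senses:
        raise ValueError(f"instance {inst.group(1)!r} has no sense tags")
    return inst.group(1), senses
```

For example, `parse_key_line('<instance id="x.1"/> <sense id="s1"/> <sense id="s2"/>')` yields the instance id `x.1` with the two sense ids `s1` and `s2`.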

A KEY file in any other format must first be converted to this format. This distribution provides a script (in Toolkit/preprocess/sval2) that converts a KEY file in Senseval-2 format (such as fine.key) to SenseClusters' KEY format.

If the KEY file is not specified and if any sense-tag options like evaluation or sense-filter are used in wrapper, SenseClusters assumes that the sense ids are embedded in the same DATA file and these will be searched by matching an expression


assuming that SID shows a true sense id of an immediately preceding instance ID matched by /instance id=\"ID\"/ expression.

Tokenization Options :


--token TOKENFILE

TOKENFILE is a file of Perl regexes that define the tokenization scheme used in DATA. A sample regex file, token.regex, is provided with this distribution. If the user doesn't specify TOKENFILE via the --token option, token.regex is looked for in the current directory.


--nontoken NONTOKENFILE

NONTOKENFILE is a file of Perl regexes that define strings to be removed prior to tokenization. If NONTOKENFILE is not specified, only those string sequences that are not tokens will be removed.
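The interaction between token and nontoken regexes can be sketched as follows. This is a rough Python approximation under stated assumptions, not the distribution's tokenizer: it assumes each file holds one regex per line written between slashes (for example `/\w+/`), that regexes are tried in file order at each position, and that characters matching no token regex are skipped. Python's `re` syntax also differs from Perl's in some details.

```python
import re


def load_regexes(lines):
    """Compile one /regex/ per line; other lines are ignored (assumed format)."""
    patterns = []
    for line in lines:
        line = line.strip()
        if len(line) >= 2 and line.startswith("/") and line.endswith("/"):
            patterns.append(re.compile(line[1:-1]))
    return patterns


def tokenize(text, token_res, nontoken_res=()):
    """Remove nontoken matches first, then split the rest into tokens."""
    for pat in nontoken_res:
        text = pat.sub(" ", text)
    tokens, pos = [], 0
    while pos < len(text):
        for pat in token_res:
            m = pat.match(text, pos)
            if m and m.end() > pos:  # ignore empty matches
                tokens.append(m.group())
                pos = m.end()
                break
        else:
            pos += 1  # no token starts here, so this character is dropped
    return tokens
```

The `else` branch mirrors the default described above: when no NONTOKENFILE is given, only the character sequences that fail to match any token regex are discarded.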

Other Options :


Displays to STDERR the current program status. Silent by default.


Displays to STDOUT values of required and option arguments.


--help

Displays this message.


Displays the version information.

OUTPUT

Preprocesses the given DATA in the following ways:

  1. Creates a LexSample directory that contains a separate directory for each <lexelt> found in the DATA file. Each LEXELT directory (found within LexSample) will contain the following files:
    • LEXELT-test.xml

      A test XML file containing the instances of a single LEXELT from the given DATA file. If --split P is specified, LEXELT-test.xml will have (100-P)% of the DATA instances whose <lexelt> tag value is LEXELT. Otherwise, it will have all DATA instances that come under the LEXELT item.

    • LEXELT-test.count

      A count file containing instance data within <context> and </context> tags on a single line for each instance appearing in corresponding LEXELT-test.xml.

    • [LEXELT-training.xml]

      This file is created only if the --training or --split option is used. If --split P is used, it will contain P% of the DATA instances whose <lexelt> tag value is LEXELT. Otherwise, it will have all instances from the TRAIN file that come under the LEXELT tag.

    • [LEXELT-training.count]

      This file is created only if the corresponding LEXELT-training.xml is created. It contains the instance data within <context> and </context> tags on a single line for each instance appearing in LEXELT-training.xml.

    Additionally, LexSample directory will have files token.regex and nontoken.regex if --nontoken option is used.

  2. Converts data within <context> and </context> tags to lowercase.
  3. If a KEY file is specified via the --key option, answer tags are stored along with the instances in the corresponding LEXELT files. These tags are accessed only by the sense tag options, such as evaluation or filtering of low frequency senses. Sense tags are ignored during clustering and feature selection. They are included in the same XML files purely as a programming convenience.
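The layout described in steps 1 and 2 can be illustrated with a short Python sketch. It is not the script's implementation: the function names are hypothetical, it assumes Senseval-2 markup of the form `<lexelt item="LEXELT">` enclosing `<context>` blocks, and it uses regular expressions rather than a real XML parser.

```python
import re

LEXELT_RE = re.compile(r'<lexelt item="([^"]+)"[^>]*>(.*?)</lexelt>', re.S)
CONTEXT_RE = re.compile(r"<context>(.*?)</context>", re.S)


def count_lines_by_lexelt(senseval2_text):
    """Map each lexelt item to its contexts, one lowercased line per instance."""
    result = {}
    for item, body in LEXELT_RE.findall(senseval2_text):
        lines = result.setdefault(item, [])
        for context in CONTEXT_RE.findall(body):
            # Collapse all whitespace so the context fits on a single line.
            lines.append(" ".join(context.split()).lower())
    return result


def output_paths(lexelt, with_training=False):
    """List the files expected under LexSample/LEXELT for one lexelt."""
    kinds = ["test"] + (["training"] if with_training else [])
    return [
        f"LexSample/{lexelt}/{lexelt}-{kind}.{ext}"
        for kind in kinds
        for ext in ("xml", "count")
    ]
```

Here `with_training=True` stands for a run with --training or --split, the only cases in which the training files are created.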

SYSTEM REQUIREMENTS

Uses a preprocessing program which is included in SenseClusters.

SEE ALSO ^ uses the training and test files created by


 Amruta Purandare, University of Pittsburgh

 Ted Pedersen, University of Minnesota, Duluth
 tpederse at


Copyright (c) 2002-2008, Amruta Purandare and Ted Pedersen

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to

 The Free Software Foundation, Inc.,
 59 Temple Place - Suite 330,
 Boston, MA  02111-1307, USA.