
NAME

windower.pl - Limit window of context around a target word specified in a Senseval-2 input file

SYNOPSIS

Suppose we have a very small Senseval-2 file (small.xml) with just 2 instances. To limit the surrounding context to 5 words to the left and 5 words to the right of the target word, run:

 windower.pl small.xml 5

Output =>

 <?xml version="1.0" encoding="iso-8859-1" ?>
 <corpus lang='english' tagged="NO">
 <lexelt item="begin.v">
 <instance id="begin.555">
 <answer instance="begin.555" senseid="begin%2:30:01::"/>
 <context>
 greats hardly knowns and unknowns <head>begin</head> a game three month season
 </context>
 </instance>
 <instance id="begin.557">
 <answer instance="begin.557" senseid="begin%2:30:01::"/>
 <context>
 late november it expects to <head>begin</head> construction by year end and
 </context>
 </instance>
 </lexelt>
 </corpus>

These two instances come from the first two lines of the file begin.v-test.xml. You can see the full contexts at /samples/Data.

Type windower.pl --help for a quick summary of options.

DESCRIPTION

Limits the contexts of given instances to W tokens around the target word.

USAGE

windower.pl [OPTIONS] SVAL2 W

INPUT

Required Arguments:

SVAL2

SVAL2 must be a tokenized and preprocessed instance file in the Senseval-2 format.

W

W must be a positive integer specifying the window size. windower displays only the tokens that fall within the window [-W, +W] centered on the target word, i.e. at most W tokens to its left and W tokens to its right.
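The windowing step can be sketched as follows. This is a minimal illustration and not the actual windower.pl source: the helper name window_tokens is hypothetical, the context is assumed to be tokenized already, and the sketch assumes that only the first token matching the target regex is used and that a context without a target yields nothing.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of windower.pl): keep at most $w tokens
# on each side of the first token that matches $target_re.
sub window_tokens {
    my ($tokens, $target_re, $w) = @_;

    # Index of the first token matching the target regex (assumption:
    # later matches are ignored).
    my ($pos) = grep { $tokens->[$_] =~ $target_re } 0 .. $#{$tokens};
    return () unless defined $pos;

    # Clamp the window [-W, +W] to the bounds of the context.
    my $start = $pos - $w < 0            ? 0            : $pos - $w;
    my $end   = $pos + $w > $#{$tokens}  ? $#{$tokens}  : $pos + $w;

    return @{$tokens}[$start .. $end];
}

# Illustrative context: the first and last tokens fall outside a window of 5.
my @context = qw(the greats hardly knowns and unknowns
                 <head>begin</head> a game three month season today);

print join(' ', window_tokens(\@context, qr/<head>\w+<\/head>/, 5)), "\n";
```

With W = 5 the sketch drops "the" and "today" and keeps the five tokens on each side of the target. When the target sits near the start or end of a context, fewer than W tokens are kept on that side.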

Optional Arguments:

--plain

Displays the output in plain text format, with the context of each instance on its own line, so that the i'th line on stdout shows the context of the i'th instance in the given SVAL2 file. By default, output is created in Senseval-2 format.

--token TOKENREGEX

TOKENREGEX should be a file containing Perl regular expressions that define the tokenization scheme in SVAL2. windower recognizes only those character sequences in SVAL2 that match the specified token regexes; everything else is ignored. If --token is not specified, windower looks for the default token.regex file in the current directory.
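To illustrate how a list of token regexes can define a tokenization scheme, here is a minimal sketch. It is not taken from windower.pl: the two regex lines are assumed file contents, the tokenize routine is hypothetical, and the strategy shown (try each regex in order at the current position, otherwise skip one character) is an assumption about how "everything else is ignored" might be implemented.

```perl
use strict;
use warnings;

# Lines as they might appear in a token regex file (assumed contents),
# one /regex/ per line.
my @regex_lines = ('/<head>\w+<\/head>/', '/\w+/');

# Strip the enclosing slashes and compile each regex; malformed lines are dropped.
my @token_res = map {
    my ($body) = m{^\s*/(.*)/\s*$};
    defined $body ? qr/$body/ : ();
} @regex_lines;

# Hypothetical tokenizer: at each position, take the first regex that
# matches there; if none matches, ignore one character and move on.
sub tokenize {
    my ($text, @res) = @_;
    my @tokens;
    pos($text) = 0;
    SCAN: while (pos($text) < length $text) {
        for my $re (@res) {
            if ($text =~ /\G($re)/gc && length $1) {
                push @tokens, $1;
                next SCAN;
            }
        }
        # No token regex matched a non-empty string here: skip one character.
        pos($text) = pos($text) + 1;
    }
    return @tokens;
}

print join("\n", tokenize('to <head>begin</head> construction, by year-end', @token_res)), "\n";
```

In this sketch the order of the regexes matters: the tagged-target pattern comes first so that <head>begin</head> is kept as one token instead of being split by /\w+/. Punctuation and whitespace match neither regex and are skipped.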

--target TARGETREGEX

TARGETREGEX should be a file containing Perl regular expressions that define the target word(s). Target words must be valid tokens recognized by the specified tokenization scheme (via --token or token.regex).

The following are some examples of TARGET word regex files:

  1.  /<head>[Ll]ines?<\/head>/

    which specifies that the target word could be

     line, Line, lines or Lines 

    delimited in <head> and </head> tags.

  2. The regex above can also be written as multiple regexes in TARGET:

     /<head>line<\/head>/
    
     /<head>lines<\/head>/
    
     /<head>Line<\/head>/
    
     /<head>Lines<\/head>/

    with a single regex per line.

  3. Regex

     /<head>\w+<\/head>/

    is a more general regex that matches any target word marked in <head> tags.

  4. Regex

     /<head.*>\w+<\/head>/

    matches target words in the original Senseval-2 data, where the <head> tag may carry additional text before the closing >.

  5.  /[Ll]ines?/

    specifies that any occurrence of Line, line, Lines or lines is a target word, without requiring any special delimiting tags.
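The equivalence between examples 1 and 2 can be checked with a short sketch. The helper names compile_lines and is_target are hypothetical and not part of windower.pl; the sketch assumes each line of a TARGET file has the form /regex/ and that a token is a target if it matches any one of the regexes.

```perl
use strict;
use warnings;

# Hypothetical helper: compile lines of the form /regex/, one regex per line.
sub compile_lines {
    return map {
        my ($body) = m{^\s*/(.*)/\s*$};
        defined $body ? qr/$body/ : ();
    } @_;
}

# Hypothetical helper: true if the token matches any of the compiled regexes.
sub is_target {
    my ($token, @res) = @_;
    for my $re (@res) {
        return 1 if $token =~ $re;
    }
    return 0;
}

# Example 1: a single regex. Example 2: the same set spelled out line by line.
my @single = compile_lines('/<head>[Ll]ines?<\/head>/');
my @multi  = compile_lines(
    '/<head>line<\/head>/',
    '/<head>lines<\/head>/',
    '/<head>Line<\/head>/',
    '/<head>Lines<\/head>/',
);

for my $token ('<head>line</head>', '<head>Lines</head>', '<head>liner</head>', 'line') {
    printf "%-20s single=%d multi=%d\n",
        $token, is_target($token, @single), is_target($token, @multi);
}
```

Both forms accept the tagged tokens line and Lines and reject the tagged token liner and the untagged token line, so either style of TARGET file selects the same target words in this sketch.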

Other Options :

--help

Displays this message.

--version

Displays the version information.

OUTPUT

When --plain is not selected, the output is in Senseval-2 format and looks the same as the input SVAL2 file, except that the context of each instance shows at most W words on each side of the target word.

When --plain is selected, the output shows each context on a single line, i.e. the context of the i'th instance in the given SVAL2 file is shown on the i'th line on stdout.

AUTHORS

Amruta Purandare, University of Pittsburgh

Ted Pedersen, University of Minnesota, Duluth tpederse at d.umn.edu

COPYRIGHT

Copyright (c) 2002-2008, Amruta Purandare and Ted Pedersen

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to

The Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.