NAME

windower.pl - Limit window of context around a target word specified in a Senseval-2 input file

SYNOPSIS

Suppose we have a very small Senseval-2 file (small.xml) with just 2 instances. We would like to limit the surrounding context to 5 words to the left and 5 words to the right of the target word:

 windower.pl small.xml 5

Output =>

 <?xml version="1.0" encoding="iso-8859-1" ?>
 <corpus lang='english' tagged="NO">
 <lexelt item="begin.v">
 <instance id="begin.555">
 <answer instance="begin.555" senseid="begin%2:30:01::"/>
 <context>
 greats hardly knowns and unknowns <head>begin</head> a game three month season
 </context>
 </instance>
 <instance id="begin.557">
 <answer instance="begin.557" senseid="begin%2:30:01::"/>
 <context>
 late november it expects to <head>begin</head> construction by year end and
 </context>
 </instance>
 </lexelt>
 </corpus>

This output comes from the first two instances in the file begin.v-test.xml. The full contexts are available in /samples/Data.

Type windower.pl --help for a quick summary of options.

DESCRIPTION

Limits the context of each given instance to at most W tokens on either side of the target word.

USAGE

windower.pl [OPTIONS] SVAL2 W

INPUT

Required Arguments:

SVAL2

SVAL2 must be a tokenized and preprocessed instance file in the Senseval-2 format.

W

W must be a positive integer specifying the window size. windower.pl outputs only the tokens that fall within the window [-W, +W] centered on the target word, that is, at most W tokens to its left and at most W tokens to its right.
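The [-W, +W] window can be illustrated with a short Python sketch. This is not the code used by windower.pl; the function name window_tokens is hypothetical, the context is assumed to be already tokenized, and the words "the" and "on" at either end of the sample were added here only so that the truncation is visible.

```python
def window_tokens(tokens, target_index, w):
    """Return the tokens within w positions of tokens[target_index]."""
    if w <= 0:
        raise ValueError("W must be a positive integer")
    # Clamp the left edge so a target near the start keeps what is available.
    start = max(0, target_index - w)
    # Slicing past the end of the list is safe in Python, so the right
    # edge needs no explicit clamp.
    return tokens[start:target_index + w + 1]


# Hypothetical pre-tokenized context with one extra word on each side.
context = ("the greats hardly knowns and unknowns <head>begin</head> "
           "a game three month season on").split()
target = context.index("<head>begin</head>")
print(" ".join(window_tokens(context, target, 5)))
```

When the target word has fewer than W tokens on one side, the sketch simply returns the tokens that exist, which is why the window holds "at most" W tokens per side.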

Optional Arguments:

--plain

Output is displayed in plain text format, with the context of each instance on its own line; that is, the i-th line on stdout shows the context of the i-th instance in the given SVAL2 file. By default, output is created in Senseval-2 format.

--token TOKENREGEX

TOKENREGEX should be a file containing Perl regular expressions that define the tokenization scheme in SVAL2. windower.pl recognizes only those character sequences from SVAL2 that match the specified token regexes; everything else is ignored. If --token is not specified, windower.pl looks for the default file token.regex in the current directory.
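As an illustration of regex-driven tokenization, the following Python sketch keeps only the character sequences that match one of the token patterns and drops everything else. It rests on assumptions that this page does not confirm: the patterns are tried in order at each position, the first match wins, and an unmatched character is skipped. The function name tokenize and the two patterns are hypothetical, and the patterns are written in Python syntax rather than as a Perl regex file.

```python
import re


def tokenize(text, token_patterns):
    """Collect substrings matching any pattern; ignore all other characters."""
    compiled = [re.compile(p) for p in token_patterns]
    tokens, pos = [], 0
    while pos < len(text):
        for rx in compiled:
            m = rx.match(text, pos)  # anchored at the current position
            if m and m.end() > pos:
                tokens.append(m.group())
                pos = m.end()
                break
        else:
            # No pattern matches here: treat the character as non-token text.
            pos += 1
    return tokens


# Hypothetical scheme: tagged target words first, then plain words.
patterns = [r"<head>\w+</head>", r"\w+"]
tokens = tokenize("expects to <head>begin</head> construction, soon!", patterns)
print(tokens)
```

Listing the <head> pattern first keeps the tagged target word together as a single token; with only the plain word pattern, the tags would be discarded and the target word would lose its marking.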

--target TARGETREGEX

Specify a file containing Perl regular expressions that define the target words. Target words must be valid tokens recognizable by the specified tokenization scheme (via --token or token.regex).

The following are some examples of TARGET regex files:

  1. Regex
     /<head>[Ll]ines?<\/head>/

    specifies that the target word can be

     line, Line, lines or Lines

    delimited by <head> and </head> tags.

  2. The regex above can also be written as multiple regexes in TARGET, with a single regex per line:
     /<head>line<\/head>/
     /<head>lines<\/head>/
     /<head>Line<\/head>/
     /<head>Lines<\/head>/

  3. Regex
     /<head>\w+<\/head>/

    is a more general regex that matches any target word marked in <head> tags.

  4. Regex
     /<head.*>\w+<\/head>/

    matches target words as they are marked in the original Senseval-2 data.

  5. Regex
     /[Ll]ines?/

    specifies that any occurrence of the words Line, line, Lines or lines is a target word, without requiring any special delimiting tags.
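The equivalence between examples 1 and 2 can be checked with a small Python sketch. The regexes are transcribed into Python syntax (no surrounding slashes, no escaped slash), and the helper is_target is hypothetical. It assumes that a target regex is tested against one whole token at a time, which is an assumption of this sketch rather than documented behavior of windower.pl.

```python
import re

# Example 1: a single regex covering all four surface forms.
single_regex = [re.compile(r"<head>[Ll]ines?</head>")]

# Example 2: the same set of forms written as one regex per line.
split_regexes = [re.compile(p) for p in (
    r"<head>line</head>",
    r"<head>lines</head>",
    r"<head>Line</head>",
    r"<head>Lines</head>",
)]

# Example 5: the same words without any delimiting tags.
untagged_regex = [re.compile(r"[Ll]ines?")]


def is_target(token, regexes):
    """True if the whole token matches at least one of the regexes."""
    return any(rx.fullmatch(token) for rx in regexes)


for word in ("line", "Line", "lines", "Lines", "liner"):
    tagged = f"<head>{word}</head>"
    print(word,
          is_target(tagged, single_regex),
          is_target(tagged, split_regexes),
          is_target(word, untagged_regex))
```

The two tagged variants agree on every token in the loop, while the untagged regex of example 5 accepts the bare words and would therefore treat every occurrence in a context as a target.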

Other Options:

--help

Displays this message.

--version

Displays the version information.

OUTPUT

When --plain is not selected, the output is in Senseval-2 format and looks the same as the input SVAL2 file, except that the context of each instance shows at most W words on either side of the target word.

When --plain is selected, the output shows each context on a single line; that is, the context of the i-th instance in the given SVAL2 file is shown on the i-th line on stdout.
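The relationship between the two output formats can be sketched in Python by pulling each <context> body out of Senseval-2 text and printing it on one line. This is only an illustration: the function name contexts_as_plain_lines is hypothetical, the sample is abbreviated from the SYNOPSIS output, and whether windower.pl keeps the <head> tags in plain output is not stated on this page (the sketch keeps them).

```python
import re


def contexts_as_plain_lines(sval2_text):
    """Return one whitespace-normalized line per <context> element."""
    bodies = re.findall(r"<context>(.*?)</context>", sval2_text,
                        flags=re.DOTALL)
    # Collapse newlines and repeated spaces so each context fits on one line.
    return [" ".join(body.split()) for body in bodies]


sample = """<instance id="begin.555">
<context>
greats hardly knowns and unknowns <head>begin</head> a game three month season
</context>
</instance>
<instance id="begin.557">
<context>
late november it expects to <head>begin</head> construction by year end and
</context>
</instance>"""

for line in contexts_as_plain_lines(sample):
    print(line)
```

Because the contexts are returned in document order, the i-th printed line corresponds to the i-th instance, matching the ordering described above.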

AUTHORS

Amruta Purandare, University of Pittsburgh

Ted Pedersen, University of Minnesota, Duluth tpederse at d.umn.edu

COPYRIGHT

Copyright (c) 2002-2008, Amruta Purandare and Ted Pedersen

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to

The Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
