Ted Pedersen > Text-SenseClusters-1.03 > maketarget.pl

NAME

maketarget.pl - Create target.regex file for a given Senseval-2 data file that shows all the forms of the target word

SYNOPSIS

 maketarget.pl --head begin.v-test.xml

This creates a file called target.regex with the following contents:

 /<head>\s*(began)|(begin)|(beginning)|(begins)|(begun)\s*</head>/

 maketarget.pl begin.v-test.xml

This creates a file called target.regex with the following contents:

 /(\bbegan\b)|(\bbegin\b)|(\bbeginning\b)|(\bbegins\b)|(\bbegun\b)/

These regular expressions list all the forms of "begin" that appear in the given Senseval-2 data file, with and without surrounding head tags respectively.

You can find begin.v-test.xml in samples/Data.

Type maketarget.pl for a quick list of options.

DESCRIPTION

This program creates a Perl regex for the TARGET word by detecting its various forms from the given SVAL2 file.

This program will create a regular expression file called target.regex that can be used to match target words via the --target option in many SenseClusters programs. The target.regex file can be of two forms:

 /<head>\s*(target1|target2)\s*</head>/

or

 /(\btarget1\b)|(\btarget2\b)/

The first form is appropriate when the corpus already has the target word marked with head tags, while the second should be used when the corpus is plain unannotated text. The second form is the default, while the first is available with the --head option. Note that in the first form the <head> tags delimit the target word, while in the second form the \b word-boundary assertion serves that purpose.
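As a rough illustration of how the two forms behave, the following sketch builds both kinds of pattern from a small, hypothetical list of word forms and applies them to sample strings. The word list, the sample text, and the helper name matches are assumptions made for this example; they are not part of maketarget.pl.

```perl
use strict;
use warnings;

# Hypothetical word forms, for illustration only.
my @forms = qw(line lines lined);

# First form: all alternatives grouped between the head tags.
my $head_re = '<head>\s*(' . join('|', @forms) . ')\s*</head>';

# Second form: each alternative wrapped in \b word-boundary assertions.
my $plain_re = join('|', map { "(\\b$_\\b)" } @forms);

# Return 1 if the pattern matches the text, 0 otherwise.
sub matches {
    my ($re, $text) = @_;
    return $text =~ /$re/ ? 1 : 0;
}

print matches($head_re,  'a long <head> line </head> of cars'), "\n";  # 1
print matches($head_re,  'a long line of cars'), "\n";                 # 0
print matches($plain_re, 'a long line of cars'), "\n";                 # 1
print matches($plain_re, 'the outline was clear'), "\n";               # 0
```

The last call returns 0 because \b requires a word boundary before "line", and there is none inside "outline". In the first form, the alternatives share one pair of parentheses so that the head tags apply to every alternative; in Perl regexes, alternation binds more loosely than concatenation.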

INPUT

Required Arguments:

SVAL2

Should be a file in Senseval-2 format from which various possible forms of the TARGET word are to be detected.

Optional Arguments:

--head

Create target word regex in the form: <head>\s*(target1|target2)\s*</head>

--help

Displays a summary of the command line options.

--version

Displays the version information.

OUTPUT

maketarget.pl creates a file named target.regex that contains the Perl regex for the TARGET word. The regex is an OR of the detected forms of the word, placed within a single regex and optionally surrounded by <head> and </head> tags.

For example, the contents of a sample target.regex file:

 /(\bLine\b)|(\bLines\b)|(\bline\b)|(\blined\b)|(\blines\b)/ (default)

 /<head>\s*(Line)|(Lines)|(line)|(lined)|(lines)\s*</head>/ (with --head)
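The following sketch shows one way such an OR of forms could be assembled; it is not the code used by maketarget.pl. It assumes that each target word appears as a single token between head tags, and the sample contexts and the function names collect_forms and default_regex are hypothetical.

```perl
use strict;
use warnings;

# Return the distinct single-token forms found between head tags, sorted.
sub collect_forms {
    my @contexts = @_;
    my %seen;
    for my $context (@contexts) {
        while ($context =~ m{<head>\s*(\S+?)\s*</head>}g) {
            $seen{$1} = 1;
        }
    }
    return sort keys %seen;   # default sort places uppercase before lowercase
}

# Join the forms into the default (no --head) style of regex.
sub default_regex {
    my @forms = @_;
    return '/' . join('|', map { "(\\b$_\\b)" } @forms) . '/';
}

# Hypothetical contexts, not taken from a real Senseval-2 file.
my @contexts = (
    'They <head>lined</head> the road.',
    'Draw a <head> line </head> here.',
    'The <head>Lines</head> were long.',
    'A second <head>line</head> appeared.',
);

my @forms = collect_forms(@contexts);
print default_regex(@forms), "\n";
# prints /(\bLines\b)|(\bline\b)|(\blined\b)/
```

Forms are treated as case-sensitive here, which is why "Lines" is kept as a distinct form, as in the sample output above.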

BUGS

This program does not recognize target words of the form:

 <head> Bill Clinton </head>

It is restricted to target words that are a single string, such as

 <head> Bill_Clinton </head>
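One possible workaround, sketched below, is to preprocess the data so that multi-word targets are joined with underscores before running maketarget.pl. This is only an illustration under the assumption that such preprocessing suits your corpus; the function name join_head_tokens is hypothetical and is not part of SenseClusters.

```perl
use strict;
use warnings;

# Replace whitespace inside each head-tagged target with underscores.
sub join_head_tokens {
    my ($text) = @_;
    $text =~ s{<head>\s*(.*?)\s*</head>}{
        (my $target = $1) =~ s/\s+/_/g;   # copy the capture, then join its tokens
        "<head> $target </head>";
    }ge;
    return $text;
}

print join_head_tokens('President <head> Bill Clinton </head> spoke.'), "\n";
# prints: President <head> Bill_Clinton </head> spoke.
```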

AUTHORS

 Ted Pedersen, University of Minnesota, Duluth
 tpederse at d.umn.edu

 Amruta Purandare, University of Pittsburgh

 Anagha Kulkarni, Carnegie-Mellon University

COPYRIGHT

Copyright (c) 2002-2008, Ted Pedersen, Amruta Purandare, Anagha Kulkarni

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to

 The Free Software Foundation, Inc.,
 59 Temple Place - Suite 330,
 Boston, MA  02111-1307, USA.