NAME

prepare_sval2.pl - Makes sure Senseval-2 data is cleaned and has sense tags prior to invocation of SenseClusters

SYNOPSIS

 prepare_sval2.pl [Options] SOURCE

Here is an untagged Senseval-2 file:

 cat notags.txt

Output =>

 <corpus lang="english">
 <lexelt item="line">
 <instance id="0">
 <context>
 he played on the offensive <head>line</head> in college
 </context>
 </instance>
 <instance id="1">
 <context>
 i think the phone <head>line</head> is down
 </context>
 </instance>
 </lexelt>
 </corpus>

Here is a key file that contains sense tags for these instances:

 cat key.txt

Output =>

 <instance id="0"/> <sense id="formation"/>
 <instance id="1"/> <sense id="cable"/>

Now we can apply the tags in the key file to the previously untagged instances:

 prepare_sval2.pl notags.txt --key key.txt

Output =>

 <corpus lang="english" tagged="NO">
 <lexelt item="line">
 <instance id="0">
 <answer instance="0" senseid="formation"/>
 <context>
 he played on the offensive <head>line</head> in college
 </context>
 </instance>
 <instance id="1">
 <answer instance="1" senseid="cable"/>
 <context>
 i think the phone <head>line</head> is down
 </context>
 </instance>
 </lexelt>
 </corpus>

Type prepare_sval2.pl --help for a quick summary of options.

DESCRIPTION

This program prepares Senseval-2 data for SenseClusters experiments by making sure that all instances have sense tags. Sense tags can be applied from a separate key file, and any instance that still has no tag is given a NOTAG.

This program also deals with P tags that may exist in some Senseval data. The P tag indicates that the target word is a proper noun. In many cases P-tagged instances are omitted from experiments since they represent a different kind of sense. For example, if "bush" were the target word, some instances might refer to "George Bush", which may not be one of the senses we wish to evaluate.

Finally, this program can also deal with satellite tags that exist in some Senseval data. When the target word is a verb, it may have a satellite (particle) that we may or may not want to consider part of the target word. The satellite tags contain identifiers that may cause parsing trouble, so they are often removed.

INPUT

Required Arguments:

SOURCE

A Senseval-2 formatted data file to be prepared for SenseClusters experiments.

Optional Arguments:

--key KEY

Sense tagging mechanism in prepare_sval2.pl:

prepare_sval2.pl makes sure that every SOURCE instance carries at least one answer tag, either a sense tag or "NOTAG".

If sense tags are found in the SOURCE file itself, they are retained. If the SOURCE instances are not tagged, each instance is given either the sense tags listed for it in a separate KEY file or "NOTAG".

A KEY file containing the true answer keys of the SOURCE instances can be provided via the --key option.

The KEY file should be in SenseClusters format, showing

                <instance id="I"/>  [<sense id="S"/>]+

on each line, that is, an instance id followed by one or more of its true sense ids, all on a single line.
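The KEY line format above can be read with a couple of regular expressions. The following is a minimal Python sketch, not the program's own (Perl) code; the names parse_key_line and parse_key are hypothetical, and the decision to reject lines without an instance id or a sense id is an assumption made for illustration.

```python
import re

# Hypothetical helpers (not part of SenseClusters) for KEY lines of the form
#   <instance id="I"/> <sense id="S"/> [<sense id="S"/> ...]
INSTANCE_RE = re.compile(r'<instance\s+id="([^"]+)"\s*/>')
SENSE_RE = re.compile(r'<sense\s+id="([^"]+)"\s*/>')

def parse_key_line(line):
    """Return (instance_id, [sense_ids]); raise ValueError on a malformed line."""
    inst = INSTANCE_RE.search(line)
    if inst is None:
        raise ValueError(f"no instance id in KEY line: {line!r}")
    # Only look for sense ids after the instance tag.
    senses = SENSE_RE.findall(line, inst.end())
    if not senses:
        raise ValueError(f"no sense id in KEY line: {line!r}")
    return inst.group(1), senses

def parse_key(text):
    """Map each instance id to its list of sense ids, skipping blank lines."""
    return dict(parse_key_line(l) for l in text.splitlines() if l.strip())
```

Applied to the key.txt shown in the SYNOPSIS, parse_key would map "0" to ["formation"] and "1" to ["cable"].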

prepare_sval2.pl handles the following anomalies in SOURCE/KEY:

  1. If the first SOURCE instance is sense tagged, SOURCE is assumed to be sense tagged and the KEY file option is disabled. Any SOURCE instances that are not tagged are given "NOTAG", regardless of whether they have keys in the KEY file.
  2. If the first SOURCE instance is not sense tagged, SOURCE is assumed to be untagged, and an error is reported if any SOURCE instance is found to be sense tagged.
  3. If the first SOURCE instance is not sense tagged and has an entry in the KEY file, the KEY file is enabled and instances are given their answer keys from the KEY file. Any instance that has no answer key in the KEY file is given "NOTAG".
  4. If the first SOURCE instance is not sense tagged and has no entry in the KEY file, the KEY file is disabled and no instance receives a tag from it. All instances are given "NOTAG".
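The four rules above can be summarized as a small decision procedure. The sketch below is in Python and is only an illustration of the documented behaviour, not the program's actual implementation; the function name assign_tags and the representation of instances as (id, list of sense ids) pairs are assumptions.

```python
def assign_tags(instances, key=None):
    """Illustrative sketch of the documented SOURCE/KEY rules.

    instances: list of (instance_id, [sense_ids]) in SOURCE order.
    key: dict mapping instance_id -> [sense_ids], or None if --key is not given.
    Returns a list of (instance_id, [sense_ids]) in which every instance is tagged.
    """
    if not instances:
        return []
    first_id, first_tags = instances[0]
    source_tagged = bool(first_tags)
    # KEY is enabled only when SOURCE looks untagged and the first
    # instance has an entry in the KEY file (rules 3 and 4).
    use_key = (not source_tagged) and key is not None and first_id in key
    result = []
    for inst_id, tags in instances:
        if source_tagged:
            # Rule 1: keep existing tags; untagged instances get NOTAG.
            result.append((inst_id, list(tags) or ["NOTAG"]))
        elif tags:
            # Rule 2: a tagged instance in a SOURCE assumed untagged is an error.
            raise ValueError(f"instance {inst_id} is tagged, but SOURCE was assumed untagged")
        elif use_key:
            # Rule 3: take tags from KEY, or NOTAG if the instance has no entry.
            result.append((inst_id, list(key.get(inst_id, ["NOTAG"]))))
        else:
            # Rule 4: KEY disabled, everything gets NOTAG.
            result.append((inst_id, ["NOTAG"]))
    return result
```

Note that, as the rules state, the first instance alone decides whether SOURCE counts as tagged and whether the KEY file is used at all.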

--attachP

P tag handling mechanism in prepare_sval2.pl:

By default, prepare_sval2.pl removes sense tags whose value is P. According to the Senseval-2 standard, these are not true sense tags; they indicate that the target word is a proper noun.

The --attachP option instead attaches the P tag to the immediately following sense tag of the same instance.

For example, if --attachP is selected,

 <instance id="art.40012" docsrc="bnc_A0E_130">
 <answer instance="art.40012" senseid="P"/>
 <answer instance="art.40012" senseid="arts%1:09:00::"/>

will be modified to

 <instance id="art.40012" docsrc="bnc_A0E_130">
 <answer instance="art.40012" senseid="P_arts%1:09:00::"/>

If --attachP is not selected, the P tag is removed by default, giving

 <instance id="art.40012" docsrc="bnc_A0E_130">
 <answer instance="art.40012" senseid="arts%1:09:00::"/>
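The two behaviours shown above can be sketched on the list of sense ids of a single instance. This Python fragment is an illustration only; handle_p_tags is a hypothetical name, and dropping a P that has no following sense tag is an assumption, since the text above does not describe that case.

```python
def handle_p_tags(sense_ids, attach_p=False):
    """Illustrative sketch of P tag handling for one instance.

    sense_ids: the instance's sense ids in file order.
    attach_p: mimics the --attachP switch.
    """
    out = []
    pending_p = False
    for sid in sense_ids:
        if sid == "P":
            # Remember the P; it is never emitted as a tag of its own.
            pending_p = True
            continue
        if pending_p and attach_p:
            # --attachP: prefix the immediately following sense tag.
            sid = "P_" + sid
        pending_p = False
        out.append(sid)
    # Assumption: a trailing P with no following sense tag is simply dropped.
    return out
```

For the art.40012 instance above, the sketch yields ["P_arts%1:09:00::"] with attach_p=True and ["arts%1:09:00::"] otherwise.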

--modifysat

If selected, this switch removes the satellite ids from <head sats="ID"> and <sat id="ID"> tags, retaining the basic <head> and <sat> tag information.

For example, with --modifysat selected,

 Perhaps he 'd have <head sats="call_for.018:0">called</head> <sat
 id="call_for.018:0">for</sat> a decentralized political and economic
 system

will be transformed to

 perhaps he 'd have <head> called </head> <sat> for </sat> a 
 decentralized political and economic system

Without --modifysat, the satellite ids are retained. The lowercasing in this example comes from the default case conversion (see --nolc), not from --modifysat.
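The id removal itself amounts to two substitutions. The following Python sketch illustrates the idea under the assumption that the attributes appear exactly as in the example; strip_sat_ids is a hypothetical name, and the sketch does not reproduce the lowercasing or the extra spaces inside the tags seen in the program's output.

```python
import re

# Match the opening tags only; \s+ also covers a line break inside the tag,
# as in the "<sat" / "id=..." split shown in the example.
HEAD_SATS_RE = re.compile(r'<head\s+sats="[^"]*"\s*>')
SAT_ID_RE = re.compile(r'<sat\s+id="[^"]*"\s*>')

def strip_sat_ids(context):
    """Replace <head sats="..."> with <head> and <sat id="..."> with <sat>."""
    context = HEAD_SATS_RE.sub("<head>", context)
    return SAT_ID_RE.sub("<sat>", context)
```

Plain <head> tags without a sats attribute are left untouched by these patterns.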

--nolc

prepare_sval2.pl converts everything to lowercase by default. Select this switch to disable case conversion.

--help

Displays this message.

--version

Displays the version information.

OUTPUT

Output is a Senseval-2 file written to stdout.

AUTHORS

 Amruta Purandare, University of Pittsburgh

 Ted Pedersen, University of Minnesota, Duluth
 tpederse at d.umn.edu

COPYRIGHT

Copyright (c) 2002-2008, Amruta Purandare and Ted Pedersen

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to

 The Free Software Foundation, Inc.,
 59 Temple Place - Suite 330,
 Boston, MA  02111-1307, USA.