
NAME

filter.pl - Remove the instances of low frequency sense tags from a Senseval-2 data file

SYNOPSIS

 filter.pl [OPTIONS] DATA FREQUENCY_OUTPUT

Determine the distribution of senses in the given Senseval-2 input file

 frequency.pl begin.v-test.xml > freq-output

 cat freq-output

Output =>

 <sense id="begin%2:30:00::" percent="64.31"/>
 <sense id="begin%2:30:01::" percent="14.51"/>
 <sense id="begin%2:42:04::" percent="21.18"/>
 Total Instances = 255
 Total Distinct Senses=3
 Distribution={64.31,21.18,14.51}
 % of Majority Sense = 64.31

Remove any sense that occurs in less than 1% of the instances (the default; there are none in this data, so the frequency output is unchanged)

 filter.pl begin.v-test.xml freq-output >fil-output

 frequency.pl fil-output

Output =>

 <sense id="begin%2:30:00::" percent="64.31"/>
 <sense id="begin%2:30:01::" percent="14.51"/>
 <sense id="begin%2:42:04::" percent="21.18"/>
 Total Instances = 255
 Total Distinct Senses=3
 Distribution={64.31,21.18,14.51}
 % of Majority Sense = 64.31

Keep only the top 2 ranked (most frequent) senses

 filter.pl --rank 2 begin.v-test.xml freq-output > fil-output

 frequency.pl fil-output

Output =>

 <sense id="begin%2:30:00::" percent="75.23"/>
 <sense id="begin%2:42:04::" percent="24.77"/>
 Total Instances = 218
 Total Distinct Senses=2
 Distribution={75.23,24.77}
 % of Majority Sense = 75.23
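The change from the first output to the one above is consistent with dropping the instances of the removed sense and recomputing the percentages over what remains. The following Python sketch reproduces that arithmetic. The per-sense counts (164, 37, 54) are not printed by the tool; they are inferred here from the reported percentages and the total of 255, and the sketch assumes every instance carries exactly one sense tag.

```python
# Instance counts inferred from the percentages above (255 instances in total).
# Assumption: every instance carries exactly one sense tag.
counts = {
    "begin%2:30:00::": 164,
    "begin%2:30:01::": 37,
    "begin%2:42:04::": 54,
}

def distribution(counts):
    """Return each sense's share of the instances as a percentage (2 decimals)."""
    total = sum(counts.values())
    return {sense: round(100.0 * n / total, 2) for sense, n in counts.items()}

before = distribution(counts)

# Keep the two most frequent senses, as --rank 2 does in the example.
top2 = dict(sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:2])
after = distribution(top2)

print(before)                      # shares over all 255 instances
print(sum(top2.values()), after)   # remaining instances and their new shares
```

Under these assumptions the remaining 218 instances split 75.23% / 24.77%, matching the output shown.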

Keep all senses that occur in at least 20% of the instances in the original data

 filter.pl --percent 20 begin.v-test.xml freq-output > fil-output

 frequency.pl fil-output

Output =>

 <sense id="begin%2:30:00::" percent="75.23"/>
 <sense id="begin%2:42:04::" percent="24.77"/>
 Total Instances = 218
 Total Distinct Senses=2
 Distribution={75.23,24.77}
 % of Majority Sense = 75.23

You can find begin.v-test.xml in samples/Data

Type filter.pl --help for a quick summary of available options.

DESCRIPTION

This program removes low-frequency sense tags from a Senseval-2 data set according to a percentage or rank threshold. By default it removes any sense tag associated with less than 1% of the total instances. Output goes to STDOUT, so the original input data file is unchanged.

INPUT

Required Arguments:

filter.pl requires two arguments:

DATA

Senseval-2 formatted data file that is to be filtered.

FREQUENCY_OUTPUT

Output of the program frequency.pl (part of this package) that gives the percentage frequency of each sense tag appearing in DATA. FREQUENCY_OUTPUT must be created by running frequency.pl on the same DATA file that is given to filter.pl.

It should contain tags of the form

       <sense id="S" percent="P"/>

where P is the percentage of instances in the DATA file that carry sense tag S.
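As an illustration of this format, here is a minimal Python sketch that reads the sense tags into a dictionary. It is not how filter.pl itself parses the file; the function name read_frequencies and the regular expression are assumptions made for this example, and lines that are not sense tags (such as the summary lines) are skipped.

```python
import re

# Matches lines such as: <sense id="begin%2:30:00::" percent="64.31"/>
SENSE_LINE = re.compile(r'<sense\s+id="([^"]+)"\s+percent="([^"]+)"\s*/>')

def read_frequencies(lines):
    """Map each sense id to its percentage; lines without a sense tag are skipped."""
    freq = {}
    for line in lines:
        match = SENSE_LINE.search(line)
        if match:
            freq[match.group(1)] = float(match.group(2))
    return freq

# Sample taken from the SYNOPSIS output.
sample = [
    ' <sense id="begin%2:30:00::" percent="64.31"/>',
    ' <sense id="begin%2:30:01::" percent="14.51"/>',
    ' <sense id="begin%2:42:04::" percent="21.18"/>',
    ' Total Instances = 255',
]
freq = read_frequencies(sample)
print(freq)
```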

Optional Arguments:

Filter Options:

--percent P

Specifies the percentage cutoff for filtering. When --percent is given, filter.pl removes all sense tags whose frequency in FREQUENCY_OUTPUT is below P%. A DATA instance whose sense tags are all below P% is removed. In other words, only those DATA instances are retained that have at least one sense tag with a frequency greater than or equal to P%.
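The percentage rule can be sketched in Python as follows. This is an illustration of the rule as described, not the tool's own code: the function name filter_by_percent, the (instance id, tag list) representation, and the instance ids i1 to i3 are all hypothetical, and senses missing from the frequency table are assumed to count as 0%.

```python
def filter_by_percent(instances, freq, cutoff=1.0):
    """Keep sense tags with frequency >= cutoff; drop instances left with no tags.

    instances: list of (instance_id, [sense tags]) pairs (hypothetical layout).
    freq:      dict mapping sense tag -> percentage from FREQUENCY_OUTPUT.
    """
    kept = []
    for inst_id, tags in instances:
        # Assumption: a tag absent from freq is treated as having 0%.
        surviving = [t for t in tags if freq.get(t, 0.0) >= cutoff]
        if surviving:
            kept.append((inst_id, surviving))
    return kept

freq = {"begin%2:30:00::": 64.31, "begin%2:30:01::": 14.51, "begin%2:42:04::": 21.18}
instances = [
    ("i1", ["begin%2:30:00::"]),
    ("i2", ["begin%2:30:01::"]),                     # only a low-frequency tag
    ("i3", ["begin%2:30:01::", "begin%2:42:04::"]),  # one tag survives
]
result = filter_by_percent(instances, freq, cutoff=20)
print(result)
```

With a cutoff of 20, instance i2 is dropped entirely, while i3 is retained with only its surviving tag.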

--rank R

Specifies the rank cutoff for filtering. When --rank is given, filter.pl removes those sense tags that are ranked below R when senses are ordered by percentage, most frequent first. A DATA instance whose sense tags are all ranked below R is removed. In other words, only those DATA instances are retained that have at least one sense tag among the R most frequent senses.
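A comparable Python sketch for the rank rule is given below. The names top_ranked and filter_by_rank and the instance ids are hypothetical. How filter.pl orders senses with equal percentages is not documented here, so the tie-breaking in this sketch (order of appearance in the frequency table) is an assumption.

```python
def top_ranked(freq, rank):
    """Return the set of the `rank` most frequent sense tags.

    Assumption: ties keep their order of appearance in freq (sorted() is stable).
    """
    ordered = sorted(freq, key=freq.get, reverse=True)
    return set(ordered[:rank])

def filter_by_rank(instances, freq, rank):
    """Keep only tags among the top `rank` senses; drop instances left with no tags."""
    allowed = top_ranked(freq, rank)
    kept = []
    for inst_id, tags in instances:
        surviving = [t for t in tags if t in allowed]
        if surviving:
            kept.append((inst_id, surviving))
    return kept

freq = {"begin%2:30:00::": 64.31, "begin%2:30:01::": 14.51, "begin%2:42:04::": 21.18}
instances = [
    ("i1", ["begin%2:30:00::"]),
    ("i2", ["begin%2:30:01::"]),
    ("i3", ["begin%2:42:04::"]),
]
result = filter_by_rank(instances, freq, rank=2)
print(result)
```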

filter.pl allows only one of the above filter conditions to be specified.

If neither filter option is specified, the default condition P = 1 is used, and filter.pl removes sense tags with a frequency of less than 1%.

--nomulti

Removes multiple sense tags attached to an instance such that each instance is tagged with the most frequent sense tag among the tags attached to it.
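The selection that --nomulti describes can be sketched in Python as picking, for each instance, the attached tag with the highest percentage. The function name most_frequent_tag is hypothetical, and the handling of equal percentages (the first such tag wins) is an assumption of this sketch, not documented behavior.

```python
def most_frequent_tag(tags, freq):
    """Return the attached tag with the highest percentage in freq.

    Assumptions: tags missing from freq count as 0%, and on a tie the
    first tag in the list wins (the behavior of max()).
    """
    return max(tags, key=lambda t: freq.get(t, 0.0))

freq = {"begin%2:30:00::": 64.31, "begin%2:30:01::": 14.51, "begin%2:42:04::": 21.18}
chosen = most_frequent_tag(["begin%2:30:01::", "begin%2:42:04::"], freq)
print(chosen)
```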

Other Options:

--count COUNT

Filters the corresponding COUNT file created by preprocess.pl along with the DATA file. The COUNT file is filtered so that it stays consistent with the filtered DATA: it contains only those instances left after filtering, in the same order as they appear in the output.

The filtered COUNT data is written to the file COUNT.filtered. The ith line of COUNT.filtered holds the instance data found within the <context> and </context> tags of the ith instance in the output of filter.pl.
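The consistency requirement can be pictured with a short Python sketch. It assumes that the unfiltered COUNT file likewise has one line per DATA instance, in DATA order; the function name filter_count_lines and the sample lines are hypothetical.

```python
def filter_count_lines(count_lines, kept_flags):
    """Return the COUNT lines whose corresponding DATA instance was retained.

    Assumption: count_lines[i] belongs to the i-th instance of DATA, and
    kept_flags[i] is True when that instance survives filtering.
    """
    if len(count_lines) != len(kept_flags):
        raise ValueError("COUNT and DATA must describe the same instances")
    return [line for line, keep in zip(count_lines, kept_flags) if keep]

count_lines = ["context of instance 1", "context of instance 2", "context of instance 3"]
kept_flags = [True, False, True]   # the second instance was removed by the filter
filtered = filter_count_lines(count_lines, kept_flags)
print(filtered)
```

Because lines are selected by position, the relative order of the surviving instances is preserved.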

--help

Displays this message.

--version

Displays the version information.

OUTPUT

Output is a sense-filtered Senseval-2 file containing only those DATA instances that have at least one sense tag left after filtering.

AUTHORS

 Ted Pedersen, University of Minnesota, Duluth
 tpederse at d.umn.edu

 Amruta Purandare, University of Pittsburgh

COPYRIGHT

Copyright (c) 2002-2008, Amruta Purandare and Ted Pedersen

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to

 The Free Software Foundation, Inc.,
 59 Temple Place - Suite 330,
 Boston, MA  02111-1307, USA.