NAME

frequency.pl - Compute the distribution of senses in a Senseval-2 data file

SYNOPSIS

 frequency.pl [OPTIONS] SOURCE

The sample file begin.v-test.xml can be found in samples/Data.

 frequency.pl begin.v-test.xml

Output =>

 <sense id="begin%2:30:00::" percent="64.31"/>
 <sense id="begin%2:30:01::" percent="14.51"/>
 <sense id="begin%2:42:04::" percent="21.18"/>
 Total Instances = 255
 Total Distinct Senses=3
 Distribution={64.31,21.18,14.51}
 % of Majority Sense = 64.31

Type frequency.pl --help for a quick summary of options.

DESCRIPTION

Prints the distribution of senses in a given Senseval-2 file to STDOUT. This information helps in understanding the data and in deciding whether to filter low-frequency senses (using filter.pl) or to balance the distribution of senses (using balance.pl).

INPUT

Required Arguments:

SOURCE

SOURCE should be a Senseval-2 formatted file. Sense ids are found by matching the regex /sense\s*id="S"/.

An instance with multiple sense ids should appear only once, with multiple <answer> tags. For example, if an instance IID has two sense ids, SID1 and SID2, then instance IID should be formatted in the SOURCE file as:

 <instance id="IID"> 
 <answer instance="IID" senseid="SID1"/>
 <answer instance="IID" senseid="SID2"/>
 <context>
        Context Data comes here ....
 </context>
 </instance>
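
As an illustration of how this layout can be read, the following minimal sketch collects the sense ids attached to each instance. It is written in Python for brevity and is not the script's Perl implementation; the function name senses_by_instance and both patterns are assumptions modeled on the regexes quoted in this document.

```python
import re

# Assumed patterns, modeled on the regexes quoted in this document;
# the patterns used by frequency.pl itself may differ.
INSTANCE_RE = re.compile(r'<instance id="([^"]+)"')
SENSE_RE = re.compile(r'sense\s*id="([^"]+)"')


def senses_by_instance(text):
    """Map each instance id to the list of sense ids tagged on it."""
    result = {}
    current = None
    for line in text.splitlines():
        match = INSTANCE_RE.search(line)
        if match:
            # A new <instance> tag starts; later <answer> tags belong to it.
            current = match.group(1)
            result.setdefault(current, [])
        if current is not None:
            result[current].extend(SENSE_RE.findall(line))
    return result


sample = """<instance id="IID">
<answer instance="IID" senseid="SID1"/>
<answer instance="IID" senseid="SID2"/>
<context>
Context Data comes here ....
</context>
</instance>"""

print(senses_by_instance(sample))  # {'IID': ['SID1', 'SID2']}
```

Because both <answer> tags sit inside a single <instance> element, the instance is counted once while each of its sense ids is still seen.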

Optional Arguments:

--help

Displays this message.

--version

Displays the version information.

OUTPUT

The output displays:

1. Total number of instances in SOURCE

Instances are counted by matching the regex /instance id=\"ID\"/ and counting the unique instance ids.

2. Total number of distinct sense tags found in SOURCE

Sense tags are found by matching the regex /sense\s*id="S"/.

3. Sense Distribution

Output shows

<sense id="S" percent="P"/>

for each sense id found in SOURCE. P is the percentage frequency of the sense S.

4. % of Majority sense

This will be the highest sense percentage found in SOURCE.
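
The four output items above can be sketched as follows. This is an illustrative Python sketch, not the script's Perl code. The function name sense_report and the patterns are hypothetical, and it assumes that each percentage is the sense's share of all sense tags found, which this document does not state explicitly (the two readings coincide when every instance carries exactly one sense).

```python
import re
from collections import Counter

# Assumed patterns, modeled on the regexes quoted in this document.
INSTANCE_RE = re.compile(r'<instance id="([^"]+)"')
SENSE_RE = re.compile(r'sense\s*id="([^"]+)"')


def sense_report(text):
    """Return report lines in the style of the output described above."""
    instances = set(INSTANCE_RE.findall(text))   # unique instance ids
    counts = Counter(SENSE_RE.findall(text))     # occurrences per sense id
    if not counts:
        raise ValueError("no sense tags found in input")
    total_tags = sum(counts.values())
    # Assumption: percent = share of all sense tags, not of instances.
    percents = {s: 100.0 * n / total_tags for s, n in counts.items()}
    lines = ['<sense id="%s" percent="%.2f"/>' % (s, percents[s])
             for s in sorted(percents)]
    # The distribution is listed from most to least frequent sense.
    dist = sorted(percents.values(), reverse=True)
    lines.append("Total Instances = %d" % len(instances))
    lines.append("Total Distinct Senses=%d" % len(counts))
    lines.append("Distribution={%s}" % ",".join("%.2f" % p for p in dist))
    lines.append("%% of Majority Sense = %.2f" % dist[0])
    return lines


# Four made-up instances: three tagged with sense A, one with sense B.
sample = "\n".join(
    '<instance id="i%d">\n<answer instance="i%d" senseid="%s"/>' % (i, i, s)
    for i, s in enumerate(["A", "A", "B", "A"])
)
print("\n".join(sense_report(sample)))
```

For the four made-up instances, the sketch reports senses A and B at 75.00 and 25.00 percent, so the majority sense percentage is 75.00.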

Sample Output

 <sense id="begin%2:30:00::" percent="59.49"/>
 <sense id="begin%2:30:01::" percent="13.38"/>
 <sense id="begin%2:42:00::" percent="4.70"/>
 <sense id="begin%2:42:03::" percent="3.44"/>
 <sense id="begin%2:42:04::" percent="18.99"/>
 Total Instances = 548
 Total Distinct Senses=5
 Distribution={59.49,18.99,13.38,4.70,3.44}
 % of Majority Sense = 59.49

This shows that there are 548 instances in total and 5 distinct senses.

The senses are distributed with frequencies

{59.49,18.99,13.38,4.70,3.44}

where the majority sense has a frequency of 59.49.

The <sense> tags show the percentage frequency of each individual sense.

AUTHORS

 Ted Pedersen, University of Minnesota, Duluth
 tpederse at d.umn.edu

 Amruta Purandare,  University of Pittsburgh 

COPYRIGHT

Copyright (c) 2002-2008, Amruta Purandare and Ted Pedersen

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to:

 The Free Software Foundation, Inc.,
 59 Temple Place - Suite 330,
 Boston, MA  02111-1307, USA.