
NAME

reduce-count.pl - Reduce size of feature space by removing words not in evaluation data

SYNOPSIS

reduce-count.pl [OPTIONS] BIGRAM UNIGRAM

The features found in training data are defined in a bigram file 'bigram':

 cat bigram

Output =>

 1491
 at<>least<>3 7 3
 co<>occurrences<>3 5 3
 be<>a<>3 13 25
 General<>Public<>3 3 3
 of<>test<>3 41 9
 file<>bigfile<>3 26 6
 given<>set<>3 7 5

The unigrams that occur in the evaluation data are defined in a file 'unigram':

 cat unigram

Output =>

 at<>
 be<>
 test<>

Now remove any bigram that does not contain at least one of the words in the unigram file:

 reduce-count.pl bigram unigram

Output =>

 1491
 at<>least<>3 7 3
 be<>a<>3 13 25
 of<>test<>3 41 9

Type reduce-count.pl for a quick summary of options.

DESCRIPTION

This program removes all bigrams from the given BIGRAM file that do not include at least one constituent word from the UNIGRAM file. Note that this can also be applied to a co-occurrence file.

The intent of this in SenseClusters is to allow a user to significantly reduce the number of features in a BIGRAM file by keeping only those that contain at least one word from the test data. In this use case, a user would have a BIGRAM file of features found in training data, and a UNIGRAM file drawn from a set of test data. The intuition is that a bigram made up of words that do not occur in the test data will not be needed when creating representations of the test data for clustering.

Note that this program DOES NOT adjust any of the counts found in the BIGRAM file. Its intent is to make it possible to reduce the number of features that must be searched through during feature matching, for example; it is not intended to adjust the sample size or the counts of bigrams. The counts remain the same, so any conclusions drawn from statistic.pl, for example, are not affected by this program.
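The filtering rule described above can be sketched in a few lines of Perl. This is a simplified illustration, not the program's actual source; the subroutine name reduce_bigrams and the in-memory data are assumptions for the example. The first line of a bigram file (the sample size) is always kept, and a bigram line survives only if at least one of its two words appears in the unigram list.

```perl
use strict;
use warnings;

# Sketch of the filtering rule (illustrative, not the real source):
# keep a bigram line only if at least one of its constituent words
# appears in the unigram list. The first line is the sample size.
sub reduce_bigrams {
    my ($bigram_lines, $unigram_lines) = @_;
    my %keep;
    for my $u (@$unigram_lines) {
        my ($word) = split /<>/, $u;    # lines look like "word<>" or "word<>n"
        $keep{$word} = 1 if defined $word && length $word;
    }
    my @out = ($bigram_lines->[0]);     # sample size line is always kept
    for my $i (1 .. $#$bigram_lines) {
        my ($w1, $w2) = split /<>/, $bigram_lines->[$i];
        push @out, $bigram_lines->[$i]
            if defined $w2 && ($keep{$w1} || $keep{$w2});
    }
    return @out;
}

my @bigrams  = ('1491', 'at<>least<>3 7 3', 'General<>Public<>3 3 3',
                'be<>a<>3 13 25', 'of<>test<>3 41 9');
my @unigrams = ('at<>', 'be<>', 'test<>');
print "$_\n" for reduce_bigrams(\@bigrams, \@unigrams);
```

Run against the SYNOPSIS data above, this keeps the sample size line and the three bigrams containing 'at', 'be', or 'test', and drops General<>Public.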

INPUT

Required Arguments:

BIGRAM

Should be a bigram file created by one of the NSP programs count.pl, statistic.pl, or combig.pl. Each line containing a word pair (a bigram or co-occurrence) should be in one of the following forms:

        word1<>word2<>n11 n1p np1       (if created by count or combig)

        word1<>word2<>rank score n11 n1p np1 (if created by statistic)

Any line not formatted as above is displayed as-is, on the assumption that it was printed by the --extended option in NSP.
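The two line forms above can be told apart by their trailing fields: count.pl/combig.pl output ends in the three numbers n11 n1p np1, while statistic.pl output prefixes those with a rank and a score. The regular expressions below are an illustration of that distinction, not patterns taken from the program's source.

```perl
use strict;
use warnings;

# Illustrative classifier (not from the program source) for the two
# bigram line forms. Count/combig lines end in three counts; statistic
# lines carry a rank and a score before those same three counts.
sub bigram_line_type {
    my ($line) = @_;
    return 'count'     if $line =~ /^\S+<>\S+<>\d+ \d+ \d+\s*$/;
    return 'statistic' if $line =~ /^\S+<>\S+<>\d+ \S+ \d+ \d+ \d+\s*$/;
    return 'other';    # e.g. a sample size line, passed through unchanged
}

print bigram_line_type('at<>least<>3 7 3'),        "\n";
print bigram_line_type('at<>least<>1 12.5 3 7 3'), "\n";
print bigram_line_type('1491'),                    "\n";
</imports>
```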

UNIGRAM

Should be unigram output from NSP. Each line in the UNIGRAM file should be in one of the following forms:

        word<>

        word<>n

where n is the frequency count of the word.

Optional Arguments:

--help

Displays the summary of command line options.

--version

Displays the version information.

OUTPUT

reduce-count.pl displays every line in the given BIGRAM file except those that have one of the following forms:

        word1<>word2<>n11 n1p np1

        word1<>word2<>rank score n11 n1p np1 

where neither word1 nor word2 is listed in the given UNIGRAM file.

SYSTEM REQUIREMENTS

BUGS

This program is very conservative in what it removes from a given set of input bigrams. Just because a unigram occurs in the test data does not mean that every bigram containing it must also occur, so reduce-count.pl very likely leaves in place some bigrams that do not occur in a given set of test data. However, it can be applied equally well to bigrams, co-occurrences, and target co-occurrences, since it only looks for a unigram and not an exact match between bigrams (where order matters).

It must also be remembered that this program was originally intended for use with huge-count.pl, a program from the Ngram Statistics Package that calculates count information on very large corpora. So if you have 1,000,000 different bigrams, then removing all of those that do not contain any word from a much smaller set of unigrams drawn from the test data can make a very large difference.

It seems like it should be possible to be more aggressive and find, for example, the intersection of the bigrams found in the training and test data, and reset the features to that intersection.

It also seems like it might be useful to reduce the number of unigram features in the same way.

This might be replaceable with a simple program that finds the intersection of features observed in training data and those that actually exist in the test data.

AUTHORS

Amruta Purandare, University of Pittsburgh

Ted Pedersen, University of Minnesota, Duluth tpederse at d.umn.edu

COPYRIGHT

Copyright (c) 2002-2008, Amruta Purandare and Ted Pedersen

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to

 The Free Software Foundation, Inc.,
 59 Temple Place - Suite 330,
 Boston, MA  02111-1307, USA.
