#!/usr/local/bin/perl -w
=head1 NAME
huge-count.pl - Count all the bigrams in a huge text without using huge amounts of memory.
=head1 SYNOPSIS
huge-count.pl --tokenlist --split 100 destination-dir input
=head1 DESCRIPTION
Runs count.pl efficiently on large amounts of data by splitting the data into separate files, counting each file separately, and then merging the counts to get overall results.
Two output files are created. destination-dir/huge-count.output contains
the bigram counts after applying --remove and --uremove.
destination-dir/complete-huge-count.output provides the bigram counts as
if no --uremove or --remove cutoff were provided.
=head1 USAGE
huge-count.pl [OPTIONS] DESTINATION [SOURCE]+
=head1 INPUT
=head2 Required Arguments:
=head3 [SOURCE]+
Input to huge-count.pl should be one of the following:
=over
=item 1. Single plain text file
Or
=item 2. Single flat directory containing multiple plain text files
Or
=item 3. List of multiple plain text files
=back
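The three accepted forms of SOURCE can be told apart with Perl's file test operators. The following is only a minimal sketch of that idea, not the code used by huge-count.pl; the helper name classify_source and its return values are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: decide which SOURCE form was given.
# @sources holds every command line argument after DESTINATION.
sub classify_source {
    my @sources = @_;
    return 'none'      if @sources == 0;
    return 'directory' if @sources == 1 && -d $sources[0];
    return 'file'      if @sources == 1 && -f $sources[0];

    # With several arguments, every one of them must be a plain file.
    my @bad = grep { !(-f $_) } @sources;
    return 'list' if @sources > 1 && !@bad;
    return 'invalid';
}
```

A caller could dispatch on the returned string and print the usage message for 'none' or 'invalid'.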
=head3 DESTINATION
A complete path to a writable directory to which huge-count.pl can write all
intermediate and final output files. If DESTINATION does not exist,
a new directory is created; otherwise, the existing directory is simply used
for writing the output files.
NOTE: If DESTINATION already exists and the names of some of the existing
files in DESTINATION clash with the names of the output files created by
huge-count, these files will be overwritten without prompting the user.
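The handling of DESTINATION described above can be sketched with Perl's built-in mkdir and file tests. This is an illustration under assumptions rather than the script's own code; the helper name prepare_destination and its return values are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: make sure $dir can be used as DESTINATION.
# Returns 'existing' if the directory was already there, 'created' otherwise.
sub prepare_destination {
    my ($dir) = @_;
    if (-e $dir) {
        # Something with this name exists; it must be a directory.
        die "$dir is not a directory\n" unless -d $dir;
        return 'existing';
    }
    mkdir($dir) or die "cannot create $dir: $!\n";
    return 'created';
}
```

Calling mkdir directly avoids starting a shell, so directory names containing spaces need no quoting.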
=head3 --tokenlist
This parameter is required. huge-count will call count.pl with --tokenlist,
which prints out every bigram that count.pl finds.
=head2 Optional Arguments:
=head4 --split N
If --split is given, huge-count divides the bigram tokenlist generated by
count.pl into parts, sorts each part, and recombines the bigram counts from
all of these intermediate result files into a single bigram output that shows
the bigram counts in SOURCE. If --split is omitted, a warning is printed and
the tokenlist is sorted without being split.
Each part created with --split N will contain N lines. The value of N should
be chosen such that huge-sort.pl can be run efficiently on any part containing
N lines of the file that lists all bigrams.
We suggest setting N to the number of KB of memory you have. If the
computer has 8 GB of RAM, which is 8,000,000 KB, N should be set to 8000000. If
N is set too small, the split step can run out of output file suffixes.
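To make the effect of --split N concrete, here is a minimal in-memory sketch of cutting a list of tokenlist lines into parts of at most N lines. It is not the code of huge-split.pl, which works on files; the function name split_into_parts and the sample line format are assumptions made for illustration.

```perl
use strict;
use warnings;

# Split a list of lines into parts of at most $n lines each.
# Returns a list of array references, one per part.
sub split_into_parts {
    my ($n, @lines) = @_;
    die "N must be a positive integer\n" unless $n =~ /^\d+$/ && $n > 0;
    my @parts;
    while (@lines) {
        # splice removes and returns the next (up to) $n lines.
        push @parts, [ splice(@lines, 0, $n) ];
    }
    return @parts;
}

my @parts = split_into_parts(2, 'a<>b<>', 'b<>c<>', 'a<>b<>', 'c<>d<>', 'd<>e<>');
printf "%d parts, last part has %d line(s)\n",
    scalar(@parts), scalar(@{ $parts[-1] });
```

Five lines with N set to 2 give three parts, the last of which holds the single leftover line. A smaller N therefore means more parts, which is why a very small N can exhaust the available part names.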
=head4 --token TOKENFILE
Specify a file containing Perl regular expressions that define the tokenization
scheme for counting. This will be provided to count.pl's --token option.
=head4 --nontoken NOTOKENFILE
Specify a file containing Perl regular expressions of non-token sequences
that are removed prior to tokenization. This will be provided to
count.pl's --nontoken option.
=head4 --stop STOPFILE
Specify a file of Perl regexes that define the stop words to be
omitted from the output BIGRAMS. The stop list can be used in two modes:
AND mode, declared with '@stop.mode = AND' on the 1st line of the STOPFILE
or
OR mode, declared with '@stop.mode = OR' on the 1st line of the STOPFILE.
In AND mode, bigrams in which both constituent words are stop words are
removed, while in OR mode, bigrams in which either or both constituent words
are stop words are removed from the output.
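The difference between the two stop modes can be sketched as follows. This is a simplified illustration: a real STOPFILE holds Perl regexes, whereas this sketch assumes a plain hash of literal stop words, and the function name is_stopped is hypothetical.

```perl
use strict;
use warnings;

# Return true if the bigram ($w1, $w2) should be dropped.
# $mode is 'AND' or 'OR'; $stop is a reference to a hash of stop words.
sub is_stopped {
    my ($mode, $stop, $w1, $w2) = @_;
    my $s1 = exists $stop->{$w1};
    my $s2 = exists $stop->{$w2};
    # AND: both words must be stop words. OR: one stop word is enough.
    return $mode eq 'AND' ? ($s1 && $s2) : ($s1 || $s2);
}

my %stop = map { $_ => 1 } qw(the of);
for my $mode (qw(AND OR)) {
    for my $pair ([qw(of the)], [qw(of cat)], [qw(black cat)]) {
        my $verdict = is_stopped($mode, \%stop, @$pair) ? 'removed' : 'kept';
        print "$mode: @$pair -> $verdict\n";
    }
}
```

In this sketch "of cat" is kept in AND mode but removed in OR mode, while "black cat" is kept in both.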
=head4 --window W
Tokens appearing within W positions from each other (with at most W-2
intervening words) will form bigrams. Same as count.pl's --window option.
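As an illustration of the window definition above, the following sketch pairs each token with the tokens that follow it at a distance of at most W-1 positions. It is not count.pl's implementation; the function name window_bigrams is hypothetical and the "<>" separators are used only for readability.

```perl
use strict;
use warnings;

# Form all ordered bigrams whose two tokens lie within a window of $w
# positions, i.e. with at most $w - 2 tokens between them.
sub window_bigrams {
    my ($w, @tokens) = @_;
    my @bigrams;
    for my $i (0 .. $#tokens) {
        my $last = $i + $w - 1;
        $last = $#tokens if $last > $#tokens;
        for my $j ($i + 1 .. $last) {
            push @bigrams, "$tokens[$i]<>$tokens[$j]<>";
        }
    }
    return @bigrams;
}

print join(' ', window_bigrams(2, qw(a b c))), "\n";
print join(' ', window_bigrams(3, qw(a b c))), "\n";
```

With a window of 2 only adjacent tokens are paired; with a window of 3 the pair a, c is formed as well.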
=head4 --remove L
Bigrams with counts less than L in the entire SOURCE data are removed from
the sample. The counts of the removed bigrams are not counted in any
marginal totals. This has the same effect as count.pl's --remove option.
=head4 --uremove L
Bigrams with counts more than L in the entire SOURCE data are removed from
the sample. The counts of the removed bigrams are not counted in any
marginal totals. This has the same effect as count.pl's --uremove option.
=head4 --frequency F
Bigrams with counts less than F in the entire SOURCE are not displayed.
The counts of the skipped bigrams ARE counted in the marginal totals. In other
words, --frequency in huge-count.pl has the same effect as count.pl's
--frequency option.
=head4 --ufrequency F
Bigrams with counts more than F in the entire SOURCE are not displayed.
The counts of the skipped bigrams ARE counted in the marginal totals. In other
words, --ufrequency in huge-count.pl has the same effect as count.pl's
--ufrequency option.
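The practical difference between the --remove/--uremove cutoffs and the --frequency/--ufrequency cutoffs is whether the total changes. The sketch below illustrates this on an in-memory hash. It is a simplification and not the code of huge-delete.pl: the function name filter_counts is hypothetical, and the only "marginal total" it tracks is the overall bigram total.

```perl
use strict;
use warnings;

# $counts maps bigram => count. Returns the total after removal and a
# reference to the hash of bigrams that would actually be displayed.
sub filter_counts {
    my ($counts, %opt) = @_;

    # --remove / --uremove: drop bigrams from the sample itself.
    my %sample = %$counts;
    for my $bigram (keys %sample) {
        my $c = $sample{$bigram};
        delete $sample{$bigram}
            if (defined $opt{remove}  && $c < $opt{remove})
            || (defined $opt{uremove} && $c > $opt{uremove});
    }

    # The total is taken after removal, so removed bigrams do not count.
    my $total = 0;
    $total += $_ for values %sample;

    # --frequency / --ufrequency: hide bigrams, but leave the total alone.
    my %shown = %sample;
    for my $bigram (keys %shown) {
        my $c = $shown{$bigram};
        delete $shown{$bigram}
            if (defined $opt{frequency}  && $c < $opt{frequency})
            || (defined $opt{ufrequency} && $c > $opt{ufrequency});
    }
    return ($total, \%shown);
}

my %counts = ('a<>b<>' => 5, 'b<>c<>' => 1, 'c<>d<>' => 10);
my ($total, $shown) = filter_counts(\%counts, remove => 2, frequency => 6);
print "total=$total shown=", join(' ', sort keys %$shown), "\n";
```

Here the bigram with count 1 is removed and lowers the total to 15, while the bigram with count 5 is only hidden and still contributes to that total.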
=head4 --newLine
Switches ON the --newLine option in count.pl. This will prevent bigrams from
spanning across lines.
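The effect of --newLine can be illustrated with a small sketch that forms adjacent bigrams either over the whole text or line by line. It assumes simple whitespace tokenization, which is not necessarily what count.pl does by default, and the function names are hypothetical.

```perl
use strict;
use warnings;

# Adjacent bigrams from a list of tokens.
sub adjacent_bigrams {
    my @t = @_;
    return map { $t[$_] . '<>' . $t[$_ + 1] . '<>' } 0 .. $#t - 1;
}

# Bigrams of a text; if $new_line is true, each line is handled on its own,
# so no bigram can span a line break.
sub text_bigrams {
    my ($text, $new_line) = @_;
    my @units = $new_line ? split(/\n/, $text) : ($text);
    return map { adjacent_bigrams(split ' ', $_) } @units;
}

my $text = "the black cat\nsat down\n";
my @spanning = text_bigrams($text, 0);
my @per_line = text_bigrams($text, 1);
print scalar(@spanning), " bigrams without --newLine\n";
print scalar(@per_line), " bigrams with --newLine\n";
```

The bigram formed by "cat" and "sat" appears only in the first case, because those two tokens sit on different lines.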
=head3 Other Options :
=head4 --help
Displays this message.
=head4 --version
Displays the version information.
=head1 PROGRAM LOGIC
=over
=item * STEP 1
    # create the output directory if it does not exist
    if (!-e $destination) { mkdir $destination; }
=item * STEP 2
=over 3
=item 1. SOURCE is a single plain file -
huge-count.pl calls count.pl with the --tokenlist option on the single
plain file and prints all bigrams into one file. The count outputs are
also created in DESTINATION.
=item 2. SOURCE is a single flat directory containing multiple plain files -
huge-count.pl calls count.pl with the --tokenlist option on each file
present in the SOURCE directory. All files in SOURCE are treated as the
data files. If SOURCE contains sub-directories, these are simply skipped.
Intermediate bigram outputs are written in DESTINATION.
=item 3. SOURCE is a list of multiple plain files -
If more than two arguments are given, all arguments after the first are
considered to be SOURCE file names. count.pl is run separately on each of the
SOURCE files specified by argv[1], argv[2], ... argv[n] (skipping argv[0],
which should be DESTINATION). Intermediate results are created in DESTINATION.
=back
In summary, a large datafile can be provided to huge-count in the form of
a. A single plain file
b. A directory containing several plain files
c. Multiple plain files directly specified as command line arguments
In all these cases, count.pl with --tokenlist is separately run on SOURCE
files or parts of SOURCE file and intermediate results are written in
DESTINATION dir.
=back
=over
=item * STEP 3
Split the output file generated by count.pl with --tokenlist into smaller
files of N bigrams each.
=item * STEP 4
huge-sort.pl counts the unique bigrams and sorts them in alphabetical order.
=item * STEP 5
huge-merge.pl merges the bigrams of all sorted bigram files.
=back
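Steps 4 and 5 above amount to counting the bigrams of each part separately and then adding the counts together. The following is a minimal in-memory sketch of that idea, using hashes instead of the sorted intermediate files that huge-sort.pl and huge-merge.pl work with; the function names count_part and merge_parts are hypothetical.

```perl
use strict;
use warnings;

# Count how often each bigram line occurs in one part.
sub count_part {
    my %count;
    $count{$_}++ for @_;
    return \%count;
}

# Add up the per-part counts into one overall table.
sub merge_parts {
    my %total;
    for my $part (@_) {
        $total{$_} += $part->{$_} for keys %$part;
    }
    return \%total;
}

my @part1 = ('a<>b<>', 'b<>c<>', 'a<>b<>');
my @part2 = ('a<>b<>', 'c<>d<>');
my $merged = merge_parts(count_part(@part1), count_part(@part2));
# Print the merged counts in alphabetical order of the bigrams.
print "$_$merged->{$_}\n" for sort keys %$merged;
```

Because each part is counted on its own, only one part has to be held in memory at a time; the merged result is the same as counting all lines at once.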
=head1 OUTPUT
After huge-count finishes successfully, DESTINATION will contain -
=over
=item * Final bigram count file (huge-count.output) showing bigram counts in
the entire SOURCE after --remove and --uremove are applied.
=item * Final bigram count file (complete-huge-count.output) showing
bigram counts in the entire SOURCE without --remove and --uremove.
=back
=head1 BUGS
huge-count.pl doesn't consider bigrams at file boundaries. In other words,
the results of count.pl and huge-count.pl on the same data will
differ if --newLine is not used, because huge-count.pl runs count.pl
on multiple files separately and thus loses track of the bigrams
that span file boundaries. With --window not specified, one bigram is lost
at each file boundary, while with --window W several bigrams can be lost at
each boundary (every pair of tokens that lies within the window but on
opposite sides of the boundary).
The functionality of huge-count with --tokenlist is the same as that of
count.pl only if --newLine is used and all files start and end on sentence
boundaries. In other words, no sentence should be broken across the start or
end of any file given to huge-count.
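The boundary effect can be reproduced with a few lines of Perl. This sketch assumes the default case of adjacent bigrams and two already tokenized "files"; it is an illustration only, and the function name adjacent_bigrams is hypothetical.

```perl
use strict;
use warnings;

# Adjacent bigrams from a list of tokens.
sub adjacent_bigrams {
    my @t = @_;
    return map { $t[$_] . '<>' . $t[$_ + 1] . '<>' } 0 .. $#t - 1;
}

my @file1 = qw(the black cat);
my @file2 = qw(sat down);

# Counting over the concatenated data, as count.pl would on one file.
my @whole = adjacent_bigrams(@file1, @file2);

# Counting each file separately, as huge-count.pl does.
my @separate = (adjacent_bigrams(@file1), adjacent_bigrams(@file2));

my %seen = map { $_ => 1 } @separate;
my @lost = grep { !$seen{$_} } @whole;
print scalar(@whole), " bigrams in one pass, ",
      scalar(@separate), " when counted per file\n";
print "lost at the boundary: @lost\n";
```

The one bigram that is lost joins the last token of the first file with the first token of the second file.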
=head1 AUTHOR
Amruta Purandare, University of Minnesota, Duluth
Ted Pedersen, University of Minnesota, Duluth
tpederse at umn.edu
Ying Liu, University of Minnesota, Twin Cities
liux0395 at umn.edu
=head1 COPYRIGHT
Copyright (c) 2004-2010, Amruta Purandare, Ted Pedersen, and Ying Liu
This program is free software; you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free Software
Foundation; either version 2 of the License, or (at your option) any later
version.
This program is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with
this program; if not, write to
The Free Software Foundation, Inc.,
59 Temple Place - Suite 330,
Boston, MA 02111-1307, USA.
=cut
###############################################################################
#$0 contains the program name along with
#the complete path. Extract just the program
#name and use in error messages
$0=~s/.*\/(.+)/$1/;
###############################################################################
# ================================
# COMMAND LINE OPTIONS AND USAGE
# ================================
# command line options
use Cwd;
use Getopt::Long;
GetOptions ("help","version","tokenlist","token=s","nontoken=s","remove=i","uremove=i", "window=i","stop=s","split=i","frequency=i","ufrequency=i", "newLine");
# show help option
if(defined $opt_help)
{
$opt_help=1;
&showhelp();
exit;
}
# make sure tokenlist is used in huge-count.pl
if (!defined $opt_tokenlist)
{
print "--tokenlist is required!\n";
print STDERR "Type huge-count.pl --help for help.\n";
exit;
}
if ((defined $opt_remove) and (defined $opt_uremove))
{
if ($opt_remove > $opt_uremove)
{
print "--remove must be smaller than --uremove!\n";
print STDERR "Type huge-count.pl --help for help.\n";
exit;
}
}
if ((defined $opt_frequency) and (defined $opt_ufrequency))
{
if ($opt_frequency > $opt_ufrequency)
{
print "--frequency must be smaller than --ufrequency!\n";
print STDERR "Type huge-count.pl --help for help.\n";
exit;
}
}
# show version information
if(defined $opt_version)
{
$opt_version=1;
&showversion();
exit;
}
# show minimal usage message if fewer arguments
if($#ARGV<1)
{
&showminimal();
exit;
}
#############################################################################
# ========================
# CODE SECTION
# ========================
#accept the destination dir name
my $current_dir = getcwd;
$destdir=$ARGV[0];
if(-e $destdir)
{
if(!-d $destdir)
{
print STDERR "ERROR($0):
$destdir is not a directory.\n";
exit;
}
}
else
{
mkdir($destdir) || die "ERROR($0):
Could not create Destination Directory <$destdir> (code=$!).\n";
}
# ----------
# Counting
# ----------
# source = dir
if($#ARGV==1 && -d $ARGV[1])
{
$sourcedir=$ARGV[1];
opendir(DIR,$sourcedir) || die "ERROR($0):
Error (code=$!) in opening Source Directory <$sourcedir>.\n";
while(defined ($file=readdir DIR))
{
next if $file =~ /^\.\.?$/;
if(-f "$sourcedir/$file")
{
&runcount("$sourcedir/$file",$destdir);
}
}
}
# source is a single file
elsif($#ARGV==1 && -f $ARGV[1])
{
$source=$ARGV[1];
system("cp $source $destdir");
if(defined $opt_token)
{
system("cp $opt_token $destdir");
}
if(defined $opt_nontoken)
{
system("cp $opt_nontoken $destdir");
}
if(defined $opt_stop)
{
system("cp $opt_stop $destdir");
}
chdir $destdir;
$chdir=1;
&runcount($source,".");
}
# source contains multiple files
elsif($#ARGV > 1)
{
foreach $i (1..$#ARGV)
{
if(-f $ARGV[$i])
{
&runcount($ARGV[$i],$destdir);
}
else
{
print STDERR "ERROR($0):
ARGV[$i]=$ARGV[$i] should be a plain file.\n";
exit;
}
}
}
# unexpected input
else
{
&showminimal();
exit;
}
# --------------------
# Split bigrams
# --------------------
if(!defined $chdir)
{
chdir $destdir;
$chdir = 1;
}
# current dir is now destdir
opendir(DIR,".") || die "ERROR($0):
Error (code=$!) in opening Destination Directory <$destdir>.\n";
if (defined $opt_split)
{
print "split the bigrams files...\n";
while(defined ($file = readdir DIR))
{
if($file=~/\.bigrams$/)
{
system("huge-split.pl --split $opt_split $file");
system("/bin/rm $file");
}
}
}
else
{
print STDERR "Warning($0): You can run huge-sort.pl directly on the \n";
print STDERR "single tokenlist file if you don't want to split the tokenlist.\n";
}
# --------------------
# Sort bigrams
# --------------------
if (defined $opt_tokenlist)
{
print "sort the bigrams files...\n";
if(!defined $chdir)
{
chdir $destdir;
$chdir = 1;
}
# current dir is now destdir
opendir(DIR,".") || die "ERROR($0):
Error (code=$!) in opening Destination Directory <$destdir>.\n";
while(defined ($file = readdir DIR))
{
if(($file=~/\.bigrams/) and ($file !~ /sorted$/))
{
system("huge-sort.pl $file");
}
}
}
# --------------------
# Combine bigrams
# --------------------
print "combine the bigrams files...\n";
if(defined $chdir)
{
chdir $current_dir;
}
system("huge-merge.pl $destdir");
# --------------------
# Delete bigrams
# --------------------
print "delete the bigrams ...\n";
if (defined $opt_remove)
{
if (defined $opt_uremove)
{
if (defined $opt_frequency)
{
if (defined $opt_ufrequency)
{
system("huge-delete.pl --remove $opt_remove --uremove $opt_uremove --frequency $opt_frequency --ufrequency $opt_ufrequency $destdir/merge* $destdir/finalmerge");
}
else
{
system("huge-delete.pl --remove $opt_remove --uremove $opt_uremove --frequency $opt_frequency $destdir/merge* $destdir/finalmerge");
}
}
# --frequency not used
else
{
if (defined $opt_ufrequency)
{
system("huge-delete.pl --remove $opt_remove --uremove $opt_uremove --ufrequency $opt_ufrequency $destdir/merge* $destdir/finalmerge");
}
else
{
system("huge-delete.pl --remove $opt_remove --uremove $opt_uremove $destdir/merge* $destdir/finalmerge");
}
}
}
# --uremove not used
else
{
if (defined $opt_frequency)
{
if (defined $opt_ufrequency)
{
system("huge-delete.pl --remove $opt_remove --frequency $opt_frequency --ufrequency $opt_ufrequency $destdir/merge* $destdir/finalmerge");
}
else
{
system("huge-delete.pl --remove $opt_remove --frequency $opt_frequency $destdir/merge* $destdir/finalmerge");
}
}
# --frequency not used
else
{
if (defined $opt_ufrequency)
{
system("huge-delete.pl --remove $opt_remove --ufrequency $opt_ufrequency $destdir/merge* $destdir/finalmerge");
}
else
{
system("huge-delete.pl --remove $opt_remove $destdir/merge* $destdir/finalmerge");
}
}
}
}
# --remove not used
else
{
if (defined $opt_uremove)
{
if (defined $opt_frequency)
{
if (defined $opt_ufrequency)
{
system("huge-delete.pl --uremove $opt_uremove --frequency $opt_frequency --ufrequency $opt_ufrequency $destdir/merge* $destdir/finalmerge");
}
else
{
system("huge-delete.pl --uremove $opt_uremove --frequency $opt_frequency $destdir/merge* $destdir/finalmerge");
}
}
# --frequency not used
else
{
if (defined $opt_ufrequency)
{
system("huge-delete.pl --uremove $opt_uremove --ufrequency $opt_ufrequency $destdir/merge* $destdir/finalmerge");
}
else
{
system("huge-delete.pl --uremove $opt_uremove $destdir/merge* $destdir/finalmerge");
}
}
}
# --uremove not used
else
{
if (defined $opt_frequency)
{
if (defined $opt_ufrequency)
{
system("huge-delete.pl --frequency $opt_frequency --ufrequency $opt_ufrequency $destdir/merge* $destdir/finalmerge");
}
else
{
system("huge-delete.pl --frequency $opt_frequency $destdir/merge* $destdir/finalmerge");
}
}
# --frequency not used
else
{
if (defined $opt_ufrequency)
{
system("huge-delete.pl --ufrequency $opt_ufrequency $destdir/merge* $destdir/finalmerge");
}
}
}
}
$output="complete-huge-count.output";
if ((defined $opt_remove ) or (defined $opt_uremove) or (defined $opt_frequency) or (defined $opt_ufrequency))
{
system("mv $destdir/merge.* $destdir/$output");
system("mv $destdir/finalmerge $destdir/huge-count.output");
print STDERR "Check the output in $destdir/huge-count.output\n";
}
else
{
system("mv $destdir/merge.* $destdir/$output");
print STDERR "Check the output in $destdir/$output\n";
}
exit;
##############################################################################
# ==========================
# SUBROUTINE SECTION
# ==========================
sub runcount()
{
my $file=shift;
my $destdir=shift;
my $justfile=$file;
$justfile=~s/.*\/(.+)/$1/;
# --tokenlist used
if(defined $opt_tokenlist)
{
# --window used
if(defined $opt_window)
{
# --token used
if(defined $opt_token)
{
# --nontoken used
if(defined $opt_nontoken)
{
# --stop used
if(defined $opt_stop)
{
if(defined $opt_newLine)
{
system("count.pl --tokenlist --newLine --window $opt_window --token $opt_token --nontoken $opt_nontoken --stop $opt_stop $destdir/$justfile.bigrams $file");
}
else
{
system("count.pl --tokenlist --window $opt_window --token $opt_token --nontoken $opt_nontoken --stop $opt_stop $destdir/$justfile.bigrams $file");
}
}
# --stop not used
else
{
if(defined $opt_newLine)
{
system("count.pl --tokenlist --newLine --window $opt_window --token $opt_token --nontoken $opt_nontoken $destdir/$justfile.bigrams $file");
}
else
{
system("count.pl --tokenlist --window $opt_window --token $opt_token --nontoken $opt_nontoken $destdir/$justfile.bigrams $file");
}
}
}
# nontoken not used
else
{
# --stop used
if(defined $opt_stop)
{
if(defined $opt_newLine)
{
system("count.pl --tokenlist --newLine --window $opt_window --token $opt_token --stop $opt_stop $destdir/$justfile.bigrams $file");
}
else
{
system("count.pl --tokenlist --window $opt_window --token $opt_token --stop $opt_stop $destdir/$justfile.bigrams $file");
}
}
# --stop not used
else
{
if(defined $opt_newLine)
{
system("count.pl --tokenlist --newLine --window $opt_window --token $opt_token $destdir/$justfile.bigrams $file");
}
else
{
system("count.pl --tokenlist --window $opt_window --token $opt_token $destdir/$justfile.bigrams $file");
}
}
}
}
# --token not used
else
{
# --nontoken used
if(defined $opt_nontoken)
{
# --stop used
if(defined $opt_stop)
{
if(defined $opt_newLine)
{
system("count.pl --tokenlist --newLine --window $opt_window --nontoken $opt_nontoken --stop $opt_stop $destdir/$justfile.bigrams $file");
}
else
{
system("count.pl --tokenlist --window $opt_window --nontoken $opt_nontoken --stop $opt_stop $destdir/$justfile.bigrams $file");
}
}
# --stop not used
else
{
if(defined $opt_newLine)
{
system("count.pl --tokenlist --newLine --window $opt_window --nontoken $opt_nontoken $destdir/$justfile.bigrams $file");
}
else
{
system("count.pl --tokenlist --window $opt_window --nontoken $opt_nontoken $destdir/$justfile.bigrams $file");
}
}
}
# nontoken not used
else
{
# --stop used
if(defined $opt_stop)
{
if(defined $opt_newLine)
{
system("count.pl --tokenlist --newLine --window $opt_window --stop $opt_stop $destdir/$justfile.bigrams $file");
}
else
{
system("count.pl --tokenlist --window $opt_window --stop $opt_stop $destdir/$justfile.bigrams $file");
}
}
# --stop not used
else
{
if(defined $opt_newLine)
{
system("count.pl --tokenlist --newLine --window $opt_window $destdir/$justfile.bigrams $file");
}
else
{
system("count.pl --tokenlist --window $opt_window $destdir/$justfile.bigrams $file");
}
}
}
}
}
# --window not used
else
{
# --token used
if(defined $opt_token)
{
# --nontoken used
if(defined $opt_nontoken)
{
# --stop used
if(defined $opt_stop)
{
if(defined $opt_newLine)
{
system("count.pl --tokenlist --newLine --token $opt_token --nontoken $opt_nontoken --stop $opt_stop $destdir/$justfile.bigrams $file");
}
else
{
system("count.pl --tokenlist --token $opt_token --nontoken $opt_nontoken --stop $opt_stop $destdir/$justfile.bigrams $file");
}
}
# --stop not used
else
{
if(defined $opt_newLine)
{
system("count.pl --tokenlist --newLine --token $opt_token --nontoken $opt_nontoken $destdir/$justfile.bigrams $file");
}
else
{
system("count.pl --tokenlist --token $opt_token --nontoken $opt_nontoken $destdir/$justfile.bigrams $file");
}
}
}
# nontoken not used
else
{
# --stop used
if(defined $opt_stop)
{
if(defined $opt_newLine)
{
system("count.pl --tokenlist --newLine --token $opt_token --stop $opt_stop $destdir/$justfile.bigrams $file");
}
else
{
system("count.pl --tokenlist --token $opt_token --stop $opt_stop $destdir/$justfile.bigrams $file");
}
}
# --stop not used
else
{
if(defined $opt_newLine)
{
system("count.pl --tokenlist --newLine --token $opt_token $destdir/$justfile.bigrams $file");
}
else
{
system("count.pl --tokenlist --token $opt_token $destdir/$justfile.bigrams $file");
}
}
}
}
# --token not used
else
{
# --nontoken used
if(defined $opt_nontoken)
{
# --stop used
if(defined $opt_stop)
{
if(defined $opt_newLine)
{
system("count.pl --tokenlist --newLine --nontoken $opt_nontoken --stop $opt_stop $destdir/$justfile.bigrams $file");
}
else
{
system("count.pl --tokenlist --nontoken $opt_nontoken --stop $opt_stop $destdir/$justfile.bigrams $file");
}
}
# --stop not used
else
{
if(defined $opt_newLine)
{
system("count.pl --tokenlist --newLine --nontoken $opt_nontoken $destdir/$justfile.bigrams $file");
}
else
{
system("count.pl --tokenlist --nontoken $opt_nontoken $destdir/$justfile.bigrams $file");
}
}
}
# nontoken not used
else
{
# --stop used
if(defined $opt_stop)
{
if(defined $opt_newLine)
{
system("count.pl --tokenlist --newLine --stop $opt_stop $destdir/$justfile.bigrams $file");
}
else
{
system("count.pl --tokenlist --stop $opt_stop $destdir/$justfile.bigrams $file");
}
}
# --stop not used
else
{
if(defined $opt_newLine)
{
system("count.pl --tokenlist --newLine $destdir/$justfile.bigrams $file");
}
else
{
system("count.pl --tokenlist $destdir/$justfile.bigrams $file");
}
}
}
}
}
}
# --tokenlist not used
else
{
# --window used
if(defined $opt_window)
{
# --token used
if(defined $opt_token)
{
# --nontoken used
if(defined $opt_nontoken)
{
# --stop used
if(defined $opt_stop)
{
if(defined $opt_newLine)
{
system("count.pl --newLine --window $opt_window --token $opt_token --nontoken $opt_nontoken --stop $opt_stop $destdir/$justfile.bigrams $file");
}
else
{
system("count.pl --window $opt_window --token $opt_token --nontoken $opt_nontoken --stop $opt_stop $destdir/$justfile.bigrams $file");
}
}
# --stop not used
else
{
if(defined $opt_newLine)
{
system("count.pl --newLine --window $opt_window --token $opt_token --nontoken $opt_nontoken $destdir/$justfile.bigrams $file");
}
else
{
system("count.pl --window $opt_window --token $opt_token --nontoken $opt_nontoken $destdir/$justfile.bigrams $file");
}
}
}
# nontoken not used
else
{
# --stop used
if(defined $opt_stop)
{
if(defined $opt_newLine)
{
system("count.pl --newLine --window $opt_window --token $opt_token --stop $opt_stop $destdir/$justfile.bigrams $file");
}
else
{
system("count.pl --window $opt_window --token $opt_token --stop $opt_stop $destdir/$justfile.bigrams $file");
}
}
# --stop not used
else
{
if(defined $opt_newLine)
{
system("count.pl --newLine --window $opt_window --token $opt_token $destdir/$justfile.bigrams $file");
}
else
{
system("count.pl --window $opt_window --token $opt_token $destdir/$justfile.bigrams $file");
}
}
}
}
# --token not used
else
{
# --nontoken used
if(defined $opt_nontoken)
{
# --stop used
if(defined $opt_stop)
{
if(defined $opt_newLine)
{
system("count.pl --newLine --window $opt_window --nontoken $opt_nontoken --stop $opt_stop $destdir/$justfile.bigrams $file");
}
else
{
system("count.pl --window $opt_window --nontoken $opt_nontoken --stop $opt_stop $destdir/$justfile.bigrams $file");
}
}
# --stop not used
else
{
if(defined $opt_newLine)
{
system("count.pl --newLine --window $opt_window --nontoken $opt_nontoken $destdir/$justfile.bigrams $file");
}
else
{
system("count.pl --window $opt_window --nontoken $opt_nontoken $destdir/$justfile.bigrams $file");
}
}
}
# nontoken not used
else
{
# --stop used
if(defined $opt_stop)
{
if(defined $opt_newLine)
{
system("count.pl --newLine --window $opt_window --stop $opt_stop $destdir/$justfile.bigrams $file");
}
else
{
system("count.pl --window $opt_window --stop $opt_stop $destdir/$justfile.bigrams $file");
}
}
# --stop not used
else
{
if(defined $opt_newLine)
{
system("count.pl --newLine --window $opt_window $destdir/$justfile.bigrams $file");
}
else
{
system("count.pl --window $opt_window $destdir/$justfile.bigrams $file");
}
}
}
}
}
# --window not used
else
{
# --token used
if(defined $opt_token)
{
# --nontoken used
if(defined $opt_nontoken)
{
# --stop used
if(defined $opt_stop)
{
if(defined $opt_newLine)
{
system("count.pl --newLine --token $opt_token --nontoken $opt_nontoken --stop $opt_stop $destdir/$justfile.bigrams $file");
}
else
{
system("count.pl --token $opt_token --nontoken $opt_nontoken --stop $opt_stop $destdir/$justfile.bigrams $file");
}
}
# --stop not used
else
{
if(defined $opt_newLine)
{
system("count.pl --newLine --token $opt_token --nontoken $opt_nontoken $destdir/$justfile.bigrams $file");
}
else
{
system("count.pl --token $opt_token --nontoken $opt_nontoken $destdir/$justfile.bigrams $file");
}
}
}
# nontoken not used
else
{
# --stop used
if(defined $opt_stop)
{
if(defined $opt_newLine)
{
system("count.pl --newLine --token $opt_token --stop $opt_stop $destdir/$justfile.bigrams $file");
}
else
{
system("count.pl --token $opt_token --stop $opt_stop $destdir/$justfile.bigrams $file");
}
}
# --stop not used
else
{
if(defined $opt_newLine)
{
system("count.pl --newLine --token $opt_token $destdir/$justfile.bigrams $file");
}
else
{
system("count.pl --token $opt_token $destdir/$justfile.bigrams $file");
}
}
}
}
# --token not used
else
{
# --nontoken used
if(defined $opt_nontoken)
{
# --stop used
if(defined $opt_stop)
{
if(defined $opt_newLine)
{
system("count.pl --newLine --nontoken $opt_nontoken --stop $opt_stop $destdir/$justfile.bigrams $file");
}
else
{
system("count.pl --nontoken $opt_nontoken --stop $opt_stop $destdir/$justfile.bigrams $file");
}
}
# --stop not used
else
{
if(defined $opt_newLine)
{
system("count.pl --newLine --nontoken $opt_nontoken $destdir/$justfile.bigrams $file");
}
else
{
system("count.pl --nontoken $opt_nontoken $destdir/$justfile.bigrams $file");
}
}
}
# nontoken not used
else
{
# --stop used
if(defined $opt_stop)
{
if(defined $opt_newLine)
{
system("count.pl --newLine --stop $opt_stop $destdir/$justfile.bigrams $file");
}
else
{
system("count.pl --stop $opt_stop $destdir/$justfile.bigrams $file");
}
}
# --stop not used
else
{
if(defined $opt_newLine)
{
system("count.pl --newLine $destdir/$justfile.bigrams $file");
}
else
{
system("count.pl $destdir/$justfile.bigrams $file");
}
}
}
}
}
}
} # end of sub runcount()
#-----------------------------------------------------------------------------
#show minimal usage message
sub showminimal()
{
print "Usage: huge-count.pl --tokenlist [OPTIONS] DESTINATION [SOURCE]+";
print "\nTYPE huge-count.pl --help for help\n";
}
#-----------------------------------------------------------------------------
#show help
sub showhelp()
{
print "Usage: huge-count.pl --tokenlist [OPTIONS] DESTINATION [SOURCE]+
Efficiently runs count.pl on huge data.
SOURCE
Could be a -
1. single plain file
2. single flat directory containing multiple plain files
3. list of plain files
DESTINATION
Should be a directory where output is written.
REQUIRED PARAMETERS:
--tokenlist
This option is required. Prints out the list of all bigrams.
OPTIONS:
--split N
Number of bigrams in each separate bigrams file.
--token TOKENFILE
Specify a file containing Perl regular expressions that define the
tokenization scheme for counting.
--nontoken NOTOKENFILE
Specify a file containing Perl regular expressions of non-token
sequences that are removed prior to tokenization.
--stop STOPFILE
Specify a file containing Perl regular expressions of stop words
that are to be removed from the output bigrams.
--window W
Specify the window size for counting.
--remove L
Bigrams with counts less than L will be removed from the sample.
remove must be smaller than uremove.
--uremove L
Bigrams with counts more than L will be removed from the sample.
uremove must be bigger than remove.
--frequency F
Bigrams with counts less than F will not be displayed.
frequency must be smaller than ufrequency.
--ufrequency F
Bigrams with counts more than F will not be displayed.
ufrequency must be bigger than frequency.
--newLine
Prevents bigrams from spanning across the new-line characters.
--help
Displays this message.
--version
Displays the version information.
Type 'perldoc huge-count.pl' to view detailed documentation of huge-count.\n";
}
#------------------------------------------------------------------------------
#version information
sub showversion()
{
print 'huge-count.pl $Id: huge-count.pl,v 1.26 2011/03/31 23:04:04 tpederse Exp $';
print "\nEfficiently runs count.pl on huge data.\n";
print "Copyright (C) 2004-2011, Amruta Purandare, Ted Pedersen & Ying Liu.\n";
}
#############################################################################