NAME

huge-count3.pl - Divide huge text into pieces, run count.pl for trigrams separately on each piece, and then combine the results

SYNOPSIS

Runs count.pl efficiently on a huge data set.

USAGE

huge-count3.pl [OPTIONS] DESTINATION [SOURCE]+
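As a rough illustration of the argument order above, the following sketch assembles a command line from the documented pieces. The helper name build_command and the paths out_dir and corpus.txt are hypothetical, and only options described in this document are used.

```python
def build_command(destination, sources, split=None, new_line=False, remove=None):
    """Assemble an argument list in the order: options, DESTINATION, SOURCE(s)."""
    command = ["huge-count3.pl"]
    if split is not None:
        command += ["--split", str(split)]
    if new_line:
        command.append("--newLine")
    if remove is not None:
        command += ["--remove", str(remove)]
    command.append(destination)   # DESTINATION comes before the sources
    command += list(sources)      # one or more SOURCE arguments
    return command


# Hypothetical paths, for illustration only.
print(" ".join(build_command("out_dir", ["corpus.txt"], split=20, new_line=True)))
```

The resulting list could be handed to subprocess.run, assuming huge-count3.pl is installed and on the PATH.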

INPUT

Required Arguments:

[SOURCE]+

Input to huge-count3.pl should be one of the following:

1. A single plain text file

2. A single flat directory containing multiple plain text files

3. A list of multiple plain text files

DESTINATION

A complete path to a writable directory to which huge-count3.pl can write all intermediate and final output files. If DESTINATION does not exist, a new directory is created; otherwise, the existing directory is simply used for writing the output files.

NOTE: If DESTINATION already exists and the names of some of the existing files in DESTINATION clash with the names of the output files created by huge-count3.pl, these files will be overwritten without prompting the user.

Optional Arguments:

--split P

This option should be specified when SOURCE is a single plain text file. huge-count3.pl will divide the given SOURCE file into P (approximately) equal parts, run count.pl separately on each part, and then recombine the trigram counts from all these intermediate result files into a single trigram output that shows the trigram counts in SOURCE.

If the SOURCE file contains M lines, each part created with --split P will contain approximately M/P lines. The value of P should be chosen such that count.pl can be run efficiently on any part containing M/P lines from SOURCE. As the number of words per line differs from file to file, it is recommended that P be large enough that each part contains at most a million words in total.
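The M/P arithmetic can be sketched as follows. This is a minimal illustration of dividing M lines into P nearly equal contiguous parts, not the script's actual splitting code; the function name split_lines is hypothetical.

```python
def split_lines(lines, parts):
    """Divide a list of lines into `parts` contiguous chunks of nearly equal size."""
    if parts < 1:
        raise ValueError("parts must be at least 1")
    base, extra = divmod(len(lines), parts)
    chunks = []
    start = 0
    for i in range(parts):
        # The first `extra` chunks receive one additional line each.
        size = base + (1 if i < extra else 0)
        chunks.append(lines[start:start + size])
        start += size
    return chunks


lines = ["line %d" % n for n in range(10)]   # M = 10
chunks = split_lines(lines, 3)               # P = 3
print([len(c) for c in chunks])              # [4, 3, 3]
```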

--token TOKENFILE

Specify a file containing Perl regular expressions that define the tokenization scheme for counting. This will be provided to count.pl's --token option.

--nontoken NOTOKENFILE

Specify a file containing Perl regular expressions of non-token sequences that are removed prior to tokenization. This will be provided to count.pl's --nontoken option.

--stop STOPFILE

Specify a file of Perl regular expressions that define the stop words to be omitted from the output trigrams. The stop list can be used in two modes:

AND mode declared with '@stop.mode = AND' on the 1st line of the STOPFILE

or

OR mode declared using '@stop.mode = OR' on the 1st line of the STOPFILE.

In AND mode, trigrams whose constituent words are all stop words are removed, while in OR mode, trigrams in which one or more constituent words are stop words are removed from the output.
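The two stop modes can be sketched as a predicate over a trigram. This assumes that AND means every constituent word matches a stop pattern and OR means at least one does; the function name is_stopped and the sample patterns are hypothetical, and Python's re module stands in for the Perl regular expressions of a real STOPFILE.

```python
import re


def is_stopped(trigram, stop_patterns, mode):
    """Return True if the trigram would be dropped under the given stop mode."""
    # One flag per word: does any stop pattern match it?
    flags = [any(p.search(word) for p in stop_patterns) for word in trigram]
    if mode == "AND":
        return all(flags)   # every word is a stop word
    if mode == "OR":
        return any(flags)   # at least one word is a stop word
    raise ValueError("mode must be 'AND' or 'OR'")


# Hypothetical stop patterns, anchored so they match whole tokens only.
stop_patterns = [re.compile(r"^the$"), re.compile(r"^of$"), re.compile(r"^a$")]

print(is_stopped(("the", "king", "of"), stop_patterns, "AND"))  # False
print(is_stopped(("the", "king", "of"), stop_patterns, "OR"))   # True
```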

--window W

Trigrams are formed from tokens that appear within a window of W consecutive tokens, so up to W-3 other tokens may intervene among the three tokens of a trigram. Same as count.pl's --window option.
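A small sketch of windowed trigram formation, under the assumption that each token is combined, in order, with every pair of tokens drawn from the W-1 tokens that follow it. The function name windowed_trigrams is hypothetical and this is not count.pl's own code.

```python
from itertools import combinations


def windowed_trigrams(tokens, window=3):
    """List ordered trigrams whose three tokens fit inside `window` consecutive tokens."""
    if window < 3:
        raise ValueError("window must be at least 3 for trigrams")
    result = []
    for i, first in enumerate(tokens):
        # The remaining two tokens come from the next window-1 positions.
        following = tokens[i + 1:i + window]
        for second, third in combinations(following, 2):
            result.append((first, second, third))
    return result


tokens = "a b c d".split()
print(windowed_trigrams(tokens, 3))  # contiguous trigrams only
print(windowed_trigrams(tokens, 4))  # also trigrams that skip one token
```

With a window of 3 only contiguous trigrams are produced; widening the window to 4 adds trigrams such as (a, b, d) and (a, c, d).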

--remove L

Trigrams with counts less than L in the entire SOURCE data are removed from the sample. The counts of the removed trigrams are not counted in any marginal totals. This has the same effect as count.pl's --remove option.

--frequency F

Trigrams with counts less than F in the entire SOURCE are not displayed. The counts of the skipped trigrams ARE counted in the marginal totals. In other words, --frequency in huge-count3.pl has the same effect as count.pl's --frequency option.
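The difference between the two cutoffs can be illustrated with a toy table. This sketch reduces the marginal totals to a single grand total for brevity, which is a simplification; the function name apply_cutoffs and the sample counts are hypothetical.

```python
def apply_cutoffs(counts, remove=0, frequency=0):
    """Return (displayed trigrams, grand total) after applying both cutoffs."""
    # --remove: drop rare trigrams from the sample, so they leave the total too.
    kept = {t: c for t, c in counts.items() if c >= remove}
    total = sum(kept.values())
    # --frequency: hide rare trigrams from the display, but keep them in the total.
    shown = {t: c for t, c in kept.items() if c >= frequency}
    return shown, total


counts = {("a", "b", "c"): 5, ("b", "c", "d"): 2, ("c", "d", "e"): 1}

shown, total = apply_cutoffs(counts, remove=2)
print(len(shown), total)     # 2 7

shown, total = apply_cutoffs(counts, frequency=2)
print(len(shown), total)     # 2 8
```

Both calls display the same two trigrams, but only --remove lowers the total.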

--newLine

Switches ON the --newLine option in count.pl. This will prevent trigrams from spanning across the lines.

Other Options:

--help

Displays this message.

--version

Displays the version information.

PROGRAM LOGIC

STEP 3

The intermediate count results created in STEP 2 are recombined in a pair-wise fashion. For P separate count output files C1, C2, C3, ..., CP:

C1 and C2 are recombined first, and the result is written to huge-count3.output.

The counts from each of C3, C4, ..., CP are then combined with (added to) huge-count3.output. Each time two files are recombined, the smaller of the two is the one that is loaded.
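The pair-wise recombination can be sketched with in-memory tables standing in for the count files. This is one reading of the step, assuming that "loading the smaller file" means iterating over the smaller table and adding its counts into the larger; the names combine and combine_all are hypothetical.

```python
from collections import Counter


def combine(a, b):
    """Add the counts of the smaller table into the larger one and return it."""
    small, large = (a, b) if len(a) <= len(b) else (b, a)
    for trigram, count in small.items():
        large[trigram] += count   # a missing key in a Counter counts as 0
    return large


def combine_all(parts):
    """Fold C1, C2, ..., CP into one running output table."""
    tables = [Counter(p) for p in parts]   # copy so the inputs are not modified
    if not tables:
        return Counter()
    output = tables[0]
    for table in tables[1:]:
        output = combine(output, table)
    return output


c1 = {("a", "b", "c"): 2, ("b", "c", "d"): 1}
c2 = {("a", "b", "c"): 1}
c3 = {("b", "c", "d"): 4, ("c", "d", "e"): 1}
print(combine_all([c1, c2, c3]))
```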

STEP 4

After all files are recombined, the resulting huge-count3.output is sorted in descending order of trigram counts. If --remove is specified, trigrams with counts less than the specified value in the final huge-count3.output file are removed from the sample and their counts are subtracted from the marginal totals. If --frequency is specified, trigrams with counts less than the specified value are simply omitted from the output.

OUTPUT

After huge-count3 finishes successfully, DESTINATION will contain -

BUGS

huge-count3.pl doesn't consider trigrams at file boundaries. In other words, the results of count.pl and huge-count3.pl on the same data file will differ if --newLine is not used, because huge-count3.pl runs count.pl on multiple files separately and thus loses track of the trigrams that span file boundaries. With --window not specified, two trigrams are lost at each file boundary (those that begin on the last two tokens of one file and continue into the next), and more are lost with --window W.

The functionality of huge-count3.pl is the same as that of count.pl only if --newLine is used and all files start and end on sentence boundaries. In other words, no sentence should be broken across the start or end of any file given to huge-count3.pl.
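The boundary effect can be reproduced on a toy token sequence. The sketch below counts contiguous trigrams (the default window) over the whole sequence and over two halves counted separately; the function name trigrams and the sample tokens are hypothetical.

```python
def trigrams(tokens):
    """Contiguous trigrams of a token list (the default window of 3)."""
    return [tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)]


text = "a b c d e f".split()
whole = trigrams(text)

# Count the two halves separately, as if they were separate files.
part1, part2 = text[:3], text[3:]
separate = trigrams(part1) + trigrams(part2)

# Trigrams that span the boundary appear in `whole` but not in `separate`.
lost = [t for t in whole if t not in separate]
print(lost)
```

In this toy case the missing trigrams are exactly those that start in the first half and end in the second.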

AUTHOR

Amruta Purandare, Ted Pedersen. University of Minnesota at Duluth.

COPYRIGHT

Copyright (c) 2004, 2009

Amruta Purandare, University of Minnesota, Duluth. pura0010@umn.edu

Ted Pedersen, University of Minnesota, Duluth. tpederse@umn.edu

Cyrus Shaoul, University of Alberta, Edmonton. cyrus.shaoul@ualberta.ca

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to

The Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
