NAME

huge-split.pl - Split bigram files from huge-count.pl into pieces.

DESCRIPTION

See perldoc huge-split.pl

USAGE

huge-split.pl [OPTIONS] SOURCE

INPUT

Required Arguments:

SOURCE

Input to huge-split.pl should be a file generated by huge-count.pl or count.pl with the --tokenlist option. The result files have the same name as the input source file, and each split file has a sequence number as its extension.

--split N

This parameter must be set. huge-split.pl divides the bigram tokenlist generated by count.pl or huge-count.pl into parts. Each part created with --split N will contain N lines. The value of N should be chosen such that huge-sort.pl can be run efficiently on any part containing N lines from the file that contains all bigrams.
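The splitting behaviour described above can be illustrated with a minimal Python sketch. This is not the huge-split.pl implementation: the function name split_by_lines is hypothetical, and the part naming (the source name followed by ".1", ".2", and so on) is an assumption based on the description of a sequence number extension.

```python
def split_by_lines(source, n):
    """Split the file `source` into parts of at most `n` lines each.

    Parts are written next to the source as source.1, source.2, ...
    (an assumed naming scheme). Returns the part paths in order.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")

    parts = []
    out = None
    try:
        with open(source, "r", encoding="utf-8") as src:
            for i, line in enumerate(src):
                # Start a new part every n lines.
                if i % n == 0:
                    if out is not None:
                        out.close()
                    path = f"{source}.{len(parts) + 1}"
                    out = open(path, "w", encoding="utf-8")
                    parts.append(path)
                out.write(line)
    finally:
        # Close the last open part, also on error.
        if out is not None:
            out.close()
    return parts
```

The sketch streams the input one line at a time, so only the current line is held in memory regardless of the value of N.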

We suggest setting N to the number of KB of memory you have. For example, if the computer has 8 GB of RAM, which is roughly 8,000,000 KB, N should be set to 8000000.
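The arithmetic behind this suggestion can be written out as a small Python helper. The function name suggested_split_size is hypothetical, and the sketch assumes the decimal convention used in the example above (1 GB = 1,000,000 KB).

```python
def suggested_split_size(ram_gb):
    """Return a suggested --split value: the RAM size expressed in KB.

    Assumes 1 GB = 1,000,000 KB, matching the example in the text.
    """
    if ram_gb <= 0:
        raise ValueError("ram_gb must be positive")
    return round(ram_gb * 1_000_000)


print(suggested_split_size(8))  # 8000000
```

The result would then be passed on the command line, for example as --split 8000000.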

Other Options:

--help

Displays this message.

--version

Displays the version information.

AUTHOR

Amruta Purandare, Ted Pedersen, Ying Liu. University of Minnesota at Duluth.

COPYRIGHT

Copyright (c) 2004-2011

Ted Pedersen, University of Minnesota, Duluth. tpederse@umn.edu

Ying Liu, University of Minnesota, Twin Cities. liux0395@umn.edu

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to

The Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
