NAME

huge-combine3.pl - Combine two trigram files created by count.pl into a single file

SYNOPSIS

Combines two trigram files created by count.pl into a single trigram file.

USAGE

huge-combine3.pl [OPTIONS] COUNT1 COUNT2

INPUT

Required Arguments:

COUNT1 and COUNT2

huge-combine3.pl takes two trigram files created by count.pl as input. If COUNT1 and COUNT2 are of unequal sizes, it is strongly recommended that COUNT1 be the smaller file and COUNT2 the larger trigram file.

Each line in COUNT1 and COUNT2 should be formatted as:

word1<>word2<>n11 n1p np1

where word1<>word2 is a trigram, n11 is the joint frequency score of this trigram, n1p is the number of trigrams in which word1 is the first word, while np1 is the number of trigrams having word2 as the second word.
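The line format can be illustrated with a small parsing sketch. It follows the format exactly as written above (tokens separated by `<>`, followed by the three counts n11, n1p and np1). The names `CountLine` and `parse_count_line` are hypothetical and are not part of Text-NSP; the script itself is written in Perl, and Python is used here only for illustration.

```python
from typing import NamedTuple, Tuple


class CountLine(NamedTuple):
    """One parsed line (hypothetical container, not part of Text-NSP)."""
    words: Tuple[str, ...]
    n11: int
    n1p: int
    np1: int


def parse_count_line(line: str) -> CountLine:
    """Parse a line such as 'word1<>word2<>n11 n1p np1'.

    Assumes the counts follow the final '<>' separator and are
    separated by whitespace, as described in the text above.
    """
    # Everything before the last '<>' holds the tokens; the rest holds the counts.
    tokens_part, _, counts_part = line.rstrip("\n").rpartition("<>")
    words = tuple(tokens_part.split("<>"))
    n11, n1p, np1 = (int(value) for value in counts_part.split())
    return CountLine(words, n11, n1p, np1)
```

Calling `parse_count_line("word1<>word2<>10 25 30")` yields the tokens `("word1", "word2")` with n11 = 10, n1p = 25 and np1 = 30.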

Optional Arguments:

--help

Displays this message.

--version

Displays the version information.

OUTPUT

The output lists every trigram that appears in COUNT1, in COUNT2, or in both, along with its updated scores. Scores are updated as follows:

1:

If a trigram appears in both COUNT1 and COUNT2, its n11 scores from the two files are added.

e.g. If COUNT1 contains a trigram word1<>word2<>n11 n1p np1 and COUNT2 has a trigram word1<>word2<>m11 m1p mp1

Then, the new n11 score of trigram word1<>word2 is n11+m11

2:

If two trigrams, one from COUNT1 and one from COUNT2, share a common first word, their n1p scores are added.

e.g. If COUNT1 contains a trigram word1<>word2<>n11 n1p np1 and if COUNT2 contains a trigram word1<>word3<>m11 m1p mp1

Then, the n1p marginal score of word1 is updated to n1p+m1p

3:

If two trigrams, one from COUNT1 and one from COUNT2, share a common second word, their np1 scores are added.

e.g. If COUNT1 contains a trigram word1<>word2<>n11 n1p np1 and if COUNT2 contains a trigram word3<>word2<>m11 m1p mp1

Then, the np1 marginal score of word2 is updated to np1+mp1
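The three update rules can be sketched as follows. This is a minimal illustration under stated assumptions, not the script's actual implementation: both inputs are assumed to be already parsed into in-memory dictionaries keyed by `(word1, word2)` with `(n11, n1p, np1)` values, each word's marginal is assumed to be consistent within a file, and a word missing from one file is assumed to contribute 0. The function name `merge_counts` is hypothetical.

```python
def merge_counts(count1, count2):
    """Merge two count tables according to rules 1-3 above.

    count1, count2: dicts mapping (word1, word2) -> (n11, n1p, np1).
    Returns a dict of the same shape (hypothetical helper for illustration).
    """
    def marginals(table):
        # Collect each file's marginal per first word and per second word.
        first, second = {}, {}
        for (w1, w2), (_, n1p, np1) in table.items():
            first[w1] = n1p
            second[w2] = np1
        return first, second

    first1, second1 = marginals(count1)
    first2, second2 = marginals(count2)

    missing = (0, 0, 0)
    merged = {}
    # Every entry found in either file appears in the output.
    for key in count1.keys() | count2.keys():
        w1, w2 = key
        # Rule 1: joint frequencies are added.
        n11 = count1.get(key, missing)[0] + count2.get(key, missing)[0]
        # Rule 2: marginals of a shared first word are added.
        n1p = first1.get(w1, 0) + first2.get(w1, 0)
        # Rule 3: marginals of a shared second word are added.
        np1 = second1.get(w2, 0) + second2.get(w2, 0)
        merged[key] = (n11, n1p, np1)
    return merged
```

Because this sketch holds both tables in memory, it is only suitable for small inputs; it is meant to show the arithmetic, not how very large files should be processed.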

AUTHOR

Amruta Purandare, Ted Pedersen. University of Minnesota at Duluth.

COPYRIGHT

Copyright (c) 2004, 2009

Amruta Purandare, University of Minnesota, Duluth. pura0010@umn.edu

Ted Pedersen, University of Minnesota, Duluth. tpederse@umn.edu

Cyrus Shaoul, University of Alberta, Edmonton. cyrus.shaoul@ualberta.ca

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to

The Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
