NAME

vector-input.pl - This program builds the term index file and co-occurrence matrix that umls-similarity.pl uses to calculate vector relatedness.

SYNOPSIS

vector-input.pl takes a bigram frequency file as input and builds the index and the co-occurrence matrix.

DESCRIPTION

We build the index and co-occurrence matrix for the vector method of UMLS-Similarity. The index file helps to locate each term's vector by recording the start position and the length of its vector. The matrix file records every term's vector.
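The lookup that the index enables can be sketched as follows. This is only an illustration, not the code used by UMLS-Similarity: the helper name read_vector, the sample matrix contents, and the index layout (term, byte offset, byte length) are all assumptions made for the example, and the real file formats may differ.

```perl
use strict;
use warnings;

# Hypothetical helper: fetch one term's vector from the matrix file,
# given the (start, length) pair recorded for that term in the index.
sub read_vector {
    my ($fh, $start, $length) = @_;
    seek($fh, $start, 0) or die "seek failed: $!";
    my $n = read($fh, my $buf, $length);
    die "short read" unless defined $n && $n == $length;
    chomp $buf;
    return $buf;
}

# Assumed sample data: two vectors, one per line, and an index that
# maps each term to the byte offset and byte length of its line.
my $matrix = "attack 3 rate 1\ncancer 2\n";
my %index  = (heart => [0, 16], lung => [16, 9]);

# An in-memory filehandle keeps the sketch self-contained.
open my $fh, '<', \$matrix or die "cannot open in-memory matrix: $!";
my $vector = read_vector($fh, @{ $index{lung} });
print "$vector\n";
```

Because the start position and length are known in advance, a single seek and read retrieves the vector without scanning the rest of the matrix.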

See perldoc vector-input.pl

USAGE

vector-input.pl INDEX MATRIX BIGRAMFILE

example: vector-input.pl Index.txt Matrix.txt BigramsList.txt

INPUT

Required Arguments:

INDEX

Output file of vector-input.pl. It records the index of each term, together with the start position and length of its vector in the co-occurrence matrix.

MATRIX

Output file of vector-input.pl. Each line is the vector for one term: its co-occurring terms and their frequencies.

BIGRAMFILE

The input to vector-input.pl should be a single flat file generated by huge-count.pl of the Text-NSP package. If the bigram list was generated by count.pl, please use count2huge.pl to convert the results to the huge-count.pl format; it sorts the bigrams in alphabetical order.

When vector-input.pl generates the index and co-occurrence matrix files, it requires that all bigrams starting with the same term t1 are grouped together on adjacent lines, because at this step the bigrams are not stored in memory. When the first term of the bigrams changes, the program writes out the vector for the term t1 and its index position.

Alphabetically sorted bigrams also make it faster for the vector method of UMLS-Similarity to build its vectors. For each concept, the method searches the co-occurrence matrix to build the second-order vector. If the terms of the vector are sorted, the vector method can search the co-occurrence matrix from beginning to end using the index position and length. If the co-occurrence matrix is a huge file, this can save a lot of execution time.
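The grouping requirement can be illustrated with a minimal sketch of a streaming pass. This is not the actual vector-input.pl implementation: the helper name build_index_and_matrix, the count.pl-style input line layout ("t1<>t2<>n11 n1p np1" after a leading total-count line), and the output layouts are assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: walk bigram lines that are already grouped by their
# first term and build a matrix string plus index entries.
# Assumed input line layout: "t1<>t2<>n11 n1p np1".
sub build_index_and_matrix {
    my @lines  = @_;
    my $matrix = '';
    my @index;                 # each entry: [term, start offset, length]
    my ($current, @pairs);

    # Emit the vector collected so far for the current first term.
    my $flush = sub {
        return unless defined $current;
        my $row = join(' ', @pairs) . "\n";
        push @index, [ $current, length($matrix), length($row) ];
        $matrix .= $row;
        @pairs = ();
    };

    for my $line (@lines) {
        chomp $line;
        my ($t1, $t2, $counts) = split /<>/, $line;
        next unless defined $counts;        # skip the leading total-count line
        my ($freq) = split ' ', $counts;    # joint frequency of the bigram
        if (!defined $current || $t1 ne $current) {
            $flush->();                     # first term changed: emit previous vector
            $current = $t1;
        }
        push @pairs, "$t2 $freq";
    }
    $flush->();                             # emit the last term's vector
    return ($matrix, \@index);
}

# Assumed sample input, grouped by first term.
my ($matrix, $index) = build_index_and_matrix(
    "6\n",
    "heart<>attack<>3 4 3\n",
    "heart<>rate<>1 4 1\n",
    "lung<>cancer<>2 2 2\n",
);
print $matrix;
printf "%s %d %d\n", @$_ for @$index;
```

Only the pairs for the current first term are held in memory at any time, which is why bigrams sharing a first term must sit on adjacent lines: if they were scattered, the same term would be flushed more than once.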

Other Options:

--stat

The bigram file is from statistics.pl rather than count.pl.

--cutoff SCORE

Only use those ngrams whose score is greater than SCORE.

--help

Displays the help information.

--version

Displays the version information.

AUTHOR

Ying Liu, liux0395 at umn.edu

SEE ALSO

home page: www.tc.umn.edu/~liux0395

COPYRIGHT

Copyright (C) 2010, Ying Liu

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.