NAME

svdpackout.pl - Reconstruct post-SVD form of matrix from singular values output by SVDPACKC

SYNOPSIS

 svdpackout.pl [OPTIONS] lav2 lao2

Type svdpackout.pl --help for a quick summary of options

DESCRIPTION

Reconstructs a matrix from its singular values and singular vectors created by SVDPACKC. The result of this is essentially a "smoothed" matrix equal in size to the original pre-SVDPACKC matrix, but where the non-significant dimensions have been removed.

SVDPACKC decomposes the original input matrix into three matrices:

 U  (M x k)
 S  (k x k)
 VT (k x N)

where k is the dimension value to which we have reduced the matrix (maxprs), M is the number of rows, and N is the number of columns.

We will normally keep the first k singular values (since those are organized greatest to least in S, and represent the most significant dimensions in the data). We do this by keeping the first k rows of S and VT, and the first k columns of U.

When we use --rowonly we reconstruct an M x k matrix by taking the product U * S. This matrix represents all of our original rows in a reduced column/dimension space, which is then clustered by Cluto.

If we don't use --rowonly we reconstruct the original M x N matrix, but only using k dimensions (which gives us a kind of smoothing effect). That is, we take the product (M x k) * (k x k) * (k x N). Note that discriminate.pl defaults to --rowonly reconstruction, at least in part in the interest of computational efficiency.
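The two reconstruction modes described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: svdpackout.pl itself is a Perl program built on PDL, and the helper names (matmul, diag, truncate, reconstruct) and the toy values of u, s, and vt below are hypothetical, not taken from the program.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def diag(values):
    """Build a square diagonal matrix from a list of singular values."""
    n = len(values)
    return [[values[i] if i == j else 0.0 for j in range(n)]
            for i in range(n)]

def truncate(u, s, vt, k):
    """Keep the first k columns of U, the first k singular values,
    and the first k rows of VT."""
    return [row[:k] for row in u], s[:k], vt[:k]

def reconstruct(u, s, vt, k, rowonly=False):
    """Return U * S (M x k) if rowonly, else U * S * VT (M x N)."""
    uk, sk, vtk = truncate(u, s, vt, k)
    us = matmul(uk, diag(sk))                   # M x k
    return us if rowonly else matmul(us, vtk)   # M x N

# Hypothetical toy decomposition: M = 3 rows, N = 2 columns.
u = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
s = [3.0, 2.0]                      # ordered greatest to least
vt = [[0.0, 1.0], [1.0, 0.0]]

reduced_rows = reconstruct(u, s, vt, k=1, rowonly=True)   # 3 x 1
smoothed = reconstruct(u, s, vt, k=1)                     # 3 x 2
```

With k = 1 the row-only result has one column per row of the original matrix, while the full result keeps the original 3 x 2 shape but is built from a single dimension.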

INPUT

Required Arguments:

lav2

Binary output file created by SVDPACKC las2

lao2

ASCII output file created by SVDPACKC las2

Optional Arguments:

--rowonly

Only the row vectors are reconstructed. By default, svdpackout.pl reconstructs the entire matrix. This option may not be used together with --output. This is the default setting for discriminate.pl.

--output OUTPUT

Specifies the form of the output to be written by this program:

  reconstruct - re-constructs the full rank-k matrix, output to STDOUT 
  rowonly - same as --rowonly, output to STDOUT
  components - output the U, S, and VT matrices to U.txt, S.txt, VT.txt

--sqrt

In --rowonly reconstruction, take the square root of S (k x k). This was the default method in SenseClusters 1.00 and previous. Has no effect when used with full reconstruction. Provided mainly for backwards compatibility.
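As a rough illustration of what --sqrt changes in row-only mode, the sketch below scales the columns of U either by the singular values or by their square roots. The function name scale_columns and the sample numbers are hypothetical; the actual program performs this step with PDL.

```python
import math

def scale_columns(u, s, use_sqrt=False):
    """Return U * S, or U * sqrt(S) when use_sqrt is set.

    S is diagonal, so it is passed as a plain list of singular values,
    and multiplying by it scales column j of U by s[j].
    """
    weights = [math.sqrt(v) for v in s] if use_sqrt else list(s)
    return [[x * w for x, w in zip(row, weights)] for row in u]

# Hypothetical values chosen so the square roots are exact.
u_example = [[1.0, 2.0], [3.0, 4.0]]
s_example = [9.0, 4.0]

plain = scale_columns(u_example, s_example)                 # U * S
rooted = scale_columns(u_example, s_example, use_sqrt=True) # U * sqrt(S)
```

Taking the square root compresses the gap between large and small singular values, so the leading dimensions dominate the row vectors less than they do with plain U * S.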

--negatives

Set negative values in reconstructed matrices that are between -1 and 0 to 0 (except in component output). This option is provided mainly for backwards compatibility as this was the default behavior in SenseClusters 1.00 and previous.
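A minimal sketch of the clamping that --negatives describes is shown below. It assumes the interval is open at both ends (values strictly between -1 and 0), which the text above does not spell out; the function name clamp_small_negatives is hypothetical.

```python
def clamp_small_negatives(matrix):
    """Replace entries strictly between -1 and 0 with 0.

    Assumption: the bounds are exclusive, so -1 itself and anything
    below it are left unchanged.
    """
    return [[0.0 if -1.0 < x < 0.0 else x for x in row] for row in matrix]

# Hypothetical reconstructed matrix.
sample = [[-0.5, 0.25], [-2.0, -1.0]]
clamped = clamp_small_negatives(sample)
```

Only -0.5 is changed here; 0.25, -2.0, and -1.0 fall outside the assumed interval.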

--format FORM

Specifies the numeric format for representing output matrix values. The following formats are supported with --format:

 iN - Output matrix will contain integer values each occupying N spaces

 fM.N - Output matrix will contain real values each occupying a total of M spaces, of which the last N digits show the fractional part. The M spaces for each entry include the decimal point and +/- sign, if any.

Default format value is f16.10.
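One way to read these format strings is as printf-style field specifications: iN as a width-N integer field and fM.N as a width-M field with N digits after the decimal point. The sketch below makes that reading concrete. The mapping is an assumption based on the descriptions above, not a statement of how svdpackout.pl parses --format, and the function name to_printf is hypothetical.

```python
import re

def to_printf(form):
    """Translate 'iN' or 'fM.N' into a printf-style format string."""
    m = re.fullmatch(r"i(\d+)", form)
    if m:
        return "%" + m.group(1) + "d"
    m = re.fullmatch(r"f(\d+)\.(\d+)", form)
    if m:
        return "%" + m.group(1) + "." + m.group(2) + "f"
    raise ValueError("unsupported format: " + form)

# The default f16.10 gives a 16-character field with 10 decimals.
default_cell = to_printf("f16.10") % 0.5
integer_cell = to_printf("i4") % 7
```

Under this reading every cell is padded to a fixed width, which keeps the columns of the output matrix aligned.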

Other Options :

--help

Displays this message.

--version

Displays the version information.

OUTPUT

svdpackout.pl displays a matrix reconstructed from the singular triplets created by SVD. By default, the entire matrix (the product of the left singular vectors, the singular values, and the right singular vectors) is reconstructed. When --rowonly is ON, or when --output rowonly is set, only the reduced row vectors are built. When --output components is set, the three component matrices, U, S, and VT, are output separately.

SYSTEM REQUIREMENTS

SVDPACKC - http://netlib.org/svdpack/ (also available in /External)
PDL - http://search.cpan.org/dist/PDL/

BUGS

In version 1.00 and before, we took the square root of the k x k matrix before recombining in --rowonly mode. We have changed that to make it an option (via --sqrt), since the motives for that are at this point unclear. The following discussion, originally formulated by Mahesh Joshi, motivates our change to a default of not taking the square root.

Deerwester et al. (S. Deerwester, S.T. Dumais, G.W. Furnas, T.K. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41:391-407, 1990) give a nice explanation of the reason for combining M x k and k x k to evaluate similarity between whatever was represented along the rows in the original matrix (contexts in order 1 and features in order 2).

They also mention the combination of M x k and SquareRoot(k x k), but that is for evaluating the correlation between a term and a document (as opposed to a term-term or document-document pair). In that case the combination of N x k and SquareRoot(k x k) is also needed, since a correlation analysis of a heterogeneous pair (one term and one document) requires one vector from each of these two combinations. All this is covered in the "Technical Details" section of the Deerwester et al. paper.

But this does not seem to explain the use of the square root in the type of analysis we are doing - which is homogeneous, i.e. we are only analyzing term-term or context-context similarities.
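The homogeneous case can be checked numerically. When the rows of VT are orthonormal, dot products between rows of U * S equal dot products between the corresponding rows of the full product U * S * VT, so the M x k matrix alone is enough for row-to-row similarity. The sketch below verifies this on a small made-up example; all values and helper names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]

# Hypothetical values with M = 3, k = 2, N = 2.
u = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
s = [2.0, 1.0]
vt = [[0.6, 0.8], [-0.8, 0.6]]      # rows are orthonormal

us = [[x * w for x, w in zip(row, s)] for row in u]   # U * S, M x k
full = matmul(us, vt)                                 # U * S * VT, M x N

# Row-by-row dot products in each representation (M x M matrices).
gram_rowonly = matmul(us, transpose(us))
gram_full = matmul(full, transpose(full))

max_diff = max(abs(a - b)
               for row_a, row_b in zip(gram_rowonly, gram_full)
               for a, b in zip(row_a, row_b))
```

Here max_diff comes out at floating-point rounding level, which is consistent with using plain U * S, without the square root, for context-context or term-term comparisons.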

In version 1.00 and before, in both --rowonly and full reconstruction, we set negative values between 0 and -1 to 0. The motive for that is unclear, so negative values are now preserved by default, and the option --negatives is provided to restore the earlier behavior.

We would now generally recommend that svdpackout.pl be run without --sqrt and --negatives. That is, do not take the square root of the S matrix when doing reconstruction, and let negative values stand. This is the default behavior as of 1.01 and beyond.

AUTHORS

 Amruta Purandare, University of Pittsburgh

 Richard Wicentowski, Swarthmore College
 richardw at cs.swarthmore.edu

 Ted Pedersen, University of Minnesota, Duluth
 tpederse at d.umn.edu

COPYRIGHT

Copyright (c) 2002-2008, Amruta Purandare, Richard Wicentowski, Ted Pedersen

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to

 The Free Software Foundation, Inc.,
 59 Temple Place - Suite 330,
 Boston, MA  02111-1307, USA.