
NAME

svdpackout.pl - Reconstruct post-SVD form of matrix from singular values output by SVDPACKC

SYNOPSIS

` svdpackout.pl [OPTIONS] lav2 lao2`

Type `svdpackout.pl --help` for a quick summary of options.

DESCRIPTION

Reconstructs a matrix from the singular values and singular vectors produced by SVDPACKC. The result is essentially a "smoothed" matrix, equal in size to the original pre-SVDPACKC matrix, from which the non-significant dimensions have been removed.

SVDPACKC decomposes the original input matrix into three matrices:

• U : M x k (rows by k)
• S : k x k (the k singular values of the input matrix, stored in a diagonal matrix)
• VT : k x N (k by columns, obtained by transposing V : N x k)

where k is the dimension value to which we have reduced the matrix (maxprs), M is the number of rows, and N is the number of columns.

We will normally keep the first k singular values (since those are organized greatest to least in S, and represent the most significant dimensions in the data). We do this by keeping the first k rows of S and VT, and the first k columns of U.

When we use --rowonly we reconstruct an M x k matrix by taking the product U * S. This matrix represents all of our original rows in a reduced column/dimension space, which is then clustered by Cluto.

If we don't use --rowonly we reconstruct the original M x N matrix, but using only k dimensions (which gives us a kind of smoothing effect). That is, we take the product (M x k) * (k x k) * (k x N). Note that discriminate.pl defaults to --rowonly reconstruction, at least in part in the interest of computational efficiency.
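The two reconstruction modes can be sketched in a few lines of plain Python. This is only an illustration of the matrix products described above, not the PDL code that svdpackout.pl itself runs; the function names `matmul` and `reconstruct`, and the choice to pass S as a plain list of its k diagonal values, are assumptions made for the example.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def reconstruct(u, s, vt, rowonly=False):
    """Rebuild a matrix from truncated SVD components.

    u  : M x k matrix (list of M rows, each of length k)
    s  : the k singular values (the diagonal of the k x k matrix S)
    vt : k x N matrix (list of k rows, each of length N)
    """
    # U * S: multiplying by a diagonal matrix scales column j of U by s[j].
    us = [[value * s[j] for j, value in enumerate(row)] for row in u]
    if rowonly:
        return us              # M x k, the reduced row vectors
    return matmul(us, vt)      # M x N, the smoothed full matrix


# Toy components with M = 3, N = 4, k = 2.
u = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
s = [3.0, 2.0]
vt = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]

reduced = reconstruct(u, s, vt, rowonly=True)   # 3 x 2
full = reconstruct(u, s, vt)                    # 3 x 4
```

The row-only result has k columns rather than N, which is why it is the cheaper of the two to compute and to cluster.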

INPUT

Required Arguments:

lav2

Binary output file created by SVDPACKC las2

lao2

ASCII output file created by SVDPACKC las2

Optional Arguments:

--rowonly

Only the row vectors are reconstructed. By default, svdpackout.pl reconstructs the entire matrix. This option may not be used with --output. It is the default setting for discriminate.pl.

--output OUTPUT

Specifies the form of the output to be written by this program:

```
reconstruct - re-constructs the full rank-k matrix, output to STDOUT
rowonly     - same as --rowonly, output to STDOUT
components  - output the U, S, and VT matrices to U.txt, S.txt, VT.txt
```

--sqrt

In --rowonly reconstruction, take the square root of S (k x k) before multiplying. This was the default method in SenseClusters 1.00 and earlier. It has no effect when used with full reconstruction, and is provided mainly for backwards compatibility.
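As a rough illustration of what this option changes, the sketch below scales the columns of U either by the singular values or by their square roots. The helper name `scale_rows` and its `use_sqrt` flag are hypothetical and exist only for this example; they are not part of svdpackout.pl.

```python
import math


def scale_rows(u, s, use_sqrt=False):
    """Return U * S, or U * sqrt(S) when use_sqrt is set.

    u : M x k matrix as a list of rows
    s : the k singular values (the diagonal of S)
    """
    weights = [math.sqrt(x) for x in s] if use_sqrt else list(s)
    return [[value * w for value, w in zip(row, weights)] for row in u]


u = [[1.0, 2.0]]
s = [4.0, 9.0]

default_rows = scale_rows(u, s)                 # [[4.0, 18.0]]
legacy_rows = scale_rows(u, s, use_sqrt=True)   # [[2.0, 6.0]]
```

Taking the square root compresses the weight given to the leading dimensions relative to the later ones, so the two settings can produce different similarity values between rows.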

--negatives

Set negative values in reconstructed matrices that are between -1 and 0 to 0 (except in component output). This option is provided mainly for backwards compatibility, as this was the default behavior in SenseClusters 1.00 and earlier.
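The effect of this option might be sketched as follows. The function name `zero_small_negatives` is hypothetical, and the text does not say whether the endpoints -1 and 0 are included, so the sketch assumes a strictly open interval.

```python
def zero_small_negatives(matrix):
    """Replace entries strictly between -1 and 0 with 0.

    Assumption: the interval is open, so an entry of exactly -1
    (or anything below it) is left unchanged.
    """
    return [[0.0 if -1.0 < value < 0.0 else value for value in row]
            for row in matrix]


cleaned = zero_small_negatives([[-0.5, 0.3, -2.0]])
```

Here only -0.5 is replaced; 0.3 and -2.0 pass through untouched.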

--format FORM

Specifies the numeric format used for output matrix values. The following formats are supported with --format:

```
iN   - Output matrix will contain integer values, each occupying N spaces.

fM.N - Output matrix will contain real values, each occupying M spaces in total, of which the last N digits show the fractional part. The M spaces for each entry include the decimal point and the +/- sign, if any.
```

Default format value is f16.10.
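To make the two format families concrete, here is a small Python sketch that interprets an iN or fM.N string and renders a single value. The function `format_value` is hypothetical. Truncating toward zero in the integer case is an assumption (it mirrors a C-style %d conversion); the text above does not say how svdpackout.pl converts real values to integers.

```python
import re


def format_value(value, spec="f16.10"):
    """Render one matrix entry according to an iN or fM.N specification."""
    int_spec = re.fullmatch(r"i(\d+)", spec)
    if int_spec:
        width = int(int_spec.group(1))
        # Assumption: the fractional part is dropped, as %d would do in C.
        return f"{int(value):{width}d}"
    real_spec = re.fullmatch(r"f(\d+)\.(\d+)", spec)
    if real_spec:
        width = int(real_spec.group(1))
        digits = int(real_spec.group(2))
        # The width counts the sign and the decimal point as well.
        return f"{value:{width}.{digits}f}"
    raise ValueError(f"unsupported format: {spec}")


short_real = format_value(3.14159, "f8.3")   # '   3.142'
as_integer = format_value(42.9, "i5")        # '   42'
default_real = format_value(-2.5)            # '   -2.5000000000'
```

With the default f16.10, every entry takes 16 characters, 10 of them after the decimal point, so columns line up regardless of sign.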

Other Options:

--help

Displays this message.

--version

Displays the version information.

OUTPUT

svdpackout.pl displays a matrix reconstructed from the singular triplets created by SVD. By default, the entire matrix (the product of the left singular vectors, the singular values, and the right singular vectors) is reconstructed. When --rowonly is on, or when --output rowonly is set, only the reduced row vectors are built. When --output components is set, the three component matrices U, S, and VT are output separately.

SYSTEM REQUIREMENTS

SVDPACKC - http://netlib.org/svdpack/ (also available in /External)
PDL - http://search.cpan.org/dist/PDL/

BUGS

In version 1.00 and before, we took the square root of the k x k matrix before recombining in --rowonly mode. We have changed that to make it an option (via --sqrt), since the motives for it are at this point unclear. The following discussion, originally formulated by Mahesh Joshi, motivates our change to a default of not taking the square root.

Deerwester et al. (S. Deerwester, S.T. Dumais, G.W. Furnas, T.K. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41:391-407, 1990.) give a nice explanation of the reason for combining M x k and k x k to evaluate similarity between whatever was represented along the rows of the original matrix (contexts in order 1 and features in order 2).

They also mention the use of the combination M x k and SquareRoot(k x k), but that is for evaluating the correlation between a term and a document (as opposed to a term-term or document-document pair). In that case the combination N x k and SquareRoot(k x k) is needed as well, since a correlation analysis of a heterogeneous pair (one term and one document) requires one vector from each of these two combinations. All of this is covered in the "Technical Details" section of the Deerwester et al. paper.

But this does not seem to explain the use of the square root in the type of analysis we are doing - which is homogeneous, i.e. we are only analyzing term-term or context-context similarities.
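The distinction between the two cases can be checked on a toy example. The sketch below is not taken from SenseClusters; the matrices and helper names are made up for illustration. It shows that dot products between rows of U * sqrt(S) and rows of V * sqrt(S) reproduce entries of the reconstructed matrix U * S * VT (the heterogeneous case), while dot products between rows of U * S reproduce entries of that reconstructed matrix multiplied by its own transpose (the homogeneous case).

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def scale_columns(rows, weights):
    """Multiply column j of a matrix by weights[j]."""
    return [[value * w for value, w in zip(row, weights)] for row in rows]


# Toy components: U is M x k, V is N x k, s holds the k singular values.
u = [[1.0, 0.0], [0.0, 1.0]]
v = [[0.0, 1.0], [1.0, 0.0]]
s = [4.0, 9.0]

# Reconstructed matrix X = U * S * VT, computed entry by entry.
x = [[sum(u[i][d] * s[d] * v[j][d] for d in range(len(s)))
      for j in range(len(v))] for i in range(len(u))]

# Heterogeneous case: row i against column j, using sqrt(S) on both sides.
root = [math.sqrt(value) for value in s]
u_half = scale_columns(u, root)
v_half = scale_columns(v, root)
row_vs_col = [[dot(ui, vj) for vj in v_half] for ui in u_half]

# Homogeneous case: row i against row j, using U * S.
us = scale_columns(u, s)
row_vs_row = [[dot(a, b) for b in us] for a in us]
x_xt = [[dot(a, b) for b in x] for a in x]
```

In this example `row_vs_col` equals `x` and `row_vs_row` equals `x_xt`, which is consistent with the argument that row-to-row comparisons call for U * S rather than U * sqrt(S).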

In both --rowonly and full reconstruction, we previously set negative values between 0 and -1 to 0. The motive for that is unclear, so negative values are now preserved by default, and the option --negatives is provided to restore the old behavior.

We would now generally recommend that svdpackout.pl be run without --sqrt and --negatives. That is, do not take the square root of the S matrix when doing reconstruction, and let negative values stand. This is the default behavior as of 1.01 and beyond.

AUTHORS

``` Amruta Purandare, University of Pittsburgh

Richard Wicentowski, Swarthmore College
richardw at cs.swarthmore.edu

Ted Pedersen, University of Minnesota, Duluth
tpederse at d.umn.edu```

``` The Free Software Foundation, Inc.,