bitsimat.pl - Build a similarity matrix from binary context vectors
bitsimat.pl [OPTIONS] VECTOR
The input file represents 5 vectors, each with 4 possible features. The format of the input file is sparse, so if a feature has no value it is not listed.
cat input
Output =>
5 4 12 1 1 3 1 4 1 2 1 3 1 1 1 3 1 4 1 2 1 3 1 2 1 3 1
Compute the pairwise similarities between all 5 binary vectors and display them in a 5x5 matrix.
bitsimat.pl input --format f4.2
5 25 1 1.00 2 0.41 3 1.00 4 0.41 5 0.41 1 0.41 2 1.00 3 0.41 4 1.00 5 1.00 1 1.00 2 0.41 3 1.00 4 0.41 5 0.41 1 0.41 2 1.00 3 0.41 4 1.00 5 1.00 1 0.41 2 1.00 3 0.41 4 1.00 5 1.00
Type bitsimat.pl --help for a quick summary of options
bitsimat.pl --help
Constructs a similarity matrix for the given binary context vectors. A similarity matrix shows the pairwise similarities between all the different vectors. Vectors are represented in an N x M matrix, where N is the number of vectors and M is the number of features. All NxN combinations of vector pairs will be measured for similarity and stored in a matrix.
A binary vector file as created by vector constructor programs in Toolkit/vector.
By default, VECTOR is assumed to be in sparse format.
For sparse vectors, the first line of the VECTOR file should show 3 numbers separated in spaces as -
N M NNZ
where
N = Number of vectors/rows M = Number of dimensions/columns NNZ = Total number of non-zero values
Each line after this line should show a single sparse vector on each line. A sparse vector is a list of pairs of numbers separated by space such that the first number in a pair is the column index of a non-zero value in the vector and the second number is the non-zero value itself corresponding to that index. bitsimat treats all non-zero values as 1s in the bit vectors.
Column indices start with 1.
Sample Sparse Input
9 12 19 4 1 7 1 1 1 1 1 4 1 8 1 11 1 1 1 7 1 4 1 5 1 10 1 1 1 5 1 8 1 12 1 1 1 2 1 8 1
Explanation :
The first line shows that there are total 9 sparse vectors represented in 12 dimensions. Hence, bitsimat will consider next 9 lines in the VECTOR file as 9 sparse vectors and will allow the column indices only in the range [1-12]. Here, the total non-zero values are 19.
Note that, each line shows a single sparse vector as a list of space separated 'index value' pairs. e.g. 2nd line shows that the 1st vector is non-zero (with value 1) at column indices 4 and 7. 3rd line shows that the 2nd sparse vector is non-zero only at index 1. And so on ...
If VECTOR uses dense format, switch --dense should be selected.
The 1st line in the dense VECTOR file should show -
N M
for N vectors represented using M columns.
Each line thereafter should show a single dense vector. All dense vectors should have M space separated numbers that indicate the values at the corresponding columns in the vector.
All non-zero values are treated as 1s in the binary vectors.
Sample Dense Input
9 12 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 1 1 0 0 0 0 0 1 0 0 0 0
shows same vectors as shown in Sample Sparse Input in dense format.
VECTOR file could also optionally show the KEY file name on the first line. If the VECTOR file starts with the <keyfile> name on the 1st line, the above information should begin from the 2nd line onwards.
This switch should be selected if the given VECTORs are in dense format. This will also create the output similarity matrix in the dense format. By default, sparse format is assumed for both input vectors and output similarity matrix.
Specify the similarity measure to be used to construct the similarity matrix.
Possible option values for --sim are
match Match Coefficient dice Dice Coefficient overlap Overlap Coefficient jaccard Jaccard Coefficient cosine Cosine Coefficient
The following table shows how the similarity values are computed by each of these measures.
match |intersection(X,Y)| dice 2*|intersection(X,Y)|/(|X|+|Y|) overlap |intersection(X,Y)|/(min(|X|,|Y|)) jaccard |intersection(X,Y)|/|union(X,Y)| cosine |intersection(X,Y)|/sqrt(|X|*|Y|)
where X and Y represent any two binary vectors |X| shows the norm or number of bits set in vector X
Specifies numeric format for representing output similarity values.
Possible values of FORM are
iN -> integer format allocating total N bytes/digits for each entry fN.M -> floating point format allocating total N bytes/digits for each entry of which last M digits show the fractional part.
For matching coefficient, default is i8 and for other measures, default is f16.10.
Displays this message.
Displays the version information.
If the input VECTORs are in sparse format (default), output is also created in sparse format, while, if the input vectors are in dense format, output is also in dense format.
The 1st line in the sparse output shows 2 space separated numbers -
N NNZ1
N = Number of vectors (same as the N in the VECTOR file) NNZ1 = Number of non-zero entries in the output similarity matrix
Each line after this will show the corresponding row of the similarity matrix in sparse format. Specifically, each i'th line after the above line shows the list of 'j SIM' pairs separated by space such that SIM is the non-zero similarity value between the i'th and j'th vectors in the given VECTOR file.
Note that, only those pairs are listed in which the SIM value is non-zero.
Sample Sparse Output
9 43 1 1.000 3 0.354 4 0.500 5 0.707 2 1.000 3 0.500 4 0.707 7 1.000 9 0.577 1 0.354 2 0.500 3 1.000 4 0.354 5 0.500 7 0.500 8 0.289 9 0.577 1 0.500 2 0.707 3 0.354 4 1.000 7 0.707 9 0.408 1 0.707 3 0.500 5 1.000 6 1.000 8 0.408 2 1.000 3 0.500 4 0.707 7 1.000 9 0.577 3 0.289 6 0.408 8 1.000 9 0.333 2 0.577 3 0.577 4 0.408 7 0.577 8 0.333 9 1.000
Shows the actual full similarity matrix -
9 1.000 0.000 0.354 0.500 0.707 0.000 0.000 0.000 0.000 0.000 1.000 0.500 0.707 0.000 0.000 1.000 0.000 0.577 0.354 0.500 1.000 0.354 0.500 0.000 0.500 0.289 0.577 0.500 0.707 0.354 1.000 0.000 0.000 0.707 0.000 0.408 0.707 0.000 0.500 0.000 1.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 1.000 0.000 0.408 0.000 0.000 1.000 0.500 0.707 0.000 0.000 1.000 0.000 0.577 0.000 0.000 0.289 0.000 0.000 0.408 0.000 1.000 0.333 0.000 0.577 0.577 0.408 0.000 0.000 0.577 0.333 1.000
Both these matrices show the pairwise cosine similarities among the same 9 vectors shown in the Sample Sparse and Dense Input sections.
If --dense is selected, output is created in dense format and shows all similarity values including 0s.
Dense output shows the number of vectors (N) on the first line. Each i'th line after this shows N space separated numbers such that a number at column index j shows the pairwise similarity among the i'th and j'th vectors.
Sample Dense Output
Shows the pairwise similarities among the 9 vectors shown in the Sample Dense Input section.
Note that the similarity matrix will always be square and symmetric.
If the first line of the input VECTOR file shows the <keyfile> tag, bitsimat.pl will display the same <keyfile> tag on the first line of output.
bitsimat is dependent on the following CPAN modules :
Amruta Purandare, University of Pittsburgh Ted Pedersen, University of Minnesota, Duluth tpederse at d.umn.edu
Copyright (c) 2002-2008, Amruta Purandare and Ted Pedersen
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program; if not, write to
The Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
To install Text::SenseClusters, copy and paste the appropriate command in to your terminal.
cpanm
cpanm Text::SenseClusters
CPAN shell
perl -MCPAN -e shell install Text::SenseClusters
For more information on module installation, please visit the detailed CPAN module installation guide.