#!/usr/local/bin/perl -w
=head1 NAME
bitsimat.pl - Build a similarity matrix from binary context vectors
=head1 SYNOPSIS
bitsimat.pl [OPTIONS] VECTOR
The input file represents 5 vectors, each with 4 possible features. The
format of the input file is sparse, so if a feature has no value it is
not listed.
cat input
Output =>
5 4 12
1 1 3 1 4 1
2 1 3 1
1 1 3 1 4 1
2 1 3 1
2 1 3 1
Compute the pairwise similarities between all 5 binary vectors and
display them in a 5x5 matrix.
bitsimat.pl input --format f4.2
Output =>
5 25
1 1.00 2 0.41 3 1.00 4 0.41 5 0.41
1 0.41 2 1.00 3 0.41 4 1.00 5 1.00
1 1.00 2 0.41 3 1.00 4 0.41 5 0.41
1 0.41 2 1.00 3 0.41 4 1.00 5 1.00
1 0.41 2 1.00 3 0.41 4 1.00 5 1.00
Type C<bitsimat.pl --help> for a quick summary of options
=head1 DESCRIPTION
Constructs a similarity matrix for the given binary context vectors. A
similarity matrix shows the pairwise similarities between all the
different vectors. Vectors are represented in an N x M matrix, where N
is the number of vectors and M is the number of features. All NxN
combinations of vector pairs will be measured for similarity and stored
in a matrix.
=head1 INPUT
=head2 Required Arguments:
=head3 VECTOR
A binary vector file as created by vector constructor programs in Toolkit/vector.
=head4 Sparse Format (default)
By default, VECTOR is assumed to be in sparse format.
For sparse vectors, the first line of the VECTOR file should show 3
numbers separated in spaces as -
N M NNZ
where
N = Number of vectors/rows
M = Number of dimensions/columns
NNZ = Total number of non-zero values
Each line after this line should show a single sparse vector on each line.
A sparse vector is a list of pairs of numbers separated by space such
that the first number in a pair is the column index of a non-zero value in the
vector and the second number is the non-zero value itself corresponding to
that index. bitsimat treats all non-zero values as 1s in the bit vectors.
Column indices start with 1.
Sample Sparse Input
9 12 19
4 1 7 1
1 1
1 1 4 1 8 1 11 1
1 1 7 1
4 1
5 1 10 1
1 1
5 1 8 1 12 1
1 1 2 1 8 1
Explanation :
=over
=item 1.
The first line shows that there are total 9 sparse vectors
represented in 12 dimensions. Hence, bitsimat will consider next 9 lines
in the VECTOR file as 9 sparse vectors and will allow the column indices
only in the range [1-12]. Here, the total non-zero values are 19.
=item 2.
Note that, each line shows a single sparse vector as a list of
space separated 'index value' pairs. e.g. 2nd line shows that the 1st
vector is non-zero (with value 1) at column indices 4 and 7. 3rd line shows
that the 2nd sparse vector is non-zero only at index 1. And so on ...
=back
=head4 Dense Format
If VECTOR uses dense format, switch --dense should be selected.
The 1st line in the dense VECTOR file should show -
N M
for N vectors represented using M columns.
Each line thereafter should show a single dense vector.
All dense vectors should have M space separated numbers that indicate the
values at the corresponding columns in the vector.
All non-zero values are treated as 1s in the binary vectors.
Sample Dense Input
9 12
0 0 0 1 0 0 1 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 0
1 0 0 1 0 0 0 1 0 0 1 0
1 0 0 0 0 0 1 0 0 0 0 0
0 0 0 1 0 0 0 0 0 0 0 0
0 0 0 0 1 0 0 0 0 1 0 0
1 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 1 0 0 1 0 0 0 1
1 1 0 0 0 0 0 1 0 0 0 0
shows same vectors as shown in Sample Sparse Input in dense format.
VECTOR file could also optionally show the KEY file name on the first line.
If the VECTOR file starts with the <keyfile> name on the 1st line, the above
information should begin from the 2nd line onwards.
=head2 Optional Arguments:
=head4 --dense
This switch should be selected if the given VECTORs are in dense format.
This will also create the output similarity matrix in the dense format. By
default, sparse format is assumed for both input vectors and output
similarity matrix.
=head4 --measure SIM
Specify the similarity measure to be used to construct the similarity matrix.
Possible option values for --sim are
match Match Coefficient
dice Dice Coefficient
overlap Overlap Coefficient
jaccard Jaccard Coefficient
cosine Cosine Coefficient
The following table shows how the similarity values are computed by each
of these measures.
match |intersection(X,Y)|
dice 2*|intersection(X,Y)|/(|X|+|Y|)
overlap |intersection(X,Y)|/(min(|X|,|Y|))
jaccard |intersection(X,Y)|/|union(X,Y)|
cosine |intersection(X,Y)|/sqrt(|X|*|Y|)
where X and Y represent any two binary vectors
|X| shows the norm or number of bits set in vector X
=head4 --format FORM
Specifies numeric format for representing output similarity values.
Possible values of FORM are
iN -> integer format allocating total N bytes/digits for each entry
fN.M -> floating point format allocating total N bytes/digits for each entry of which last M digits show the fractional part.
For matching coefficient, default is i8 and for other measures, default is
f16.10.
=head3 Other Options :
=head4 --help
Displays this message.
=head4 --version
Displays the version information.
=head1 OUTPUT
If the input VECTORs are in sparse format (default), output is also created
in sparse format, while, if the input vectors are in dense format, output
is also in dense format.
=head2 Sparse Output (Default)
The 1st line in the sparse output shows 2 space separated numbers -
N NNZ1
where
N = Number of vectors (same as the N in the VECTOR file)
NNZ1 = Number of non-zero entries in the output similarity matrix
Each line after this will show the corresponding row of the
similarity matrix in sparse format. Specifically, each i'th line after the
above line shows the list of 'j SIM' pairs separated by space such that
SIM is the non-zero similarity value between the i'th and j'th vectors in the
given VECTOR file.
Note that, only those pairs are listed in which the SIM value is non-zero.
Sample Sparse Output
9 43
1 1.000 3 0.354 4 0.500 5 0.707
2 1.000 3 0.500 4 0.707 7 1.000 9 0.577
1 0.354 2 0.500 3 1.000 4 0.354 5 0.500 7 0.500 8 0.289 9 0.577
1 0.500 2 0.707 3 0.354 4 1.000 7 0.707 9 0.408
1 0.707 3 0.500 5 1.000
6 1.000 8 0.408
2 1.000 3 0.500 4 0.707 7 1.000 9 0.577
3 0.289 6 0.408 8 1.000 9 0.333
2 0.577 3 0.577 4 0.408 7 0.577 8 0.333 9 1.000
Shows the actual full similarity matrix -
9
1.000 0.000 0.354 0.500 0.707 0.000 0.000 0.000 0.000
0.000 1.000 0.500 0.707 0.000 0.000 1.000 0.000 0.577
0.354 0.500 1.000 0.354 0.500 0.000 0.500 0.289 0.577
0.500 0.707 0.354 1.000 0.000 0.000 0.707 0.000 0.408
0.707 0.000 0.500 0.000 1.000 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.000 1.000 0.000 0.408 0.000
0.000 1.000 0.500 0.707 0.000 0.000 1.000 0.000 0.577
0.000 0.000 0.289 0.000 0.000 0.408 0.000 1.000 0.333
0.000 0.577 0.577 0.408 0.000 0.000 0.577 0.333 1.000
Both these matrices show the pairwise cosine similarities among the same 9
vectors shown in the Sample Sparse and Dense Input sections.
=head2 Dense Output
If --dense is selected, output is created in dense format and shows all
similarity values including 0s.
Dense output shows the number of vectors (N) on the first line. Each i'th line
after this shows N space separated numbers such that a number at column
index j shows the pairwise similarity among the i'th and j'th vectors.
Sample Dense Output
9
1.000 0.000 0.354 0.500 0.707 0.000 0.000 0.000 0.000
0.000 1.000 0.500 0.707 0.000 0.000 1.000 0.000 0.577
0.354 0.500 1.000 0.354 0.500 0.000 0.500 0.289 0.577
0.500 0.707 0.354 1.000 0.000 0.000 0.707 0.000 0.408
0.707 0.000 0.500 0.000 1.000 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.000 1.000 0.000 0.408 0.000
0.000 1.000 0.500 0.707 0.000 0.000 1.000 0.000 0.577
0.000 0.000 0.289 0.000 0.000 0.408 0.000 1.000 0.333
0.000 0.577 0.577 0.408 0.000 0.000 0.577 0.333 1.000
Shows the pairwise similarities among the 9 vectors shown in the Sample
Dense Input section.
Note that the similarity matrix will always be square and symmetric.
If the first line of the input VECTOR file shows the <keyfile> tag,
bitsimat.pl will display the same <keyfile> tag on the first line of
output.
=head1 SYSTEM REQUIREMENTS
bitsimat is dependent on the following CPAN modules :
=over
=item Bit::Vector - L<http://search.cpan.org/dist/Bit-Vector/>
=item Set::Scalar - L<http://search.cpan.org/dist/Set-Scalar/>
=back
=head1 AUTHORS
Amruta Purandare, University of Pittsburgh
Ted Pedersen, University of Minnesota, Duluth
tpederse at d.umn.edu
=head1 COPYRIGHT
Copyright (c) 2002-2008, Amruta Purandare and Ted Pedersen
This program is free software; you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free Software
Foundation; either version 2 of the License, or (at your option) any later
version.
This program is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with
this program; if not, write to
The Free Software Foundation, Inc.,
59 Temple Place - Suite 330,
Boston, MA 02111-1307, USA.
=cut
###############################################################################
# THE CODE STARTS HERE
use Bit::Vector;
use Set::Scalar;
#$0 contains the program name along with
#the complete path. Extract just the program
#name and use in error messages
$0=~s/.*\/(.+)/$1/;
###############################################################################
# ================================
# COMMAND LINE OPTIONS AND USAGE
# ================================
# command line options
use Getopt::Long;
GetOptions ("help","version","measure=s","format=s","dense");
# show help option
if(defined $opt_help)
{
$opt_help=1;
&showhelp();
exit;
}
# show version information
if(defined $opt_version)
{
$opt_version=1;
&showversion();
exit;
}
# show minimal usage message if no arguments
if($#ARGV<0)
{
&showminimal();
exit;
}
if(!defined $opt_measure)
{
$opt_measure="cosine";
}
if($opt_measure !~ /(mat(ch)?)|(cos(ine)?)|(jac(card)?)|(dice)|(over(lap)?)/)
{
print STDERR "ERROR($0):
Wrong value of --measure=$opt_measure\n";
exit;
}
#############################################################################
# ================================
# INITIALIZATION AND INPUT
# ================================
if(!defined $ARGV[0])
{
print STDERR "ERROR($0):
Please specify the Vector file name...\n";
exit;
}
#accept the input file name
$infile=$ARGV[0];
if(!-e $infile)
{
print STDERR "ERROR($0):
Vector file <$infile> doesn't exist...\n";
exit;
}
open(IN,$infile) || die "Error($0):
Error(code=$!) in opening <$infile> file.\n";
# format for printing output
if(defined $opt_format)
{
# integer
if($opt_format=~/^i(\d+)$/)
{
$format="%$1d";
# as no sim. measure takes -ve value
# no need to check for underflow
$upper_format="";
while(length($upper_format)<($1-1))
{
$upper_format.="9";
}
}
# float
elsif($opt_format=~/^f(\d+)\.(\d+)$/)
{
$format="%$1\.$2f";
$upper_format="";
while(length($upper_format)<($1-$2-2))
{
$upper_format.="9";
}
$upper_format.=".";
while(length($upper_format)<($1-1))
{
$upper_format.="9";
}
}
else
{
print STDERR "ERROR($0):
Wrong format value --format=$opt_format.\n";
exit;
}
}
# default
else
{
# i8 for match
if($opt_measure=~/mat/)
{
$format="%8d";
$upper_format="9999999";
}
# f16.10 for other measures
else
{
$format="%16.10f";
$upper_format="9999.9999999999";
}
}
##############################################################################
# ==========================
# CODE SECTION
# ==========================
if(defined $opt_dense)
{
$line_num=0;
while(<IN>)
{
$line_num++;
# trimming extra spaces
chomp;
s/\s*$//g;
s/^\s*//g;
s/\s+/ /g;
# line 1 should show either KEY file or rows cols
if($line_num==1)
{
$line=$_;
if(/keyfile/)
{
# when input starts with <keyfile> tag
# output displays same on the 1st line
print "$_\n";
$line=<IN>;
$line_num++;
}
if($line=~/^\s*(\d+)\s+(\d+)\s*$/)
{
$rows=$1;
$cols=$2;
print "$rows\n";
}
else
{
print STDERR "ERROR($0):
Line $line_num in Vector file <$infile> should show the number of rows
and columns.\n";
exit;
}
next;
}
# each vector should have exactly specified number of elements
@row_ele=split(/\s+/);
if(scalar(@row_ele) != $cols)
{
print STDERR "ERROR($0):
Line $line_num in Vector file <$infile> should show $cols number of
elements.\n";
exit;
}
# loading a bit vector
$vector=Bit::Vector->new($cols);
foreach $ind (0..$#row_ele)
{
if($row_ele[$ind] != 0)
{
$vector->Bit_On($ind);
}
}
push @vectors,$vector;
}
# number of vectors should be same as the number specified earlier at
# line 1 or 2
if(scalar(@vectors) != $rows)
{
print STDERR "ERROR($0):
Vector file <$infile> doesn't contain $rows number of vectors.\n";
exit;
}
# computing similarity measure between each pair of vectors
foreach $i (0..$rows-1)
{
foreach $j (0..$rows-1)
{
if($i != $j)
{
$inter=Bit::Vector->new($cols);
$inter->And($vectors[$i],$vectors[$j]);
$inter_size=$inter->Norm();
if($opt_measure =~ /^mat(ch)?$/)
{
$sim=$inter_size;
}
elsif($opt_measure =~ /^dice$/)
{
$sim=(2*$inter_size)/($vectors[$i]->Norm()+$vectors[$j]->Norm());
}
elsif($opt_measure =~ /^jac(card)?$/)
{
$union=Bit::Vector->new($cols);
$union->Union($vectors[$i],$vectors[$j]);
if($union->Norm()!=0)
{
$sim=($inter_size)/($union->Norm());
}
else
{
$sim=0;
}
}
elsif($opt_measure =~ /^over(lap)?$/)
{
$size1=$vectors[$i]->Norm();
$size2=$vectors[$j]->Norm();
$min=($size1 < $size2) ? $size1 : $size2;
$sim=($inter_size)/$min;
}
elsif($opt_measure =~ /^cos(ine)?$/)
{
$size1=$vectors[$i]->Norm();
$size2=$vectors[$j]->Norm();
$denom=sqrt($size1*$size2);
if($denom != 0)
{
$sim=($inter_size)/$denom;
}
else
{
$sim=0;
}
}
}
else
{
if($vectors[$i]->Norm())
{
$sim=1;
}
else
{
$sim=0;
}
}
$value=sprintf $format, $sim;
if($value>$upper_format)
{
print STDERR "ERROR($0):
Floating point overflow.
Value <$value> can't be represented with format $format.\n";
exit 1;
}
print $value;
}
print "\n";
}
}
else
{
$line_num=1;
$line=<IN>;
if($line=~/keyfile/)
{
print "$line";
$line=<IN>;
$line_num++;
}
if($line=~/^\s*(\d+)\s+(\d+)\s+(\d+)\s*$/)
{
$rows=$1;
$cols=$2;
$nnz1=$3;
}
else
{
print STDERR "ERROR($0):
Line$line_num in Vector file <$infile> should show #rows #cols #nnz.\n";
exit;
}
$nnz=0;
while(<IN>)
{
$line_num++;
chomp;
s/^\s*//;
s/\s*$//;
s/\s+/ /;
$set=Set::Scalar->new;
@pairs=split;
for($i=0;$i<$#pairs;$i=$i+2)
{
$index=$pairs[$i];
if($index>$cols)
{
print STDERR "ERROR($0):
Index <$index> at line <$line_num> in VECTOR file <$infile>
exceeds #cols = <$cols> specified in the header line.\n";
exit;
}
$value=$pairs[$i+1];
if($value==0)
{
print STDERR "ERROR($0):
0 value found in sparse vector at line <$line_num> in VECTOR file
<$infile>.\n";
exit;
}
# add index to set
$set->insert($index);
$nnz++;
}
push @vectors,$set;
}
close IN;
if(scalar(@vectors) != $rows)
{
print STDERR "ERROR($0):
#rows = $rows specified in the header line of the VECTOR file <$infile>
does not match the actual #rows = " . scalar(@vectors) . " found in the file.\n";
exit;
}
if($nnz != $nnz1)
{
print STDERR "ERROR($0):
#nnz = $nnz1 specified in the header line of the VECTOR file <$infile>
does not match the actual #nnz = $nnz found in the file.\n";
exit;
}
# writing output to a TEMP file
$tempfile="tempfile" . time() . ".tmp";
if(-e $tempfile)
{
print STDERR "ERROR($0):
Temporary file <$tempfile> should not already exist.\n";
exit;
}
open(TEMP,">$tempfile") || die "Error($0):
Error(code=$!) in opening tempfile <$tempfile>.\n";
$nnz=0;
foreach $i (0..$#vectors)
{
foreach $j (0..$#vectors)
{
undef $simcount;
if($i==$j && (!$vectors[$i]->is_null))
{
$simcount=1;
}
else
{
$s=$vectors[$i];
$t=$vectors[$j];
# checking for null vectors
if((!$s->is_null) || (!$t->is_null))
{
$inter_st=$s->intersection($t);
# match coefficient
if($opt_measure =~ /mat/)
{
# match = |(s,t)|
$simcount=$inter_st->size;
}
# dice
elsif($opt_measure =~ /dic/)
{
$sum_st=$s->size+$t->size;
# dice = 2* |(s,t)|/(|s|+|t|)
$simcount=(2*$inter_st->size)/$sum_st;
}
# jaccard
elsif($opt_measure =~ /jac/)
{
$union_st=$s->union($t);
# jaccard = |(s,t)|/(|sUt|)
$simcount=($inter_st->size)/($union_st->size);
}
# overlap
elsif($opt_measure =~ /ove?r/)
{
# overlap = |(s,t)|/min(|s|,|t|)
$min_st=($s->size<$t->size) ? $s->size:$t->size;
if($min_st!=0)
{
$simcount=($inter_st->size)/$min_st;
}
}
# cosine
elsif($opt_measure =~ /cos/)
{
if((!$s->is_null) && (!$t->is_null))
{
$denom=sqrt(($s->size)*($t->size));
# cosine = |(s,t)|/sqrt(|s|*|t|)
$simcount=($inter_st->size)/$denom;
}
}
}
}
if(defined $simcount && $simcount != 0)
{
$value=sprintf $format, $simcount;
print TEMP "" . ($j+1) ." $value ";
$nnz++;
}
}
print TEMP "\n";
}
close TEMP;
open(TEMP,$tempfile) || die "Error($0):
Error(code=$!) in opening tempfile <$tempfile>.\n";
# printing to stdout
print "$rows $nnz\n";
while(<TEMP>)
{
print;
}
close TEMP;
unlink "$tempfile";
}
undef $opt_dense;
##############################################################################
# ==========================
# SUBROUTINE SECTION
# ==========================
#-----------------------------------------------------------------------------
#show minimal usage message
sub showminimal()
{
print "Usage: bitsimat.pl [OPTIONS] VECTOR";
print "\nTYPE bitsimat.pl --help for help\n";
}
#-----------------------------------------------------------------------------
#show help
sub showhelp()
{
print "Usage: bitsimat.pl [OPTIONS] VECTOR
Constructs a similarity matrix for given binary context vectors.
VECTOR
A file containing binary context vectors.
OPTIONS:
--dense
Should be selected if the given VECTORs are in dense format. By default,
sparse format is assumed.
--measure SIM
Specify a measure of similarity to be used for creating similarity
matrix. Possible option values for --measure are
match Match Coefficient
dice Dice Coefficient
overlap Overlap Coefficient
jaccard Jaccard Coefficient
cosine Cosine Coefficient
--format FORM
Specifies numeric format for representing output similarity values.
Default format for match coefficient is i8. For other measures,
default is f16.10.
--help
Displays this message.
--version
Displays the version information.
Type 'perldoc bitsimat.pl' to view detailed documentation of bitsimat.\n";
}
#------------------------------------------------------------------------------
#version information
sub showversion()
{
# print "bitsimat.pl - Version 0.04\n";
print '$Id';
print "\nConstructs a similarity matrix from binary context vectors\n";
# print "Copyright (c) 2002-2005, Amruta Purandare, Ted Pedersen.\n";
# print "Date of Last Update: 06/01/2004\n";
}
#############################################################################