#!/usr/local/bin/perl -w
=head1 NAME
order1vec.pl - Convert Senseval-2 format contexts into first order feature vectors in Cluto format
=head1 SYNOPSIS
order1vec.pl [OPTIONS] SVAL2 FEATURE_REGEX
Type C<order1vec.pl --help> for a quick summary of options
=head1 DESCRIPTION
Convert a context into a first order feature vector which shows how which features
occured in the contexts. The possible features are identified via Perl regular
expressions of the form created by L<nsp2regex.pl>.
=head1 INPUT
=head2 Required Arguments:
=head3 SVAL2
A tokenized, preprocessed and well formatted Senseval-2 instance file showing
instances whose context vectors are to be generated.
Context of each instance should be delimited within <context> and </context>
tags. It is required that each XML tag in the Senseval-2 file appears on a
separate line. Tokens should be space separated.
=head3 FEATURE_REGEX
A file containing Perl regular expressions for features as created by
nsp2regex.pl.
Sample FEATURE_REGEX files -
=over
=item 1.
/\s(<[^>]*>)*time(<[^>]*>)*\s/ @name = time
/\s(<[^>]*>)*task(<[^>]*>)*\s/ @name = task
/\s(<[^>]*>)*believe(<[^>]*>)*\s/ @name = believe
/\s(<[^>]*>)*life(<[^>]*>)*\s/ @name = life
/\s(<[^>]*>)*control(<[^>]*>)*\s/ @name = control
/\s(<[^>]*>)*words(<[^>]*>)*\s/ @name = words
/\s(<[^>]*>)*define(<[^>]*>)*\s/ @name = define
Explanation :
=over
=item 1.
The above FEATURE_REGEX file shows total 7 unigram features, single feature on each line.
=item 2.
Feature names are shown by "@name = FEATURE_NAME" that follows the actual
feature regex/s.
=item 3.
Tokens in the SVAL2 file should be separated by exactly one blank space. Any
non-tokens if exist should be put inside the angular brackets e.g. <item>, <sat>
=back
=item 2.
/\s(<[^>]*>)*personal(<[^>]*>)*\s(<[^>]*>\s)*(<[^>]*>)*computer(<[^>]*>)*\s/ @name = personal<>computer
/\s(<[^>]*>)*stock(<[^>]*>)*\s(<[^>]*>\s)*(<[^>]*>)*market(<[^>]*>)*\s/ @name = stock<>market
/\s(<[^>]*>)*electronic(<[^>]*>)*\s(<[^>]*>\s)*(<[^>]*>)*systems(<[^>]*>)*\s/ @name = electronic<>systems
/\s(<[^>]*>)*toll(<[^>]*>)*\s(<[^>]*>\s)*(<[^>]*>)*free(<[^>]*>)*\s/ @name = toll<>free
Shows a bigram feature file in which each feature includes two tokens
separated by single space or any number of non-token sequences in <> brackets.
More explanation on feature regex creation is given in the perldoc
of the nsp2regex program.
NOTE: Null columns are discarded i.e. the features which do not occur in any
of the contexts are dropped, and when --transpose option is specified (see
below for details), contexts that do not contain any features are dropped as
well.
=back
=head2 Optional Arguments:
=head3 --binary
By default, order1vec creates frequency context vectors that show how many
times each feature occurs in the context. --binary will instead create binary
context vectors where 1 indicates presence of feature and 0 indicates
absence of feature in the context.
=head3 --dense
By default, context vectors will have sparse format. --dense will display
output context vectors in dense format.
=head3 --rlabel RLABELFILE
Creates a RLABELFILE containing row labels for Cluto's --rlabelfile option.
Each line in the RLABELFILE shows an instance id of the instance whose context
vector is shown on the corresponding line on STDOUT.
Instance ids are extracted from the SVAL2 file by matching regex
/instance id\s*=\s*"IID"/
where 'IID' is an instance id of the <context> that follows this <instance> tag.
NOTE: When the --transpose option is specified, the contents of the RLABELFILE
and the CLABELFILE are swapped.
=head3 --rclass RCLASSFILE
Creates RCLASSFILE for Cluto's --rclassfile option. Each line in the
RCLASSFILE shows true sense id of the instance whose context vector appears on
the corresponding line on STDOUT.
Sense ids are extracted from the SVAL2 file by matching regex
/sense\s*id\s*=\s*"SID"\/>/
where SID shows a true sense tag of the instance whose IID is recently
extracted by matching
/instance id\s*=\s*"IID"/
This option cannot be specified when the --transpose option is specified.
=head3 --clabel CLABELFILE
Creates a CLABELFILE containing column labels for Cluto's --clabelfile option.
Each line in the CLABELFILE shows a feature representing corresponding column
of the output context vectors.
Features are extracted from the FEATURE_REGEX file by matching string
"@name = FEATURE" where FEATURE shows the feature name.
NOTE: When the --transpose option is specified, the contents of the RLABELFILE
and the CLABELFILE are swapped.
=head3 --transpose
Creates feature vectors instead of the default context vectors. The output
is a Latent Semantic Analysis style feature-by-context matrix, instead of the
default context-by-feature matrix that is native to SenseClusters. As a
result, the contents of the RLABELFILE and CLABELFILE are swapped, i.e. the
list of features is output to the RLABELFILE and the list of contexts is
output to the CLABELFILE.
=head3 --testregex TEST_REGEX
Creates a TEST_REGEX file containing only those regular expressions from the
input FEATURE_REGEX file that matched at least once in the input SVAL2 file.
This list can be different from the original list in FEATURE_REGEX when
different training data has been used to identify features or when a different
scope has been used for training and test data creation.
This option is required when the --transpose option is specified, in order to
ensure creation of a compatible TEST_REGEX file that corresponds to the
output of order1vec.pl in --transpose mode, so that both the output and the
TEST_REGEX can be directly passed as inputs to the order2vec.pl program.
=head3 --showkey
Displays the name of a system generated KEY file on the first line of STDOUT.
KEY file preserves the instance ids and sense tags of the instances in the
given SVAL2 file. This information will be automatically used by some of the
clustering and evaluation programs in SenseClusters that operate on purely
numeric instance formats. The option should be selected if the user is planning
to run SenseClusters' clustering code.
This option cannot be specified when the --transpose option is specified, as
no KEY file is generated in --transpose mode.
=head3 --target TARGETREGEX
Specifies a file containing Perl regex/s that define the target word. By
default, target.regex file is assumed to exist in the current directory.
=head3 --extarget
This will exclude the target word from features if the target word (as
specified by the --target option or default target.regex file) appears
in the FEATURE_REGEX file. In other words, the feature dimensions
of the output context vectors will not include the target word even if
target word is listed in the FEATURE_REGEX file.
=head2 Other Options :
=head3 --help
Displays this message.
=head3 --version
Displays the version information.
=head1 OUTPUT
=head2 KEY file
When --transpose is not specified, order1vec automatically generates a KEY
file that preserves the instance ids and sense tags of the SVAL2 instances.
Each line in the KEY file shows an instance id and one or more sense tags
of the instance represented by a context vector on the corresponding line
on STDOUT. i.e. the ith line in the KEY file shows the instance and sense ids
of the ith instance in the SVAL2 file or the ith vector displayed on stdout.
Sample KEY file looks like
<instance id="line-n.w8_020:7099:"/> <sense id="phone"/>
<instance id="line-n.w8_132:15431:"/> <sense id="phone"/>
<instance id="line-n.w8_027:13762:"/> <sense id="phone"/>
<instance id="line-n.w7_114:8965:"/> <sense id="text"/>
<instance id="line-n.w7_065:1553:"/> <sense id="product"/>
<instance id="line-n.w9_4:9437:"/> <sense id="product"/>
Or
<instance id="line-n.w8_020:7099:"/> <sense id="NOTAG"/>
<instance id="line-n.w7_111:238:"/> <sense id="NOTAG"/>
<instance id="line-n.w7_011:12078:"/> <sense id="NOTAG"/>
<instance id="line-n.w7_095:17576:"/> <sense id="NOTAG"/>
<instance id="line-n.w7_080:10129:"/> <sense id="NOTAG"/>
<instance id="line-n.w9_4:2358:"/> <sense id="NOTAG"/>
when the sense ids of instances are not available in the input SVAL2 file.
Or
<instance id="hard-a.sjm-180_1:"/> <sense id="HARD1"/> <sense id="HARD2"/>
<instance id="hard-a.br-l15:"/> <sense id="HARD1"/>
<instance id="hard-a.sjm-242_12:"/> <sense id="HARD2"/>
<instance id="hard-a.sjm-070_4:"/> <sense id="HARD1"/> <sense id="HARD3"/>
<instance id="hard-a.sjm-168_4:"/> <sense id="HARD3"/>
when some instances have multiple sense tags.
=head2 Context Vectors on STDOUT (when --transpose is NOT specified)
=head3 Sparse Format (SenseClusters Native Representation)
By default (unless --dense is specified), output vectors will be created in
sparse format.
The first line on stdout will show 3 numbers separated by blanks as
N M NNZ
where
N = Number of instances in SVAL2 file
M = Number of features from the FEATURE_REGEX file that were found at least once in the SVAL2 file
NNZ = Total number of non-zero entries in all sparse vectors
Each line thereafter shows a single sparse context vector on each line. In
short, every ith line after the 1st line shows the context vector of the
i'th instance in the given SVAL2 file.
Each sparse vector is a list of pairs of numbers separated by space such
that the first number in a pair is the index of a non-zero value in the
vector and the second number is a non-zero value itself corresponding to
that index.
=head4 Sample Sparse Output
12 18 31
1 1 2 1
1 1 2 2 3 2 4 1
4 1
5 1 6 2
5 2 6 3 7 1 8 2 9 1
9 1
7 1
8 1 10 1
4 2 11 3 12 2 13 4 14 1
15 1
14 1 15 1
3 1 8 1 16 4 17 4 18 4
Note that,
=over
=item 1.
First Line shows that there are total 12 sparse vectors, represented
using total 18 features, and total 31 non-zero values.
=item 2.
Each vector (all lines except the 1st line) is a list of 'index value'
pairs separated by space. e.g. 1st vector (line 2) shows that features at
indices 1 and 2 appear once in the 1st instance. 2nd vector (3rd line)
shows that features at indices 1 and 4 appear once while those at indices
2 and 3 appear twice each in the 2nd instance.
Feature indices start from 1, to be consistent with Cluto's matrix format
standard.
=item 3.
If --binary is set ON, all non-zero values will have value 1 showing
mere presence of feature in the context rather than the frequency counts.
=back
=head3 Dense Format (SenseClusters Native Representation)
When --dense option is selected, order1vec will create output in dense
vector format.
First line on STDOUT will show exactly two numbers separated by space.
The first number indicates the number of vectors and the second number
indicates the number of features (dimensions of the context vectors).
Each line thereafter shows a single context vector such that ith line after
the 1st line shows the context vector of the ith instance in the SVAL2 file.
=head4 Sample Dense Output
12 18
1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 2 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 1 2 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 2 3 1 2 1 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0
0 0 0 2 0 0 0 0 0 0 3 2 4 1 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0
0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 4 4 4
shows same context vectors as shown in Sample Sparse Format but in dense
format.
Note that
=over
=item 1. All vectors have same length and is same as the number of features
(here 18) from the given FEATURE_REGEX file that matched at least once in
the SVAL2 file.
=item 2. When --binary is ON, value at column j in a vector will be 1
for every feature j that is found at least once in the context.
=item 3. When --binary is not used, value at column j in a vector shows the
number of times the jth feature is found in the context.
=item 4. A 0 at column j of any vector shows that the jth feature in the
FEATURE_REGEX file doesn't appear in that context.
=back
When --showkey is selected, output will be exactly same as described above
except the first line will show the KEY file name that is required by
the SenseClusters' programs.
e.g.
<keyfile name="KEY"/>
12 18 31
1 1 2 1
1 1 2 2 3 2 4 1
4 1
5 1 6 2
5 2 6 3 7 1 8 2 9 1
9 1
7 1
8 1 10 1
4 2 11 3 12 2 13 4 14 1
15 1
14 1 15 1
3 1 8 1 16 4 17 4 18 4
Shows same vectors as shown in Sample Sparse Output when --showkey is ON.
Value of KEY shown in the <keyfile> tag will be the system generated KEY
file name.
=head2 Features Vectors on STDOUT (when --transpose IS specified)
Note that --testregex TEST_REGEX is a required option when --transpose is
specified.
=head3 Sparse Format (Latent Semantic Analysis Representation)
By default (unless --dense is specified), output vectors will be created in
sparse format.
The first line on stdout will show 3 numbers separated by blanks as
N M NNZ
where
N = Number of features from the FEATURE_REGEX file that were found at least once in the SVAL2 file
M = Number of instances in SVAL2 file, for which at least one feature was identified
NNZ = Total number of non-zero entries in all sparse vectors
Each line thereafter shows a single sparse feature vector on each line. In
short, every ith line after the 1st line shows the feature vector of the
i'th feature in the created TEST_REGEX file.
Each sparse vector is a list of pairs of numbers separated by space such
that the first number in a pair is the index of a non-zero value in the
vector and the second number is a non-zero value itself corresponding to
that index.
=head4 Sample Sparse Output (Transpose of the Context Vectors output above)
18 12 31
1 1 2 1
1 1 2 2
2 2 12 1
2 1 3 1 9 2
4 1 5 2
4 2 5 3
5 1 7 1
5 2 8 1 12 1
5 1 6 1
8 1
9 3
9 2
9 4
9 1 11 1
10 1 11 1
12 4
12 4
12 4
Note that,
=over
=item 1.
First Line shows that there are total 18 sparse feature vectors, represented
using total 12 contexts, and total 31 non-zero values.
=item 2.
Each vector (all lines except the 1st line) is a list of 'index value'
pairs separated by space. e.g. 1st vector (line 2) shows that contexts at
indices 1 and 2 contain the 1st feature once each. 3rd vector (4th line)
shows that context at index 2 contains the 3rd feature 2 times and the context
at index 12 contains the 3rd feature once.
Context indices start from 1, to be consistent with Cluto's matrix format
standard.
=item 3.
If --binary is set ON, all non-zero values will have value 1 showing
mere presence of feature in the context rather than the frequency counts.
=back
=head3 Dense Format (Latent Semantic Analysis Representation)
When --dense option is selected, order1vec will create output in dense
vector format.
First line on STDOUT will show exactly two numbers separated by space.
The first number indicates the number of vectors and the second number
indicates the number of contexts (dimensions of the feature vectors).
Each line thereafter shows a single feature vector such that ith line after
the 1st line shows the context vector of the ith instance in the SVAL2 file.
=head4 Sample Dense Output (Transpose of the dense output of Context Vectors
above)
18 12
1 1 0 0 0 0 0 0 0 0 0 0
1 2 0 0 0 0 0 0 0 0 0 0
0 2 0 0 0 0 0 0 0 0 0 1
0 1 1 0 0 0 0 0 2 0 0 0
0 0 0 1 2 0 0 0 0 0 0 0
0 0 0 2 3 0 0 0 0 0 0 0
0 0 0 0 1 0 1 0 0 0 0 0
0 0 0 0 2 0 0 1 0 0 0 1
0 0 0 0 1 1 0 0 0 0 0 0
0 0 0 0 0 0 0 1 0 0 0 0
0 0 0 0 0 0 0 0 3 0 0 0
0 0 0 0 0 0 0 0 2 0 0 0
0 0 0 0 0 0 0 0 4 0 0 0
0 0 0 0 0 0 0 0 1 0 1 0
0 0 0 0 0 0 0 0 0 1 1 0
0 0 0 0 0 0 0 0 0 0 0 4
0 0 0 0 0 0 0 0 0 0 0 4
0 0 0 0 0 0 0 0 0 0 0 4
shows same context vectors as shown in Sample Sparse Format but in dense
format.
Note that
=over
=item 1. All vectors have same length and is same as the number of contexts
(here 12) from the given SVAL2 file that contained at least one feature
from the TEST_REGEX file.
=item 2. When --binary is ON, value at column j in a vector will be 1
for every context j that contains the feature at least once.
=item 3. When --binary is not used, value at column j in a vector shows the
number of times the feature is found in the jth context.
=item 4. A 0 at column j of any vector shows that the feature
doesn't appear in the jth context.
=back
=head1 SYSTEM REQUIREMENTS
=over
=item PDL - L<http://search.cpan.org/dist/PDL/>
=item Math::SparseVector - L<http://search.cpan.org/dist/Math-SparseVector/>
=back
=head1 BUGS
This program behaves unpredictably if the input file is not in
Senseval2 format. No error message is given, and it will produce
numeric output, but of course it has no real meaning. A check
should be added to make sure the input file is in Senseval2 format.
=head1 AUTHOR
Ted Pedersen, University of Minnesota, Duluth
tpederse at d.umn.edu
Amruta Purandare, University of Pittsburgh
Anagha Kulkarni, Carnegie-Mellon University
Mahesh Joshi, Carnegie-Mellon University
=head1 COPYRIGHT
Copyright (c) 2002-2008, Ted Pedersen, Amruta Purandare, Anagha Kulkarni, Mahesh Joshi
This program is free software; you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free Software
Foundation; either version 2 of the License, or (at your option) any later
version.
This program is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with
this program; if not, write to
The Free Software Foundation, Inc.,
59 Temple Place - Suite 330,
Boston, MA 02111-1307, USA.
=cut
###############################################################################
# ==============================
# THE CODE STARTS HERE
# ==============================
#$0 contains the program name along with
#the complete path. Extract just the program
#name and use in error messages
$0=~s/.*\/(.+)/$1/;
# PDL is used for dense vectors
use PDL;
use PDL::NiceSlice;
use PDL::Primitive;
# Math::SparseVector is used for sparse vectors
use Math::SparseVector;
# Math::SparseMatrix for sparse matrix transpose
# functionality
use Math::SparseMatrix;
###############################################################################
# ================================
# COMMAND LINE OPTIONS AND USAGE
# ================================
# command line options
use Getopt::Long;
GetOptions ("help","version","showkey","rlabel=s","rclass=s","clabel=s","binary","target=s","extarget","dense", "transpose", "testregex=s");
# show help option
if(defined $opt_help)
{
$opt_help=1;
&showhelp();
exit;
}
# show version information
if(defined $opt_version)
{
$opt_version=1;
&showversion();
exit;
}
# show minimal usage message if fewer arguments
if($#ARGV<1)
{
&showminimal();
exit 1;
}
if (!defined $opt_transpose) {
$opt_transpose = 0;
}
if ($opt_transpose != 0 && !defined $opt_testregex) {
print STDERR "ERROR($0):
--transpose cannot be specified without specifying --testregex
TEST_REGEX.\n";
exit 1;
}
if ($opt_transpose != 0 && defined $opt_rclass) {
print STDERR "ERROR($0):
--rclass cannot be specified when using --transpose option.\n";
exit 1;
}
if ($opt_transpose != 0 && defined $opt_showkey) {
print STDERR "ERROR($0):
--showkey cannot be specified when using --transpose option.\n";
exit 1;
}
#############################################################################
# ================================
# INITIALIZATION AND INPUT
# ================================
# -------------
# SVAL2 file
# -------------
if(!defined $ARGV[0])
{
print STDERR "ERROR($0):
Please specify the SVAL2 file.\n";
exit 1;
}
#accept the SVAL2 file name
$infile=$ARGV[0];
if(!-e $infile)
{
print STDERR "ERROR($0):
SVAL2 file <$infile> doesn't exist...\n";
exit 1;
}
open(IN,$infile) || die "Error($0):
Error(code=$!) in opening the SVAL2 file <$infile>\n";
# -------------------
# Feature regex file
# -------------------
if(!defined $ARGV[1])
{
print STDERR "ERROR($0):
Please specify the Feature Regex file.\n";
exit 1;
}
#accept the feature file name
$featfile=$ARGV[1];
if(!-e $featfile)
{
print STDERR "ERROR($0):
Feature Regex file <$featfile> doesn't exist...\n";
exit 1;
}
open(FEAT,$featfile) || die "Error($0):
Error(code=$!) in opening Feature Regex file <$featfile>\n";
# -------------------
# Target Word regex
# -------------------
if(defined $opt_extarget)
{
#file containing regex/s for target word
if(defined $opt_target)
{
$target_file=$opt_target;
if(!(-e $target_file))
{
print STDERR "ERROR($0):
Target regex file <$target_file> doesn't exist.\n";
exit 1;
}
}
else
{
$target_file="target.regex";
if(!-e $target_file)
{
print STDERR "ERROR($0):
Please copy the target.regex file into the current directory or specify
the target regex file via --target option.\n";
exit 1;
}
}
# ------------------------
# creating target regex
# ------------------------
open(REG,$target_file) || die "ERROR($0):
Error(error code=$!) in opening the target regex file <$target_file>\n";
while(<REG>)
{
chomp;
s/^\s+//g;
s/\s+$//g;
if(/^\s*$/)
{
next;
}
if(/^\//)
{
s/^\///;
}
else
{
print STDERR "ERROR($0):
Regular Expression <$_> should start with '/'\n";
exit 1;
}
if(/\/$/)
{
s/\/$//;
}
else
{
print STDERR "ERROR($0):
Regular Expression <$_> should end with '/'\n";
exit 1;
}
$target.="(".$_.")|";
}
if(!defined $target)
{
print STDERR "ERROR($0):
No valid Perl regular expression found in the target regex file
<$target_file>\n";
exit 1;
}
else
{
chop $target;
}
}
##############################################################################
# =======================
# Read Feature Regex/s
# =======================
$line_num=0;
while(<FEAT>)
{
$line_num++;
chomp;
s/^\s*//;
s/\s*$//;
if(/(.*)\s*\@name\s*=\s*(.*)/)
{
$feature_regex=$1;
$feature=$2;
# removing leading and lagging blank spaces
$feature_regex=~s/^\s*//;
$feature_regex=~s/\s*$//;
$feature=~s/^\s*//;
$feature=~s/\s*$//;
# removing the starting and ending slashes //
if($feature_regex=~/^\//) { $feature_regex=~s/^\///; }
else
{
print STDERR "ERROR($0):
Feature regex <$feature_regex>
at line <$line_num> in Feature Regex file <$featfile> should start
with '/'\n";
exit 1;
}
if($feature_regex=~/\/$/) { $feature_regex=~s/\/$//; }
else
{
print STDERR "ERROR($0):
Feature regex <$feature_regex>
at line <$line_num> in Feature Regex file <$featfile> should end
with '/'\n";
exit 1;
}
# target word is a feature only when --extarget is not
# selected or feature regex doesn't match with target
# regex
if(!defined $opt_extarget || $feature !~ /^$target$/)
{
push @features,$feature_regex;
# we require the @name part of the nsp2regex output if column labels
# or test regexes are requested
if(defined $opt_clabel || defined $opt_testregex)
{
push @clabels, $feature;
}
}
}
else
{
print STDERR "ERROR($0):
Line <$line_num> in Feature Regex file <$featfile> has an unexpected
format.\n";
exit 1;
}
}
#output vector will have
#columns = #features
$cols=scalar(@features);
##############################################################################
# =================================================
# CREATING CONTEXT VECTORS
# =================================================
# context vectors are temporarily written into a
# TEMP file
# if the program finishes successfully, this TEMP file
# is printed to STDOUT and is deleted
# otherwise TEMP file is retained and stores the partial
# program output
$tempfile="tempfile" . time() . ".order1vec";
if(-e $tempfile)
{
print STDERR "ERROR($0):
Temporary file <$tempfile> should not already exist.\n";
exit 1;
}
open(TEMP,">$tempfile") || die "ERROR($0):
Error(code=$!) in opening internal temporary file <$tempfile>\n";
# reading the SVAL2 file
$line_num=0;
if(defined $opt_dense)
{
# use PDL
$context_vector=zeroes($cols);
# PDL matrices are column major. Initially create a matrix
# with number of columns equal to number of features and
# number of rows = 1, filled with zeroes
$orig_matrix = zeroes($cols, 1);
}
else
{
# use Math::SparseVector module
$context_vector=Math::SparseVector->new;
$nnz=0;
}
$context_count = 0;
while(<IN>)
{
$line_num++;
if(/instance id\s*=\s*\"([^"]+)\"/)
{
$instance=$1;
if(defined $instance_ids{$instance})
{
print STDERR "ERROR($0):
Instance Id <$instance> is repeated in the SVAL2 file <$infile>\n";
exit 1;
}
push @instances,$instance;
$instance_ids{$instance}=1;
}
if(/<\/instance>/)
{
undef $instance;
}
if(/sense\s*id\s*=\s*\"([^"]+)\"/)
{
# no <instance> open
if(!defined $instance)
{
print STDERR "ERROR($0):
Missing <instance> tag before the <sense> tag at line <$line_num>
in SVAL2 file <$infile>\n";
exit 1;
}
$sense=$1;
if(defined $key_table{$instance}{$sense})
{
print STDERR "ERROR($0):
<instance-id, sense-tag> pair <$instance, $sense> is repeated in the
SVAL2 file <$infile>\n";
exit 1;
}
$key_table{$instance}{$sense}=1;
}
if(/<\/context>/)
{
undef $data_start;
# add dense vector to orig_matrix
if(defined $opt_dense)
{
# initially resize the original matrix to new number of contexts
# (actual increment in count is done later, since we use the current
# value of $context_count for indexing the orig_matrix)
$orig_matrix->reshape($cols, $context_count + 1);
# get the vector for the current context
$rowvec = $orig_matrix->slice(":,($context_count)");
# update the vector for the context in the the orig_matrix
$rowvec .= $context_vector;
}
# printing context vector to TEMP file
# sparse vector
else
{
foreach $key ($context_vector->keys)
{
print TEMP "$key " . $context_vector->get($key) . " ";
$nnz++;
}
print TEMP "\n";
}
# increment the number of contexts
$context_count++;
}
# contextual data
if(defined $data_start)
{
# nsp2regex features have format
# /\sFEATURE\s/ which requires a space
# on each side of the token
s/^(\S)/ $1/;
s/(\S)$/$1 /;
# ---------------------------------------------------
# the logic of matching feature regex/s is borrowed
# from the xml2arff.pl program from the SenseTools
# package by Satanjeev Banerjee and Ted Pedersen
# ---------------------------------------------------
foreach $index (0..$#features)
{
$feature_regex=$features[$index];
if(defined $opt_binary)
{
# match or not
if(/$feature_regex/)
{
if(defined $opt_dense)
{
$context_vector->set($index,1);
}
else
{
$context_vector->set($index+1,1);
}
}
}
else
{
# number of matches
while(/$feature_regex/g)
{
if(defined $opt_dense)
{
$context_vector($index)++;
}
else
{
$context_vector->incr($index+1);
}
}
}
}
}
# beginning of the context
if(/<context>/)
{
# no <instance> open
if(!defined $instance)
{
print STDERR "ERROR($0):
Missing <instance> tag before the <context> tag at line <$line_num>
in SVAL2 file <$infile>\n";
exit 1;
}
# no sense tag for this instance
if(!defined $key_table{$instance})
{
$sense="NOTAG";
$key_table{$instance}{$sense}=1;
}
$data_start=1;
if(defined $opt_dense)
{
$context_vector->inplace->zeroes;
}
else
{
$context_vector->free;
}
}
}
# if we are in dense mode, then TEMP file is
# created here
if(defined $opt_dense) {
if ($opt_transpose != 0) {
# create feature-by-context dense TEMP file
$transpose_matrix = transpose($orig_matrix);
for ($i = 0; $i < $cols; $i++) {
for ($j = 0; $j < $context_count; $j++) {
print TEMP $transpose_matrix->at($j,$i) . " ";
}
print TEMP "\n";
}
} else {
# create context-by-feature dense TEMP file
for ($i = 0; $i < $context_count; $i++) {
for ($j = 0; $j < $cols; $j++) {
print TEMP $orig_matrix->at($j,$i) . " ";
}
print TEMP "\n";
}
}
}
close TEMP;
undef $opt_extarget;
# added by AKK on 02/28/2005
# work-around for eliminating the columns (i.e. the features) which
# dont have any non-zero row entry i.e. the features that do not occur
# in any of the contexts.
my $mod_tempfile = "mod_tempfile" . time() . ".order1vec";
my @col = ();
if(!defined $opt_dense)
{
open(TEMP,$tempfile) || die "ERROR($0):
Error(code=$!) in opening internal temporary file <$tempfile>\n";
# go through each row of the file till either of the following occurs:
# 1. we encounter atleast one entry for each column i.e. for each feature
# 2. we reach end of the file
my $flag = 0;
for($i=1;$i<=$cols;$i++)
{
$col[$i] = 0;
}
while(<TEMP>)
{
@elem = split(/\s+/);
# mark the column for which an entry was found
for($i=0;$i<=$#elem;$i=$i+2)
{
$col[$elem[$i]] = 1;
}
# check if an entry found for each column
$flag = 0;
for($i=1;$i<=$cols;$i++)
{
if($col[$i] == 0)
{
$flag = 1;
last;
}
}
# if an entry found for each column
# then exit the while loop.
# this situation suggests that we dont have any
# no entry column in this input data.
if($flag == 0)
{
last;
}
}
close TEMP;
# ON(1) state of flag variable suggests that the input matrix
# has one or more columns with no non-zero entries.
# Thus we need to remove these columns and adjust the column
# indices for all the columns following the removed column.
my %hash_col = ();
if($flag == 1)
{
# create the new column indices
$cnt = 1;
for($i=1;$i<=$#col;$i++)
{
# for the remaining columns
# adjust the column indices
if($col[$i] == 1)
{
$hash_col{$i} = $cnt;
$cnt++;
}
# when column dropped decrease
# total # of cols
else
{
$cols--;
}
}
# write the modified TEMP file to another temp file with the changed column indices.
open(TEMP,$tempfile) || die "ERROR($0):
Error(code=$!) in opening internal temporary file <$tempfile>\n";
open(MOD,">$mod_tempfile") || die "ERROR($0):
Error(code=$!) in opening internal temporary file <$mod_tempfile>\n";
while(<TEMP>)
{
@elem = split(/\s+/);
# print the column index and the cell value pairs for the context
for($i=0;$i<=$#elem;$i=$i+2)
{
print MOD $hash_col{$elem[$i]} . " " . $elem[$i+1] . " ";
}
print MOD "\n";
}
close TEMP;
close MOD;
}
}
# end by AKK on 02/28/2005
# for sparse mode, if --transpose is specified, we need to use
# Math::SpaarseMatrix for the transpose functionality
if (!defined $opt_dense && $opt_transpose != 0) {
# first prepare a temporary file for transpose function input.
# we need to eliminate any empty contexts from the original
# output of order1 represenataion
$transpose_in = "transpose_in" . time() . "order1vec";
# process the temporary file created above, to eliminate empty
# contexts, and create an input file for transposing
if(-e $mod_tempfile)
{
open(TEMP,$mod_tempfile) || die "ERROR($0):
Error(code=$!) in opening internal temporary file <$tempfile>\n";
}
else
{
open(TEMP,$tempfile) || die "ERROR($0):
Error(code=$!) in opening internal temporary file <$tempfile>\n";
}
open(TRANS_IN, "> $transpose_in") or die "ERROR($0):
Error(code=$!) while creating temporary input file <$transpose_in>
for transposing.\n";
# $linetowrite contains the content of output except
# blank lines representing empty contexts
$linetowrite = "";
$rows = @instances;
# in this process, instances might reduce, so we should create a
# new array of only the remaining instances. initially, just create
# an array containing all 1's indicating that no instances are dropped
for ($i = 0; $i < @instances; $i++) {
$nonempty_instances[$i] = 1;
}
# use index to determine which instances to ignore
$index = 0;
while ($line = <TEMP>) {
chomp $line;
if ($line ne "") {
$linetowrite .= "$line\n";
} else {
# do no print the empty line and reduce the row count
$rows--;
# put a 0 in the nonempty_instances array, indicating that the
# instance at this index in the @instances array is empty
$nonempty_instances[$index] = 0;
}
$index++;
}
# write the reduced number of contexts back, without empty lines
print TRANS_IN "$rows $cols $nnz\n";
print TRANS_IN $linetowrite;
close TEMP;
close TRANS_IN;
$transpose_sparsematrix = Math::SparseMatrix->createTransposeFromFile(
$transpose_in);
# create the transpose output
$transpose_out = "transpose_out" . time() . "order1vec";
$transpose_sparsematrix->writeToFile($transpose_out);
}
###########################################################################
# =========================
# OUTPUT SECTION
# =========================
# =====================
# Creating KEY file
# =====================
# DO NOT GENERATE A KEY FILE IN --transpose MODE
# KEY file is automatically created by the program
# and preserves the instance ids and sense tags of the
# SVAL-2 instances
if ($opt_transpose == 0) {
$keyfile="keyfile" . time() . ".key";
if(-e $keyfile)
{
print STDERR "ERROR($0):
System generated KEY file <$keyfile> should not already exist.\n";
exit 1;
}
open(KEY,">$keyfile") || die "ERROR($0):
Error(code=$!) in opening system generated KEY file <$keyfile>\n";
foreach $instance (@instances)
{
print KEY "<instance id=\"$instance\"\/> ";
foreach $sense (sort keys %{$key_table{$instance}})
{
print KEY "<sense id=\"$sense\"\/> ";
}
print KEY "\n";
}
close KEY;
}
# =========================
# Printing output vectors
# =========================
# printing KEY name when --showkey is ON
if(defined $opt_showkey)
{
print "<keyfile name=\"$keyfile\"\/>\n";
undef $opt_showkey;
}
# first line for sparse vectors shows
# N M NNZ
# while the first line in dense vectors shows
# N M
# where N = number of vectors = Number of instances in SVAL2
# M = number of dimensions = Number of features in FEATURE
# NNZ = total number of non-zero entries in sparse vectors
# Additionally, we also need to consider if the the --transpose was on,
# in which case N and M are swapped. But this file is already created
# in the Math::SparseMatrix transpose code called above. So in that case
# we simply open that file and print it at STDOUT
if (!defined $opt_dense && $opt_transpose != 0) {
# transpose and sparse
open (TRANS_OUT, "< $transpose_out") or die "ERROR($0):
Error (code=$!) while opening internal file <$transpose_out>\n";
while (<TRANS_OUT>) {
print;
}
close TRANS_OUT;
} else {
if ($opt_transpose != 0) {
# transpose and dense (since transpose and sparse would have
# been the "if" condition above)
print "$cols " . scalar(@instances);
} else {
# non-transpose and (sparse/dense)
print scalar(@instances) . " $cols";
}
if(!defined $opt_dense)
{
print " $nnz";
}
print "\n";
# this is followed by the actual context vectors
if(-e $mod_tempfile)
{
open(TEMP,$mod_tempfile) || die "ERROR($0):
Error(code=$!) in opening internal temporary file <$tempfile>\n";
}
else
{
open(TEMP,$tempfile) || die "ERROR($0):
Error(code=$!) in opening internal temporary file <$tempfile>\n";
}
while(<TEMP>)
{
print;
}
close TEMP;
}
# deleting TEMP as the program is successfully finished
unlink $tempfile;
if(-e $mod_tempfile)
{
unlink $mod_tempfile;
}
if (defined $transpose_in && -e $transpose_in)
{
unlink $transpose_in;
}
if (defined $transpose_out && -e $transpose_out)
{
unlink $transpose_out;
}
undef $opt_binary;
# ==========================
# Creating Cluto files
# ==========================
# REMEMBER: if --transpose is specified, then row and column labels get
# interchanged
# writing rlabel file
if(defined $opt_rlabel)
{
$rlabel=$opt_rlabel;
if(-e $rlabel)
{
print STDERR "Warning($0):
Row label file <$rlabel> already exists, overwrite (y/n)? ";
$ans=<STDIN>;
}
if(!-e $rlabel || $ans=~/Y|y/)
{
open(RLAB,">$rlabel") || die "Error($0):
Error(code=$!) in opening the Row Label file <$rlabel>\n";
if ($opt_transpose == 0) {
# printing rlabels
foreach $instance (@instances)
{
print RLAB "$instance\n";
}
} else {
# printing column labels as row labels during transpose
if (!defined $opt_dense) {
# in sparse mode, we need to check for dropping
# column labels for empty columns
for ($index=1; $index <= @clabels; $index++)
{
if ($col[$index] > 0) {
print RLAB $clabels[$index-1] . "\n";
}
}
} else {
# in dense mode, output all column labels
for ($index=1; $index <= @clabels; $index++)
{
print RLAB $clabels[$index-1] . "\n";
}
}
}
close RLAB;
}
}
# writing rclass file
if(defined $opt_rclass)
{
$rclass=$opt_rclass;
if(-e $rclass)
{
print STDERR "Warning($0):
Class label file <$rclass> already exists, overwrite (y/n)? ";
$ans=<STDIN>;
}
if(!-e $rclass || $ans=~/Y|y/)
{
open(RCL,">$rclass") || die "Error($0):
Error(code=$!) in opening the Class Label file <$rclass>\n";
# printing rclasses
foreach $instance (@instances)
{
@senses=sort keys %{$key_table{$instance}};
if(scalar(@senses) > 1)
{
print STDERR "ERROR($0):
Instance <$instance> can not have multiple senses in RCLASSFILE.\n";
exit 1;
}
print RCL "$senses[0]\n";
}
close RCL;
}
}
# writing clabel file
if(defined $opt_clabel)
{
$clabel=$opt_clabel;
if(-e $clabel)
{
print STDERR "Warning($0):
Column label file <$clabel> already exists, overwrite (y/n)? ";
$ans=<STDIN>;
}
if(!-e $clabel || $ans=~/Y|y/)
{
open(CLAB,">$clabel") || die "Error($0):
Error(code=$!) in opening the Column Label file <$clabel>\n";
if ($opt_transpose == 0) {
# printing column labels
if (!defined $opt_dense) {
# in sparse mode, we need to check for dropping
# column labels for empty columns
for ($index=1; $index <= @clabels; $index++)
{
if ($col[$index] > 0) {
print CLAB $clabels[$index-1] . "\n";
}
}
} else {
# in dense mode, output all column labels
for ($index=1; $index <= @clabels; $index++)
{
print CLAB $clabels[$index-1] . "\n";
}
}
} else {
# printing rlabels as column labels during transpose
# check for empty contexts, and skip them in the output
if (!defined $opt_dense) {
# number of instances might reduce in sparse representation
# in --transpose option
for ($i = 0; $i < @instances; $i++)
{
if ($nonempty_instances[$i] == 1) {
print CLAB "$instances[$i]\n";
}
}
} else {
for ($i = 0; $i < @instances; $i++)
{
print CLAB "$instances[$i]\n";
}
}
}
close CLAB;
}
}
# writing testregex file
if(defined $opt_testregex)
{
$testregex=$opt_testregex;
if(-e $testregex)
{
print STDERR "Warning($0):
Test Regex file <$testregex> already exists, overwrite (y/n)? ";
$ans=<STDIN>;
}
if(!-e $testregex || $ans=~/Y|y/)
{
open(TESTREGEX,">$testregex") || die "Error($0):
Error(code=$!) in opening the Test Regex file <$testregex>\n";
# printing regexes
if (!defined $opt_dense) {
# in sparse mode, we need to check for dropping
# regexes for empty columns
for ($index=1; $index <= @features; $index++)
{
if ($col[$index] > 0) {
print TESTREGEX "/$features[$index-1]/" . " \@name=$clabels[$index-1]\n";
}
}
} else {
# in dense mode, output all column labels
for ($index=1; $index <= @features; $index++)
{
print TESTREGEX "/$features[$index-1]/" . " \@name=$clabels[$index-1]\n";
}
}
close TESTREGEX;
}
}
##############################################################################
# ==========================
# SUBROUTINE SECTION
# ==========================
#-----------------------------------------------------------------------------
#show minimal usage message
sub showminimal()
{
print "Usage: order1vec.pl [OPTIONS] SVAL2 FEATURE_REGEX";
print "\nTYPE order1vec.pl --help for help\n";
}
#-----------------------------------------------------------------------------
#show help
sub showhelp()
{
print "Usage: order1vec.pl [OPTIONS] SVAL2 FEATURE_REGEX
Displays the first order context vectors of the instances in the given SVAL2
file.
SVAL2
A tokenized, preprocessed and well formatted Senseval-2 instance file.
FEATURE_REGEX
A file containing Perl regular expressions for features as created
by nsp2regex.pl.
OPTIONS:
--binary
Displays binary context vectors that show mere presence or absence of
features in the contexts. By default, frequency vectors are displayed.
--dense
Displays dense context vectors. By default, context vectors will have
sparse format.
--rlabel RLABELFILE
Writes row labels (instance ids) to the RLABELFILE which can be given
to vcluster's --rlabelfile option.
--rclass RCLASSFILE
Writes sense ids to the RCLASSFILE which can be given to vcluster's
--rclassfile option.
This option cannot be specified when --transpose is specified.
--clabel CLABELFILE
Writes column labels (features) to the CLABELFILE which can be given
to vcluster's --clabelfile option.
--transpose
Creates feature vectors instead of the default context vectors. The
output is a Latent Semantic Analysis style feature-by-context matrix,
instead of the default context-by-feature matrix that is native to
SenseClusters. As a result, the contents of the RLABELFILE and
CLABELFILE are swapped, i.e. the list of features is output to the
RLABELFILE and the list of contexts is output to the CLABELFILE.
--testregex TEST_REGEX
Creates a TEST_REGEX file containing only those regular expressions
from the input FEATURE_REGEX file that matched at least once in the
input SVAL2 file. This list can be different from the original list
in FEATURE_REGEX when different training data has been used to
identify features or when a different scope has been used for
training and test data creation.
This option is required when the --transpose option is specified.
--showkey
Displays the system generated KEY file name on the first line.
This option cannot be specified when --transpose is specified.
--target TARGET_REGEX
Specify a file containing Perl regex/s that define the target word
in SVAL2. By default, target.regex is assumed to exist in current
directory.
--extarget
Excludes the target word from features if the target word as
specified by --target or default target.regex, is listed in the
FEATURE_REGEX file.
Other Options:
--help
Displays this message.
--version
Displays the version information.
Type 'perldoc order1vec.pl' to view detailed documentation of order1vec.\n";
}
#------------------------------------------------------------------------------
#version information
sub showversion()
{
print '$Id: order1vec.pl,v 1.48 2008/03/30 04:40:58 tpederse Exp $';
print "\nConvert Senseval-2 contexts into first order feature vectors\n";
# print "\nCopyright (c) 2002-2006, Ted Pedersen, Amruta Purandare, Anagha Kulkarni, & Mahesh Joshi\n";
# print "order1vec.pl - Version 0.08\n";
# print "Displays the first order context vectors.\n";
# print "Date of Last Update: 03/04/2005\n";
}
#############################################################################