NAME

order1vec.pl - Convert Senseval-2 format contexts into first order feature vectors in Cluto format

SYNOPSIS

 order1vec.pl [OPTIONS] SVAL2 FEATURE_REGEX

Type order1vec.pl --help for a quick summary of options.

DESCRIPTION

Converts each context into a first order feature vector that shows which features occurred in that context, and how often. The possible features are identified via Perl regular expressions of the form created by nsp2regex.pl.

INPUT

Required Arguments:

SVAL2

A tokenized, preprocessed and well-formatted Senseval-2 instance file containing the instances whose context vectors are to be generated.

The context of each instance must be delimited by <context> and </context> tags. Each XML tag in the Senseval-2 file must appear on a separate line. Tokens must be space separated.

FEATURE_REGEX

A file containing Perl regular expressions for features as created by nsp2regex.pl.

Sample FEATURE_REGEX files:

  1.  /\s(<[^>]*>)*time(<[^>]*>)*\s/ @name = time
     /\s(<[^>]*>)*task(<[^>]*>)*\s/ @name = task
     /\s(<[^>]*>)*believe(<[^>]*>)*\s/ @name = believe
     /\s(<[^>]*>)*life(<[^>]*>)*\s/ @name = life
     /\s(<[^>]*>)*control(<[^>]*>)*\s/ @name = control
     /\s(<[^>]*>)*words(<[^>]*>)*\s/ @name = words
     /\s(<[^>]*>)*define(<[^>]*>)*\s/ @name = define

    Explanation:

    1. The above FEATURE_REGEX file shows a total of 7 unigram features, one feature per line.
    2. Each feature name is given by "@name = FEATURE_NAME", which follows the feature's regex(es).
    3. Tokens in the SVAL2 file should be separated by exactly one blank space. Any non-tokens should be placed inside angle brackets, e.g. <item>, <sat>.
  2.  /\s(<[^>]*>)*personal(<[^>]*>)*\s(<[^>]*>\s)*(<[^>]*>)*computer(<[^>]*>)*\s/ @name = personal<>computer
     /\s(<[^>]*>)*stock(<[^>]*>)*\s(<[^>]*>\s)*(<[^>]*>)*market(<[^>]*>)*\s/ @name = stock<>market
     /\s(<[^>]*>)*electronic(<[^>]*>)*\s(<[^>]*>\s)*(<[^>]*>)*systems(<[^>]*>)*\s/ @name = electronic<>systems
     /\s(<[^>]*>)*toll(<[^>]*>)*\s(<[^>]*>\s)*(<[^>]*>)*free(<[^>]*>)*\s/ @name = toll<>free

    This sample shows a bigram feature file in which each feature consists of two tokens separated by a single space or by any number of non-token sequences in <> brackets.

    More explanation on feature regex creation is given in the perldoc of the nsp2regex program.

    NOTE: Null columns are discarded, i.e. features that do not occur in any of the contexts are dropped. When the --transpose option is specified (see below for details), contexts that do not contain any features are dropped as well.
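The FEATURE_REGEX layout described above can be illustrated with a short sketch. It is written in Python rather than Perl and is not order1vec.pl's actual code; the helper names (load_features, count_matches, frequency_vector) are hypothetical. It also assumes that two consecutive matches may share the single space between them, which the page does not state.

```python
import re

# Two lines in the FEATURE_REGEX layout shown above: /REGEX/ @name = NAME
FEATURE_REGEX_TEXT = r"""
/\s(<[^>]*>)*time(<[^>]*>)*\s/ @name = time
/\s(<[^>]*>)*stock(<[^>]*>)*\s(<[^>]*>\s)*(<[^>]*>)*market(<[^>]*>)*\s/ @name = stock<>market
"""

LINE_FORMAT = re.compile(r"^\s*/(.*)/\s*@name\s*=\s*(\S+)\s*$")


def load_features(text):
    """Return (name, compiled_regex) pairs, one per well-formed line."""
    features = []
    for line in text.splitlines():
        m = LINE_FORMAT.match(line)
        if m:
            features.append((m.group(2), re.compile(m.group(1))))
    return features


def count_matches(pattern, context):
    """Count matches in one context of space-separated tokens."""
    # Pad with spaces so the first and last tokens can match the \s anchors.
    padded = " " + context.strip() + " "
    count, pos = 0, 0
    while True:
        m = pattern.search(padded, pos)
        if m is None:
            return count
        count += 1
        # Assumption: the trailing space of one match may start the next.
        pos = m.end() - 1


def frequency_vector(features, context):
    """One count per feature, in FEATURE_REGEX order."""
    return [count_matches(rx, context) for _, rx in features]


features = load_features(FEATURE_REGEX_TEXT)
print(frequency_vector(features, "the stock market had a hard time time recovering"))
```

Dropping null columns, as the note above describes, would then amount to removing every feature whose count is zero in all contexts.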

Optional Arguments:

--binary

By default, order1vec creates frequency context vectors that show how many times each feature occurs in the context. --binary will instead create binary context vectors where 1 indicates presence of feature and 0 indicates absence of feature in the context.
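The difference between the two vector types can be shown with a minimal Python sketch; to_binary is a hypothetical helper name, not part of SenseClusters.

```python
def to_binary(frequency_vector):
    """Replace every non-zero count with 1 and keep zeros as 0."""
    return [1 if count > 0 else 0 for count in frequency_vector]


# A frequency vector and its binary counterpart.
frequencies = [0, 2, 1, 0, 3]
print(to_binary(frequencies))
```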

--dense

By default, context vectors are output in sparse format. --dense outputs the context vectors in dense format instead.

--rlabel RLABELFILE

Creates an RLABELFILE containing row labels for Cluto's --rlabelfile option. Each line in the RLABELFILE shows the instance id of the instance whose context vector is shown on the corresponding line on STDOUT.

Instance ids are extracted from the SVAL2 file by matching the regex

                /instance id\s*=\s*"IID"/

where 'IID' is an instance id of the <context> that follows this <instance> tag.
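As a rough illustration of this extraction, the sketch below applies the same pattern from Python. The SVAL2 fragment is hypothetical, and row_labels is an assumed helper name rather than a function of order1vec.pl.

```python
import re

# Same pattern as above, with a capture group in place of IID.
INSTANCE_ID = re.compile(r'instance id\s*=\s*"([^"]*)"')

# Hypothetical SVAL2 fragment with one XML tag per line.
SVAL2 = """\
<instance id="line-n.w8_020:7099:">
<context>
he picked up the <head>line</head> and dialed
</context>
</instance>
<instance id="line-n.w7_114:8965:">
<context>
the first <head>line</head> of the poem
</context>
</instance>
"""


def row_labels(sval2_text):
    """Return instance ids in file order, one per RLABELFILE line."""
    return INSTANCE_ID.findall(sval2_text)


print("\n".join(row_labels(SVAL2)))
```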

NOTE: When the --transpose option is specified, the contents of the RLABELFILE and the CLABELFILE are swapped.

--rclass RCLASSFILE

Creates an RCLASSFILE for Cluto's --rclassfile option. Each line in the RCLASSFILE shows the true sense id of the instance whose context vector appears on the corresponding line on STDOUT.

Sense ids are extracted from the SVAL2 file by matching the regex

                /sense\s*id\s*=\s*"SID"\/>/

where SID is the true sense tag of the instance whose IID was most recently extracted by matching

                /instance id\s*=\s*"IID"/

This option cannot be specified when the --transpose option is specified.
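The pairing of each sense id with the most recently seen instance id can be sketched as follows. This is a Python illustration under assumptions: the <answer ... senseid="..."/> line layout in the fragment is hypothetical, and true_senses is an invented helper name.

```python
import re

INSTANCE_ID = re.compile(r'instance id\s*=\s*"([^"]*)"')
SENSE_ID = re.compile(r'sense\s*id\s*=\s*"([^"]*)"/>')

# Hypothetical SVAL2 fragment; the <answer> line layout is an assumption.
SVAL2 = """\
<instance id="hard-a.sjm-180_1:">
<answer instance="hard-a.sjm-180_1:" senseid="HARD1"/>
<answer instance="hard-a.sjm-180_1:" senseid="HARD2"/>
<context>
it was a <head>hard</head> day
</context>
</instance>
<instance id="hard-a.br-l15:">
<answer instance="hard-a.br-l15:" senseid="HARD1"/>
<context>
a <head>hard</head> rock
</context>
</instance>
"""


def true_senses(sval2_text):
    """Map each instance id to the sense ids found before the next instance."""
    senses = {}
    current = None
    for line in sval2_text.splitlines():
        m = INSTANCE_ID.search(line)
        if m:
            current = m.group(1)
            senses[current] = []
        if current is not None:
            # Sense ids belong to the most recently extracted instance id.
            senses[current].extend(SENSE_ID.findall(line))
    return senses


for instance_id, sense_ids in true_senses(SVAL2).items():
    print(instance_id, sense_ids)
```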

--clabel CLABELFILE

Creates a CLABELFILE containing column labels for Cluto's --clabelfile option. Each line in the CLABELFILE shows the feature represented by the corresponding column of the output context vectors.

Features are extracted from the FEATURE_REGEX file by matching the string "@name = FEATURE", where FEATURE is the feature name.

NOTE: When the --transpose option is specified, the contents of the RLABELFILE and the CLABELFILE are swapped.

--transpose

Creates feature vectors instead of the default context vectors. The output is a Latent Semantic Analysis style feature-by-context matrix, instead of the default context-by-feature matrix that is native to SenseClusters. As a result, the contents of the RLABELFILE and CLABELFILE are swapped, i.e. the list of features is output to the RLABELFILE and the list of contexts is output to the CLABELFILE.

--testregex TEST_REGEX

Creates a TEST_REGEX file containing only those regular expressions from the input FEATURE_REGEX file that matched at least once in the input SVAL2 file. This list can be different from the original list in FEATURE_REGEX when different training data has been used to identify features or when a different scope has been used for training and test data creation.

This option is required when the --transpose option is specified, in order to ensure creation of a compatible TEST_REGEX file that corresponds to the output of order1vec.pl in --transpose mode, so that both the output and the TEST_REGEX can be directly passed as inputs to the order2vec.pl program.
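The filtering that --testregex describes can be approximated in a few lines of Python. This is a sketch under assumptions, not the program's own logic: matched_regex_lines is a hypothetical name, and contexts are padded with one space on each side so that the \s anchors can match the first and last tokens.

```python
import re


def matched_regex_lines(feature_regex_lines, contexts):
    """Keep only the FEATURE_REGEX lines whose regex matches some context."""
    padded = [" " + c.strip() + " " for c in contexts]
    kept = []
    for line in feature_regex_lines:
        # Text before "@name" is the regex wrapped in slashes.
        body = line.split("@name", 1)[0].strip()
        pattern = re.compile(body[1:-1])
        if any(pattern.search(c) for c in padded):
            kept.append(line)
    return kept


lines = [
    r"/\s(<[^>]*>)*time(<[^>]*>)*\s/ @name = time",
    r"/\s(<[^>]*>)*task(<[^>]*>)*\s/ @name = task",
]
print(matched_regex_lines(lines, ["a long time ago", "time flies"]))
```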

--showkey

Displays the name of a system-generated KEY file on the first line of STDOUT. The KEY file preserves the instance ids and sense tags of the instances in the given SVAL2 file. This information is used automatically by some of the clustering and evaluation programs in SenseClusters that operate on purely numeric instance formats. Select this option if you plan to run SenseClusters' clustering code.

This option cannot be specified when the --transpose option is specified, as no KEY file is generated in --transpose mode.

--target TARGETREGEX

Specifies a file containing one or more Perl regexes that define the target word. By default, a file named target.regex is assumed to exist in the current directory.

--extarget

Excludes the target word from the features if the target word (as specified by the --target option or the default target.regex file) appears in the FEATURE_REGEX file. In other words, the feature dimensions of the output context vectors will not include the target word even if the target word is listed in the FEATURE_REGEX file.

Other Options:

--help

Displays this message.

--version

Displays the version information.

OUTPUT

KEY file

When --transpose is not specified, order1vec automatically generates a KEY file that preserves the instance ids and sense tags of the SVAL2 instances.

Each line in the KEY file shows an instance id and one or more sense tags of the instance represented by the context vector on the corresponding line on STDOUT, i.e. the ith line in the KEY file shows the instance and sense ids of the ith instance in the SVAL2 file, which is also the ith vector displayed on STDOUT.

A sample KEY file looks like

 <instance id="line-n.w8_020:7099:"/> <sense id="phone"/>
 <instance id="line-n.w8_132:15431:"/> <sense id="phone"/>
 <instance id="line-n.w8_027:13762:"/> <sense id="phone"/>
 <instance id="line-n.w7_114:8965:"/> <sense id="text"/>
 <instance id="line-n.w7_065:1553:"/> <sense id="product"/>
 <instance id="line-n.w9_4:9437:"/> <sense id="product"/>

Or

 <instance id="line-n.w8_020:7099:"/> <sense id="NOTAG"/>
 <instance id="line-n.w7_111:238:"/> <sense id="NOTAG"/>
 <instance id="line-n.w7_011:12078:"/> <sense id="NOTAG"/>
 <instance id="line-n.w7_095:17576:"/> <sense id="NOTAG"/>
 <instance id="line-n.w7_080:10129:"/> <sense id="NOTAG"/>
 <instance id="line-n.w9_4:2358:"/> <sense id="NOTAG"/>

when the sense ids of instances are not available in the input SVAL2 file.

Or

 <instance id="hard-a.sjm-180_1:"/> <sense id="HARD1"/> <sense id="HARD2"/>
 <instance id="hard-a.br-l15:"/> <sense id="HARD1"/>
 <instance id="hard-a.sjm-242_12:"/> <sense id="HARD2"/>
 <instance id="hard-a.sjm-070_4:"/> <sense id="HARD1"/> <sense id="HARD3"/>
 <instance id="hard-a.sjm-168_4:"/> <sense id="HARD3"/>

when some instances have multiple sense tags.
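A KEY file in the layout shown above could be read back with a sketch like the following. It is written in Python, and parse_key_line is a hypothetical helper, not part of SenseClusters.

```python
import re

INSTANCE = re.compile(r'<instance id="([^"]*)"/>')
SENSE = re.compile(r'<sense id="([^"]*)"/>')


def parse_key_line(line):
    """Split one KEY line into (instance_id, [sense_id, ...])."""
    instance = INSTANCE.search(line)
    if instance is None:
        raise ValueError("not a KEY line: %r" % line)
    return instance.group(1), SENSE.findall(line)


key_lines = [
    ' <instance id="hard-a.sjm-180_1:"/> <sense id="HARD1"/> <sense id="HARD2"/>',
    ' <instance id="line-n.w7_111:238:"/> <sense id="NOTAG"/>',
]
for line in key_lines:
    print(parse_key_line(line))
```

Because the ith KEY line corresponds to the ith vector on STDOUT, the parsed list can be indexed in parallel with the vectors.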

Context Vectors on STDOUT (when --transpose is NOT specified)

Sparse Format (SenseClusters Native Representation)

By default (unless --dense is specified), output vectors will be created in sparse format.

The first line on stdout will show 3 numbers separated by blanks as

N M NNZ

where

 N = Number of instances in SVAL2 file

 M = Number of features from the FEATURE_REGEX file that were found at least once in the SVAL2 file

 NNZ = Total number of non-zero entries in all sparse vectors

Each line thereafter shows a single sparse context vector, i.e. the ith line after the 1st line shows the context vector of the ith instance in the given SVAL2 file.

Each sparse vector is a list of pairs of numbers separated by space such that the first number in a pair is the index of a non-zero value in the vector and the second number is a non-zero value itself corresponding to that index.

Sample Sparse Output

 12 18 31
 1 1 2 1
 1 1 2 2 3 2 4 1
 4 1
 5 1 6 2
 5 2 6 3 7 1 8 2 9 1
 9 1
 7 1
 8 1 10 1
 4 2 11 3 12 2 13 4 14 1
 15 1
 14 1 15 1
 3 1 8 1 16 4 17 4 18 4

Note that:

  1. The first line shows that there are a total of 12 sparse vectors, represented using a total of 18 features, with a total of 31 non-zero values.
  2. Each vector (all lines except the 1st line) is a list of 'index value' pairs separated by spaces, e.g. the 1st vector (line 2) shows that the features at indices 1 and 2 appear once in the 1st instance. The 2nd vector (3rd line) shows that the features at indices 1 and 4 appear once, while those at indices 2 and 3 appear twice each, in the 2nd instance.

    Feature indices start from 1, to be consistent with Cluto's matrix format standard.

  3. If --binary is set, all non-zero values will be 1, showing the mere presence of a feature in the context rather than its frequency count.
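To make the sparse layout concrete, the following Python sketch reads the sample output above and checks it against its own header line. parse_sparse is a hypothetical helper, and integer values are assumed because the sample contains frequency counts.

```python
SAMPLE_SPARSE = """\
12 18 31
1 1 2 1
1 1 2 2 3 2 4 1
4 1
5 1 6 2
5 2 6 3 7 1 8 2 9 1
9 1
7 1
8 1 10 1
4 2 11 3 12 2 13 4 14 1
15 1
14 1 15 1
3 1 8 1 16 4 17 4 18 4
"""


def parse_sparse(text):
    """Return (n_rows, n_cols, nnz, rows); each row maps 1-based index to value."""
    lines = text.splitlines()
    n_rows, n_cols, nnz = (int(x) for x in lines[0].split())
    rows = []
    for line in lines[1:1 + n_rows]:
        numbers = line.split()
        # Even positions hold indices, odd positions hold the values.
        rows.append({int(i): int(v) for i, v in zip(numbers[0::2], numbers[1::2])})
    return n_rows, n_cols, nnz, rows


n_rows, n_cols, nnz, rows = parse_sparse(SAMPLE_SPARSE)
# The header must agree with the vectors that follow it.
assert len(rows) == n_rows
assert sum(len(row) for row in rows) == nnz
assert max(index for row in rows for index in row) <= n_cols
```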

Dense Format (SenseClusters Native Representation)

When the --dense option is selected, order1vec will create output in dense vector format.

First line on STDOUT will show exactly two numbers separated by space. The first number indicates the number of vectors and the second number indicates the number of features (dimensions of the context vectors).

Each line thereafter shows a single context vector such that ith line after the 1st line shows the context vector of the ith instance in the SVAL2 file.

Sample Dense Output

 12 18
 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 1 2 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 1 2 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 2 3 1 2 1 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0
 0 0 0 2 0 0 0 0 0 0 3 2 4 1 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0
 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 4 4 4

This shows the same context vectors as the Sample Sparse Output above, but in dense format.

Note that

1. All vectors have the same length, which equals the number of features (here 18) from the given FEATURE_REGEX file that matched at least once in the SVAL2 file.
2. When --binary is ON, value at column j in a vector will be 1 for every feature j that is found at least once in the context.
3. When --binary is not used, value at column j in a vector shows the number of times the jth feature is found in the context.
4. A 0 at column j of any vector shows that the jth feature in the FEATURE_REGEX file doesn't appear in that context.
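The relationship between the two formats can be sketched in Python as below; sparse_to_dense is a hypothetical helper name. It expands one sparse vector line into a dense vector of the given length.

```python
def sparse_to_dense(sparse_line, n_cols):
    """Expand 'index value' pairs (1-based indices) into a dense list."""
    dense = [0] * n_cols
    numbers = sparse_line.split()
    for index, value in zip(numbers[0::2], numbers[1::2]):
        dense[int(index) - 1] = int(value)  # sparse indices start at 1
    return dense


# Last vector of the sample sparse output, expanded to 18 columns.
print(" ".join(str(v) for v in sparse_to_dense("3 1 8 1 16 4 17 4 18 4", 18)))
```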

When --showkey is selected, the output will be exactly as described above, except that the first line will show the name of the KEY file required by the SenseClusters programs.

e.g.

 <keyfile name="KEY"/>
 12 18 31
 1 1 2 1
 1 1 2 2 3 2 4 1
 4 1
 5 1 6 2
 5 2 6 3 7 1 8 2 9 1
 9 1
 7 1
 8 1 10 1
 4 2 11 3 12 2 13 4 14 1
 15 1
 14 1 15 1
 3 1 8 1 16 4 17 4 18 4

This shows the same vectors as the Sample Sparse Output when --showkey is set. The value of KEY shown in the <keyfile> tag will be the system-generated KEY file name.
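A consumer that must accept output produced both with and without --showkey could separate the optional first line as sketched below. This is a Python illustration; split_keyfile_header is a hypothetical name.

```python
import re

KEYFILE_TAG = re.compile(r'^\s*<keyfile name="([^"]*)"/>\s*$')


def split_keyfile_header(text):
    """Return (key_file_name or None, remaining lines of the output)."""
    lines = text.splitlines()
    if lines:
        m = KEYFILE_TAG.match(lines[0])
        if m:
            return m.group(1), lines[1:]
    return None, lines


output = '<keyfile name="KEY"/>\n12 18 31\n1 1 2 1\n'
key_name, matrix_lines = split_keyfile_header(output)
print(key_name, matrix_lines[0])
```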

Feature Vectors on STDOUT (when --transpose IS specified)

Note that --testregex TEST_REGEX is a required option when --transpose is specified.

Sparse Format (Latent Semantic Analysis Representation)

By default (unless --dense is specified), output vectors will be created in sparse format.

The first line on stdout will show 3 numbers separated by blanks as

 N M NNZ

where

 N = Number of features from the FEATURE_REGEX file that were found at least once in the SVAL2 file

 M = Number of instances in SVAL2 file, for which at least one feature was identified

 NNZ = Total number of non-zero entries in all sparse vectors

Each line thereafter shows a single sparse feature vector, i.e. the ith line after the 1st line shows the feature vector of the ith feature in the created TEST_REGEX file.

Each sparse vector is a list of pairs of numbers separated by space such that the first number in a pair is the index of a non-zero value in the vector and the second number is a non-zero value itself corresponding to that index.

Sample Sparse Output (Transpose of the Context Vectors output above)

 18 12 31
 1 1 2 1
 1 1 2 2
 2 2 12 1
 2 1 3 1 9 2
 4 1 5 2
 4 2 5 3
 5 1 7 1
 5 2 8 1 12 1
 5 1 6 1
 8 1
 9 3
 9 2
 9 4
 9 1 11 1
 10 1 11 1
 12 4
 12 4
 12 4

Note that:

  1. The first line shows that there are a total of 18 sparse feature vectors, represented using a total of 12 contexts, with a total of 31 non-zero values.
  2. Each vector (all lines except the 1st line) is a list of 'index value' pairs separated by spaces, e.g. the 1st vector (line 2) shows that the contexts at indices 1 and 2 contain the 1st feature once each. The 3rd vector (4th line) shows that the context at index 2 contains the 3rd feature twice and the context at index 12 contains the 3rd feature once.

    Context indices start from 1, to be consistent with Cluto's matrix format standard.

  3. If --binary is set, all non-zero values will be 1, showing the mere presence of a feature in the context rather than its frequency count.

Dense Format (Latent Semantic Analysis Representation)

When the --dense option is selected, order1vec will create output in dense vector format.

First line on STDOUT will show exactly two numbers separated by space. The first number indicates the number of vectors and the second number indicates the number of contexts (dimensions of the feature vectors).

Each line thereafter shows a single feature vector, such that the ith line after the 1st line shows the feature vector of the ith feature in the created TEST_REGEX file.

Sample Dense Output (Transpose of the dense output of Context Vectors above)

 18 12
 1 1 0 0 0 0 0 0 0 0 0 0 
 1 2 0 0 0 0 0 0 0 0 0 0 
 0 2 0 0 0 0 0 0 0 0 0 1 
 0 1 1 0 0 0 0 0 2 0 0 0 
 0 0 0 1 2 0 0 0 0 0 0 0 
 0 0 0 2 3 0 0 0 0 0 0 0 
 0 0 0 0 1 0 1 0 0 0 0 0 
 0 0 0 0 2 0 0 1 0 0 0 1 
 0 0 0 0 1 1 0 0 0 0 0 0 
 0 0 0 0 0 0 0 1 0 0 0 0 
 0 0 0 0 0 0 0 0 3 0 0 0 
 0 0 0 0 0 0 0 0 2 0 0 0 
 0 0 0 0 0 0 0 0 4 0 0 0 
 0 0 0 0 0 0 0 0 1 0 1 0 
 0 0 0 0 0 0 0 0 0 1 1 0 
 0 0 0 0 0 0 0 0 0 0 0 4 
 0 0 0 0 0 0 0 0 0 0 0 4 
 0 0 0 0 0 0 0 0 0 0 0 4

This shows the same feature vectors as the transposed Sample Sparse Output above, but in dense format.

Note that

1. All vectors have the same length, which equals the number of contexts (here 12) from the given SVAL2 file that contained at least one feature from the TEST_REGEX file.
2. When --binary is ON, value at column j in a vector will be 1 for every context j that contains the feature at least once.
3. When --binary is not used, value at column j in a vector shows the number of times the feature is found in the jth context.
4. A 0 at column j of any vector shows that the feature doesn't appear in the jth context.

SYSTEM REQUIREMENTS

Math::SparseVector - http://search.cpan.org/dist/Math-SparseVector/

BUGS

This program behaves unpredictably if the input file is not in Senseval-2 format. No error message is given, and the program still produces numeric output, but that output is meaningless. A check should be added to make sure the input file is in Senseval-2 format.

AUTHOR

 Ted Pedersen, University of Minnesota, Duluth
 tpederse at d.umn.edu

 Amruta Purandare, University of Pittsburgh

 Anagha Kulkarni, Carnegie-Mellon University

 Mahesh Joshi, Carnegie-Mellon University

COPYRIGHT

Copyright (c) 2002-2008, Ted Pedersen, Amruta Purandare, Anagha Kulkarni, Mahesh Joshi

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to

 The Free Software Foundation, Inc.,
 59 Temple Place - Suite 330,
 Boston, MA  02111-1307, USA.