NAME

nsp2regex.pl - Convert Text-NSP output into regular expressions to be used for feature matching

SYNOPSIS

 nsp2regex.pl [OPTIONS] SOURCE [[, SOURCE] ...]

DESCRIPTION

Takes n-word sequences (n-grams) and represents them as regular expressions. These can then be used to identify lexical features in given data and to convert lexical element files from text into feature vectors.

INPUT

Required Arguments:

SOURCE

SOURCE is a file containing the list of features. Each feature must be in the following format:

 the_feature_token<>

 Unigram feature: temperature<>
 Bigram feature: daily<>temperature<>
 

Output created by count.pl or statistic.pl (both part of the Ngram Statistics Package) can be used directly as the SOURCE file.
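To make the feature format concrete, the following Python sketch splits one SOURCE line into its tokens. The function name parse_feature_line is hypothetical, and discarding whatever follows the final <> (for instance counts or scores in count.pl or statistic.pl output) is an assumption of this sketch, not a description of how nsp2regex.pl is implemented.

```python
def parse_feature_line(line):
    """Split a feature line such as 'daily<>temperature<>' into its tokens.

    Assumption: anything after the final '<>' (for example counts or
    scores) is not part of the feature and is discarded.
    """
    line = line.strip()
    if "<>" not in line:
        return []  # not a feature line (e.g. a blank line or a directive)
    fields = line.split("<>")
    return fields[:-1]  # the last field is whatever follows the final '<>'
```

Under these assumptions, parse_feature_line("daily<>temperature<>") yields the two tokens daily and temperature.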

Optional Arguments

--token FILE

Uses the token definitions contained in FILE to create the separator between tokens. The separator is needed when the window size of the n-grams in SOURCE is greater than n, the number of tokens in each n-gram. Window sizes for n-grams in SOURCE can be defined using the --extended option in count.pl.

--version

Prints the version number.

--help

Prints this help message.

OUTPUT

Outputs the generated regular expressions to stdout.

Explanation of the created Regular Expressions

Default Regular Expression (without Skipping Intermediate Tokens):

By default, nsp2regex.pl creates regexes that match space-separated tokens. These regular expressions assume that the tokens in the text on which they will be used are separated by exactly one space. Further, the regular expressions ignore XML tags and non-tokens, as described in the examples below.

For example, the following line in the input to nsp2regex.pl:

 a<>bigram<>

is converted to the following regex:

 /\s(<[^>]*>)*a(<[^>]*>)*\s(<[^>]*>\s)*(<[^>]*>)*bigram(<[^>]*>)*\s/ @name = a<>bigram

In this output, everything from the first / to the last / constitutes the regular expression. The portion "@name = a<>bigram" is used by xml2arff.pl (from the SenseTools package) to name the attribute corresponding to this regular expression.
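As a small illustration of this layout, the Python sketch below separates one output line into its regular expression and its name. The function split_output_line is hypothetical, and the sketch assumes that the name part always begins with the literal text "/ @name = ", as in the example above.

```python
def split_output_line(line):
    """Return (regex, name) for a line like '/.../ @name = a<>bigram'.

    Assumption: the regex is closed by the last '/' on the line and is
    followed by the literal text ' @name = '.
    """
    line = line.strip()
    marker = "/ @name = "
    cut = line.rfind(marker)
    if cut == -1 or not line.startswith("/"):
        raise ValueError("not a regex line with an @name part")
    regex = line[1:cut]                       # text between first and last '/'
    name = line[cut + len(marker):].strip()   # text after '@name = '
    return regex, name
```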

What This Regular Expression will Match:

This regular expression defines a feature that will match the tokens "a" and "bigram" under the following conditions:

 i>   Tokens "a" and "bigram" have exactly one space to their left and
      right. For example, this regex will match the sentence " this is a
      bigram ". This regex will not match the sentence " i wanna bigram "
      nor the sentence " i have a bigrams ". It will not even match " I
      have a    bigram ". This is because nsp2regex.pl creates regular
      expressions that assume that there is exactly ONE space character
      between tokens!

 ii>  Tokens "a" and "bigram" are bounded by one or more XML tags or
      non-tokens, that is, sequences of characters that start with '<'
      and end with '>'. For example, this regex will match the sentence
      " this is a <head>bigram</head> ". It will also match " this is
      a <head>bigram<senseid=20/></head> ".

 iii> Tokens "a" and "bigram" are separated by one or more
      space-separated XML tags. For example, this regex will match the
      sentence " this is a <,> bigram ". It will also match " this is
      a <,> bigram <!> " and " this is a <,> <head>bigram</head> ".

 iv>  Combinations of the above cases.
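The cases above can be checked directly. The following Python sketch applies the generated regular expression to the example sentences from this section. Python's re module is used here only for illustration (the regex itself is copied unchanged from the output above), and the helper name matches is hypothetical.

```python
import re

# The default regex for the feature a<>bigram<>, copied from the output above.
DEFAULT_REGEX = re.compile(
    r"\s(<[^>]*>)*a(<[^>]*>)*\s(<[^>]*>\s)*(<[^>]*>)*bigram(<[^>]*>)*\s"
)

should_match = [
    " this is a bigram ",
    " this is a <head>bigram</head> ",
    " this is a <head>bigram<senseid=20/></head> ",
    " this is a <,> bigram ",
    " this is a <,> bigram <!> ",
    " this is a <,> <head>bigram</head> ",
]

should_not_match = [
    " i wanna bigram ",       # no separate token "a"
    " i have a bigrams ",     # "bigrams" is not the token "bigram"
    " I have a    bigram ",   # more than one space between tokens
]

def matches(sentence):
    """True if the regex matches anywhere in the sentence."""
    return DEFAULT_REGEX.search(sentence) is not None

for sentence in should_match + should_not_match:
    print(repr(sentence), matches(sentence))
```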

Explanation of this Regular Expression:

Following is an explanation of the various parts of the regular expression:

 /\s(<[^>]*>)*a(<[^>]*>)*\s(<[^>]*>\s)*(<[^>]*>)*bigram(<[^>]*>)*\s/ @name = a<>bigram


 a> All the portion between the first '/' and the last '/' is the regular
    expression. 

 b> The regular expression starts with requiring a single space
    character, \s. This is consistent with the assumption that every
    token has exactly one space to its left and one to its right.

 c> The next chunk is (<[^>]*>)*a(<[^>]*>)*
    Note that the portion (<[^>]*>) represents exactly our definition
    of an XML tag, namely that it should start with a '<', have 0 or
    more characters, except the '>' character, and then end with the
    '>' character. The '*' outside the bracket denotes that we are
    willing to match 0 or more such tags. After that, we wish to match
    a single occurrence of the first token, 'a', again followed by 0 or
    more tags. Note that the tags are "stuck" to the token 'a', in that
    there is no space between the tag and the token 'a'. Of course if
    in the text there is a space between an XML tag and 'a', then the
    space would match the space in <b> above. 

 d> Having matched token 'a' with 0 or more tags "stuck" to its right
    and left, we now wish to match exactly a single space character
    through the \s. Again this corresponds to our assumption that
    tokens in the text are separated by exactly one space character!

 e> The next chunk (<[^>]*>\s)* is again our familiar XML tag, this
    time followed by a space. Here we wish to "skip" over 0 or more
    occurrences of any XML tag that lies between the first and the
    second token, i.e. between 'a' and 'bigram'. Since these tags are
    not "stuck" to the next token 'bigram', they are space separated
    from each other and from 'bigram'. Hence, for every such tag we
    match, we also match a space character!

 f> The next chunk is (<[^>]*>)*bigram(<[^>]*>)* which is exactly like
    the chunk for 'a' in point <c> above. 

 g> Finally we wish to match a single space character \s.

 h> The portion after the last '/', @name = a<>bigram, creates a "name"
    for this feature. This name is used by xml2arff.pl (from the
    SenseTools package) while creating the vector output of the input
    XML file. While this name is not necessary, it makes the vector
    output more human-readable.
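The parts described above can be assembled mechanically. The Python sketch below rebuilds the example regex from its chunks. The names TAGS, SPACED_TAGS and build_default_regex are hypothetical, and extending the same pattern to n-grams of other sizes is an assumption based on the bigram example, not a statement about the actual code of nsp2regex.pl.

```python
TAGS = r"(<[^>]*>)*"            # zero or more tags "stuck" to a token
SPACED_TAGS = r"(<[^>]*>\s)*"   # zero or more tags, each followed by a space

def build_default_regex(tokens):
    """Build the default regex body for a list of already-escaped tokens."""
    # Each token may have tags stuck to both sides and ends with one space.
    chunks = [TAGS + token + TAGS + r"\s" for token in tokens]
    # A leading space, then the token chunks joined by skippable tags.
    return r"\s" + SPACED_TAGS.join(chunks)

print(build_default_regex(["a", "bigram"]))
```

For the tokens 'a' and 'bigram' this reproduces the regular expression shown above, without the surrounding slashes and the @name portion.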

Regular Expression with Skipping of Intermediate Tokens:

nsp2regex.pl can create regular expressions that ignore one or more tokens that occur between the tokens to be matched. This is switched on by the directive "@count.WindowSize=..." in the input file to nsp2regex.pl. We also need to provide nsp2regex.pl (through the --token option) with the same token file that we provide to preprocess.pl. Say the following is the token file:

 /<head>\w+<\/head>/
 /\w+/

Let the input file to the nsp2regex.pl program be the following:

 @count.WindowSize=3
 a<>bigram<>

then, the output regular expression from nsp2regex.pl is:

 /\s(<[^>]*>)*a(<[^>]*>)*\s(<[^>]*>\s)*((<[^>]*>)*((<head>\w+<\/head>)|(\w+))(<[^>]*>)*\s(<[^>]*>\s)*){0,1}(<[^>]*>)*bigram(<[^>]*>)*\s/ @name = a<>bigram<>1

What This Regular Expression will Match:

This regular expression will match the tokens "a" and "bigram" separated by 0 or 1 occurrences of the whitespace-separated token ((<head>\w+<\/head>)|(\w+)). These are the token definitions obtained from the token file above!

For example, this regular expression will match the following sentences:

 " this is a funny bigram "
 " this is a bigram "
 " this is a <head>nice</head> bigram "
 " this is a <,> bigram "
 " this is a <,> <head>nice</head> bigram "

This regular expression will not match:

 " this is a really big bigram "
 " i wanna write bigram "
 " this is a , bigram "

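These examples can be verified with a short Python sketch that applies the regular expression above to each sentence. Python's re module is used purely for illustration; the regex is copied unchanged from the output above, and the helper name window_matches is hypothetical.

```python
import re

# Regex generated for a<>bigram<> with @count.WindowSize=3 (copied from above).
WINDOW_REGEX = re.compile(
    r"\s(<[^>]*>)*a(<[^>]*>)*\s(<[^>]*>\s)*"
    r"((<[^>]*>)*((<head>\w+<\/head>)|(\w+))(<[^>]*>)*\s(<[^>]*>\s)*){0,1}"
    r"(<[^>]*>)*bigram(<[^>]*>)*\s"
)

matching = [
    " this is a funny bigram ",
    " this is a bigram ",
    " this is a <head>nice</head> bigram ",
    " this is a <,> bigram ",
    " this is a <,> <head>nice</head> bigram ",
]

non_matching = [
    " this is a really big bigram ",  # two intervening tokens, only one allowed
    " i wanna write bigram ",         # no separate token "a"
    " this is a , bigram ",           # "," is neither a token nor inside <>
]

def window_matches(sentence):
    """True if the windowed regex matches anywhere in the sentence."""
    return WINDOW_REGEX.search(sentence) is not None

for sentence in matching + non_matching:
    print(repr(sentence), window_matches(sentence))
```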
Explanation of this Regular Expression:

Following is a description of various parts of the regular expression:

 /\s(<[^>]*>)*a(<[^>]*>)*\s(<[^>]*>\s)*((<[^>]*>)*((<head>\w+<\/head>)|(\w+))(<[^>]*>)*\s(<[^>]*>\s)*){0,1}(<[^>]*>)*bigram(<[^>]*>)*\s/ @name = a<>bigram<>1

On careful observation, one will notice that the above regular expression differs from the default regular expression described earlier in only one portion.

Specifically, the portion \s(<[^>]*>)*a(<[^>]*>)*\s(<[^>]*>\s)* is the same as before. It matches a space, followed by 'a' with XML tags or non-token characters (within <> brackets) stuck to its left and right, followed by a single space, followed by 0 or more XML tags and non-token characters, with a space after every such tag.

Further, note that the portion (<[^>]*>)*bigram(<[^>]*>)*\s is again the same as before. It matches 'bigram' with XML tags and non-token character tags stuck to its left and right, followed by a single space.

Thus the only "new" portion in this regex is

 ((<[^>]*>)*((<head>\w+<\/head>)|(\w+))(<[^>]*>)*\s(<[^>]*>\s)*){0,1}

We call this the "separator" portion of the regex; this is the portion that allows for the "ignoring" of up to one token between the tokens 'a' and 'bigram'. This token can be either a <head>\w+</head> or a \w+.

 a> Observe that the entire section is within a pair of round brackets,
    followed by a {0,1}. This says that this portion is allowed to
    occur 0 or 1 times. This is consistent with the window size of
    3... besides 'a' and 'bigram', we allow at most one other token to
    come into the window. If our window size were to be 10 say, this
    would be {0,8}.

 b> The first part inside this bracketed portion is 
    (<[^>]*>)*((<head>\w+<\/head>)|(\w+))(<[^>]*>)*. This says that we
    are willing to match either a <head>\w+</head> or a \w+. Further,
    whatever we match can be preceded or followed by XML tags or
    non-token characters enclosed in angle brackets <>. 

 c> Having matched either of the two options, we wish to match a single
    space, \s, followed by zero or more XML tags or non-tokens, each
    followed by a space, in keeping with our desire to skip these tags!

 d> And, as mentioned in <a> above, we would like to do this matching
    at most once, that is, there will be at most one such token between
    'a' and 'bigram'. 

 e> The name of the feature has also changed to @name = a<>bigram<>1
    implying that we are allowing at most one token to come in between
    our two main tokens!
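The construction of the separator can be sketched in Python as follows. The function names read_token_regexes and build_separator are hypothetical, and computing the repeat count as the window size minus the n-gram size is inferred from the examples in this section (window size 3 gives {0,1}, window size 10 gives {0,8}); it is not taken from the nsp2regex.pl source.

```python
def read_token_regexes(lines):
    """Strip the surrounding slashes from token-file lines of the form /regex/."""
    regexes = []
    for line in lines:
        line = line.strip()
        if len(line) >= 2 and line.startswith("/") and line.endswith("/"):
            regexes.append(line[1:-1])
    return regexes

def build_separator(token_regexes, window_size, ngram_size):
    """Build the "separator" portion that skips intermediate tokens."""
    # Each token definition becomes one bracketed alternative.
    alternatives = "|".join("(" + regex + ")" for regex in token_regexes)
    # Assumption: at most (window size - n) tokens may be skipped.
    max_skipped = window_size - ngram_size
    return (
        r"((<[^>]*>)*(" + alternatives + r")(<[^>]*>)*\s(<[^>]*>\s)*)"
        + "{0,%d}" % max_skipped
    )

token_file = [r"/<head>\w+<\/head>/", r"/\w+/"]
print(build_separator(read_token_regexes(token_file), 3, 2))
```

With the token file and window size used in this section, the sketch reproduces the separator portion shown above.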

A Fine Point about nsp2regex.pl:

Certain characters, such as '.', '*' and '?', have special meaning when used within a regular expression. If these characters occur in the tokens from which the regular expression is being built, they are "escaped" (by prepending a backslash '\'). The following characters are escaped in this way: '\', '/', '|', '(', ')', '[', ']', '{', '}', '^', '$', '*', '+', '?' and '.'
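A minimal Python sketch of this escaping step is shown below. The names SPECIAL_CHARS and escape_token are hypothetical; the set of characters is exactly the list given above, but the sketch is only an illustration and may differ from how nsp2regex.pl performs the escaping internally.

```python
# The fifteen characters listed above: \ / | ( ) [ ] { } ^ $ * + ? .
SPECIAL_CHARS = set("\\/|()[]{}^$*+?.")

def escape_token(token):
    """Prefix every special character in the token with a backslash."""
    return "".join("\\" + ch if ch in SPECIAL_CHARS else ch for ch in token)

print(escape_token("U.S."))   # each '.' gains a preceding backslash
print(escape_token("and/or")) # the '/' gains a preceding backslash
```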

AUTHORS

 Satanjeev Banerjee, Carnegie-Mellon University

 Ted Pedersen, University of Minnesota, Duluth
 tpederse at d.umn.edu

COPYRIGHT

Copyright (c) 2001-2008, Satanjeev Banerjee and Ted Pedersen

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to

 The Free Software Foundation, Inc.,
 59 Temple Place - Suite 330,
 Boston, MA  02111-1307, USA.