NAME

WordNet::Similarity - Perl modules for computing measures of semantic relatedness.

SYNOPSIS

Basic Usage Example

  use WordNet::QueryData;

  use WordNet::Similarity::path;

  my $wn = WordNet::QueryData->new;

  my $measure = WordNet::Similarity::path->new ($wn);

  my $value = $measure->getRelatedness("car#n#1", "bus#n#2");

  my ($error, $errorString) = $measure->getError();

  die $errorString if $error;

  print "car (sense 1) <-> bus (sense 2) = $value\n";

Using a configuration file to initialize the measure

  use WordNet::Similarity::path;

  my $sim = WordNet::Similarity::path->new($wn, "mypath.cfg");

  my $value = $sim->getRelatedness("dog#n#1", "cat#n#1");

  ($error, $errorString) = $sim->getError();

  die $errorString if $error;

  print "dog (sense 1) <-> cat (sense 1) = $value\n";

Printing traces

  print "Trace String -> ".($sim->getTraceString())."\n";

DESCRIPTION

Introduction

We observe that humans find it extremely easy to say if two words are related and if one word is more related to a given word than another. For example, if we come across two words, 'car' and 'bicycle', we know they are related as both are means of transport. Also, we easily observe that 'bicycle' is more related to 'car' than 'fork' is. But is there some way to assign a quantitative value to this relatedness? Some ideas have been put forth by researchers to quantify the concept of relatedness of words, with encouraging results.

Eight of these different measures of relatedness have been implemented in this software package. A simple edge counting measure and a random measure have also been provided. These measures rely heavily on the vast store of knowledge available in the online electronic dictionary -- WordNet. So, we use a Perl interface for WordNet called WordNet::QueryData to make it easier for us to access WordNet. The modules in this package REQUIRE that the WordNet::QueryData module be installed on the system before these modules are installed.

Function

The following function is defined:

addConfigOption ($name, $required, $type, $default_val)

Adds the configuration option, $name, to the list of known config options (cf. configure()). If $required is true, then the option requires a value; otherwise, the value is optional, and the default value $default_val is used if a value is not specified in the config file. $type is the type of value the option takes. It can be 'i' for integer, 'f' for floating-point, 's' for string, or 'p' for a file name.

returns: nothing, but will die on error. You can put the call to this function in an eval block to trap the exception (N.B., the eval BLOCK form of eval does not significantly degrade performance, unlike the eval EXPR form of eval. See perldoc -f eval).
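
The eval BLOCK idiom mentioned above might look like the following sketch. To stay runnable without the package installed, it uses a hypothetical stand-in, add_config_option_stub, which dies on an unrecognized type code; the real function's error conditions and messages are not reproduced here.

```perl
use strict;
use warnings;

# Hypothetical stand-in for addConfigOption: accepts the four documented
# type codes ('i', 'f', 's', 'p') and dies on anything else.
my %known_options;

sub add_config_option_stub {
    my ($name, $required, $type, $default_val) = @_;
    die "Unknown option type '$type' for option '$name'\n"
        unless $type =~ /\A[ifsp]\z/;
    $known_options{$name} = {
        required => $required,
        type     => $type,
        default  => $default_val,
    };
    return;
}

# eval BLOCK traps the die; $@ holds the message afterwards.
my $ok = eval {
    add_config_option_stub('threshold', 0, 'x', 0.5);   # 'x' is not a valid type
    1;
};
my $trapped = $ok ? '' : $@;
print "trapped: $trapped" if $trapped;
```

Ending the block with a true value and testing the result of eval, rather than testing $@ alone, is a common way to tell a trapped exception from a normal return.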

Methods

The following methods are defined in this package:

Public methods

$obj->new ($wn, $config_file)

The constructor for WordNet::Similarity::* objects.

Parameters: $wn is a WordNet::QueryData object, $config_file is a configuration file (optional).

Return value: the new blessed object

$obj->initialize ($config_file)

Performs some initialization on the module.

Parameter: the location of a configuration file

Returns: nothing

$obj->configure($config_file)

Parses a configuration file.

If you write a module and want to add a new configuration option, you can use the addConfigOption function to specify the name and nature of the option.

The value of the option is placed in "self": $self->{optionname}.

parameter: a file name

returns: true if parsing of config file was successful, false on error

$obj->getTraceString()

Returns the current trace string and resets the trace string to empty. If tracing is turned off, then an empty string will always be returned.

$obj->getError()

Checks to see if any errors have occurred. Returns a list of the form ($level, $string). If $level is 0, then no errors have occurred; if $level is non-zero, then an error has occurred. A value of 1 is considered a warning, and a value of 2 is considered an error. If $level is non-zero, then $string will have a (hopefully) meaningful error message.
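
Because level 1 is only a warning, a caller may prefer not to die on every non-zero level. The sketch below shows one way to separate the two cases; handle_error is a hypothetical helper, not part of this package, and the sample call passes literal values instead of a real getError() result.

```perl
use strict;
use warnings;

# Hypothetical helper: turns the ($level, $string) pair described above
# into an action. Level 0 is fine, level 1 is reported as a warning,
# and any higher level is treated as fatal.
sub handle_error {
    my ($level, $string) = @_;
    return 'ok' if !$level;
    if ($level == 1) {
        warn "warning: $string\n";
        return 'warning';
    }
    die "error: $string\n";
}

# With a measure object the call would be:
#   handle_error($measure->getError());
my $status = handle_error(1, 'sample warning text');
```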

$obj->traceOptions()

Prints module-specific options to the trace string. Any module that adds configuration options via addConfigOption should override this method.

Options should be printed out using the following format:

  $self->{traceString} .= "option_name :: $option_value\n"

Note that the option name is separated from its current value by a space, two colons, and another space. The string should be terminated by a newline.

Since multiple modules may override this method, any module that does so should ensure that the superclass method gets called as well. You do this by putting this line at the end of your method:

  $self->SUPER::traceOptions();

returns: nothing
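
The override pattern described above can be sketched as follows. Both classes are hypothetical stand-ins (My::BaseMeasure plays the role of the parent measure class), and the "threshold" option is invented for illustration.

```perl
use strict;
use warnings;

# Stand-in base class (hypothetical); it records one option of its own.
package My::BaseMeasure;

sub new {
    my ($class, %args) = @_;
    my $self = { traceString => '', trace => 1, %args };
    return bless $self, $class;
}

sub traceOptions {
    my ($self) = @_;
    $self->{traceString} .= "trace :: $self->{trace}\n";
    return;
}

# Derived class that adds its own option, "threshold" (hypothetical).
package My::DerivedMeasure;
our @ISA = ('My::BaseMeasure');

sub traceOptions {
    my ($self) = @_;
    # Format: name, space, two colons, space, value, newline.
    $self->{traceString} .= "threshold :: $self->{threshold}\n";
    # Let the superclass append its options as well.
    $self->SUPER::traceOptions();
    return;
}

package main;

my $measure = My::DerivedMeasure->new(threshold => 0.5);
$measure->traceOptions();
print $measure->{traceString};
```

Calling SUPER::traceOptions() last means each class in the inheritance chain contributes its own lines, so no option is lost when several modules override the method.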

$obj->parseWps($synset1, $synset2)

parameters: synset1, synset2

returns: a reference to an array [$word1, $pos1, $sense1, $offset1, $word2, $pos2, $sense2, $offset2] or undef

This method checks the format of the two input synsets by calling validateSynset() for each synset.

If the synsets are in wps format, a reference to an array is returned. This array has the form [$word1, $pos1, $sense1, $offset1, $word2, $pos2, $sense2, $offset2], where $word1 is the word part of $synset1, $pos1 is its part of speech, $sense1 is its sense number, and $offset1 is its offset. The remaining four elements hold the same information for $synset2.

If an error occurs (such as a synset being poorly-formed), then undef is returned, the error level is set to non-zero, and an error message is appended to the error string.

$obj->validateSynset($synset)

parameter: synset

returns: a list or undef on error

synset is a string in word#pos#sense format

This method does the following:

  1. Verifies that the synset is well-formed (i.e., that it consists of three parts separated by #s, the pos is one of {n, v, a, r} and that sense is a natural number). A synset that matches the pattern '[^\#]+\#[nvar]\#\d+' is considered well-formed.

  2. Checks if the synset exists by trying to find the offset for the synset

If any of these tests fails, then the error level is set to non-zero, a message is appended to the error string, and undef is returned.

If the synset is well-formed and exists, then a list is returned that has the format ($word, $pos, $sense, $offset).
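
The well-formedness test in step 1 can be sketched on its own. split_wps is a hypothetical helper that applies the pattern quoted above, anchored at both ends; it does not perform step 2, which needs a WordNet::QueryData object.

```perl
use strict;
use warnings;

# Hypothetical helper covering step 1 only (well-formedness).
# Returns (word, pos, sense) on success and an empty list otherwise.
sub split_wps {
    my ($synset) = @_;
    return unless defined $synset;
    # Same pattern as above, anchored at both ends, with capture groups.
    if ($synset =~ /\A([^\#]+)\#([nvar])\#(\d+)\z/) {
        return ($1, $2, $3);
    }
    return;
}

for my $candidate ('car#n#1', 'bus#n', 'run#x#2') {
    my @parts = split_wps($candidate);
    printf "%-8s %s\n", $candidate,
        @parts ? "ok (@parts)" : 'malformed';
}
```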

$obj->getRelatedness($synset1, $synset2)

parameters: synset1, synset2

returns: a relatedness score

This is a virtual method. It must be overridden by a module that is derived from this class. This method takes two synsets and returns a numeric value as their score of relatedness.

$obj->printSet ($pos, $mode, @synsets)

If tracing is turned on, prints the contents of @synsets to the trace string. The contents of @synsets can be either wps strings or offsets. If they are wps strings, then $mode must be the string 'wps'; if they are offsets, then the mode must be 'offset'. Please don't try to mix wps and offsets.

Returns the string that was appended to the trace string.

$obj->fetchFromCache($wps1, $wps2, $non_symmetric)

Looks for the relatedness value of ($wps1, $wps2) in the cache. If $non_symmetric is false (or isn't specified), then the cache is searched for ($wps2, $wps1) if ($wps1, $wps2) isn't found.

Returns: a relatedness value or undef if none found in the cache.

$obj->storeToCache ($wps1, $wps2, $score)

Stores the relatedness value, $score, of ($wps1, $wps2) to the cache.

Returns: nothing
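
The symmetric lookup described for fetchFromCache can be illustrated with a plain hash. The two subs below are hypothetical stand-ins, not the package's own code; they ignore maxCacheSize and assume that a space never appears inside a wps string.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for storeToCache / fetchFromCache, using a hash
# keyed on the two wps strings joined by a space.
my %cache;

sub store_to_cache {
    my ($wps1, $wps2, $score) = @_;
    $cache{"$wps1 $wps2"} = $score;
    return;
}

sub fetch_from_cache {
    my ($wps1, $wps2, $non_symmetric) = @_;
    return $cache{"$wps1 $wps2"} if exists $cache{"$wps1 $wps2"};
    # For symmetric measures, the reversed pair is equally valid.
    return $cache{"$wps2 $wps1"}
        if !$non_symmetric && exists $cache{"$wps2 $wps1"};
    return undef;
}

store_to_cache('car#n#1', 'bus#n#2', 0.25);    # 0.25 is an arbitrary sample score
my $forward  = fetch_from_cache('car#n#1', 'bus#n#2');
my $reversed = fetch_from_cache('bus#n#2', 'car#n#1');
my $strict   = fetch_from_cache('bus#n#2', 'car#n#1', 1);
```

Here $forward and $reversed both receive the stored score, while $strict is undef because the reversed pair is not consulted when $non_symmetric is true.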

Discussion

This package consists of Perl modules along with supporting Perl programs that implement the semantic relatedness measures described by Leacock and Chodorow (1998), Jiang and Conrath (1997), Resnik (1995), Lin (1998), Wu and Palmer (1993), and Hirst and St-Onge (1998), as well as the Extended Gloss Overlaps measure by Banerjee and Pedersen (2002) and a Gloss Vector measure recently introduced by Patwardhan and Pedersen. The package contains Perl modules designed as object classes with methods that take two word senses as input. These methods return the semantic relatedness of the two word senses.

A quantitative measure of the degree to which two word senses are related has applications in numerous areas, such as word sense disambiguation and information retrieval. For example, to determine which sense of a given word is being used in a particular context, the sense having the highest relatedness with the senses of its context words is most likely to be the sense being used. Similarly, in information retrieval, retrieving documents that contain highly related concepts is likely to yield higher precision and recall.

A command line interface to these modules is also present in the package. This interface returns the relatedness measure of two given words. A number of switches and options are provided to modify the output and enhance it with trace information and other useful output. Support programs for generating information content files from various corpora are also available in the package. The information content files are required by three of the measures for computing the relatedness of concepts. There is also a tool to find the depths of the taxonomies in WordNet.

Configuration files

The behavior of the measures of semantic relatedness can be controlled by using configuration files. These configuration files specify how certain parameters are initialized within the object. A configuration file may be specified as a parameter during the creation of an object using the new method. The configuration files must follow a fixed format.

Every configuration file starts with the name of the module ON THE FIRST LINE of the file. For example, a configuration file for the res module will have on the first line 'WordNet::Similarity::res'. This is followed by the various parameters, each on a new line and having the form 'name::value'. The 'value' of a parameter is optional (in case of boolean parameters). In case 'value' is omitted, we would have just 'name::' on that line. Comments are supported in the configuration file. Anything following a '#' is ignored in the configuration file.
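
To make the format concrete, the sketch below parses a small configuration held in a string. parse_config_text is a hypothetical helper written for this illustration; the package's own configure() method may differ in details such as whitespace handling, and this version treats the first non-empty line as the module name.

```perl
use strict;
use warnings;

# Hypothetical parser for the format described above.
sub parse_config_text {
    my ($text) = @_;
    my ($module, %options);
    for my $line (split /\n/, $text) {
        $line =~ s/#.*//;              # drop comments
        $line =~ s/^\s+|\s+$//g;       # trim surrounding whitespace
        next if $line eq '';
        if (!defined $module) {        # first non-empty line names the module
            $module = $line;
            next;
        }
        my ($name, $value) = $line =~ /^(\w+)::(.*)$/
            or die "Malformed line: '$line'\n";
        $options{$name} = $value;      # empty string when the value is omitted
    }
    return ($module, \%options);
}

my $sample = <<'END_CONFIG';
WordNet::Similarity::path
trace::1          # level-1 tracing
cache::           # value omitted, so the default applies
END_CONFIG

my ($module, $options) = parse_config_text($sample);
print "$module\n";
print "$_ => '$options->{$_}'\n" for sort keys %$options;
```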

Sample configuration files are present in the '/samples' subdirectory of the package. Each of the modules has specific parameters that can be set/reset using the configuration files. Please read the manpages or the perldocs of the respective modules for details on the parameters specific to each of the modules. For instance, 'man WordNet::Similarity::res' or 'perldoc WordNet::Similarity::res' should display the documentation for the Resnik module. The module parses the configuration file and recognizes the following parameters:

trace

This option is supported by all measures.

The value of this parameter specifies the level of tracing that should be employed for generating the traces. This value is an integer equal to 0, 1, or 2. If the value is omitted, then the default value, 0, is used. A value of 0 switches tracing off. A value of 1 or 2 switches tracing on. The difference between levels 1 and 2 depends upon the measure being used.

For vector and lesk, a value of 1 displays as traces only the gloss overlaps found. A value of 2 displays as traces all the text being compared.

For the res, lin, jcn, wup, lch, path, and hso measures, a trace of level 1 means the synsets are represented as word#pos#sense strings, while for level 2, the synsets are represented as word#pos#offset strings.

cache

This option is supported by all measures.

The value of this parameter specifies whether or not caching of the relatedness values should be performed. This value is an integer equal to 0 or 1. If the value is omitted, then the default value, 1, is used. A value of 0 switches caching 'off', and a value of 1 switches caching 'on'.

maxCacheSize

This option is supported by all measures.

The value of this parameter indicates the size of the cache, used for storing the computed relatedness value. The specified value must be a non-negative integer. If the value is omitted, then the default value, 5,000, is used. Setting maxCacheSize to zero has the same effect as setting cache to zero, but setting cache to zero is likely to be more efficient. Caching and tracing at the same time can result in excessive memory usage because the trace strings are also cached. If you intend to perform a large number of relatedness queries, then you might want to turn tracing off.

Usage

The semantic relatedness modules in this distribution are built as classes. The classes define four methods that are useful in finding relatedness values for pairs of synsets.

  new()
  getRelatedness()
  getError()
  getTraceString()

Typical Usage Examples

To create an object of the Resnik measure, we would have the following lines of code in the Perl program.

   use WordNet::Similarity::res;
   # Perl does not expand '~' in a path, so build it from $ENV{HOME}
   $object = WordNet::Similarity::res->new($wn, "$ENV{HOME}/resnik.conf");

A reference to the initialized object is stored in the scalar variable '$object'. '$wn' contains a WordNet::QueryData object that should have been created earlier in the program. The second parameter to the 'new' method is the path of the configuration file for the Resnik measure. If the 'new' method is unable to create the object, '$object' will be undefined. This condition, as well as any other error or warning, can be tested as follows.

   die "Unable to create resnik object.\n" unless defined $object;
   ($err, $errString) = $object->getError();
   die $errString."\n" if($err);

To create a Leacock-Chodorow measure object, using default values, i.e. no configuration file, we would have the following:

   use WordNet::Similarity::lch;
   $measure = WordNet::Similarity::lch->new($wn);

To find the semantic relatedness of the first sense of the noun 'car' and the second sense of the noun 'bus' using the Resnik measure, we would write the following piece of code:

   $relatedness = $object->getRelatedness('car#n#1', 'bus#n#2');

To get traces for the above computation:

   print $object->getTraceString();

However, traces must be enabled using configuration files. By default traces are turned off.
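
A sketch of enabling traces for the path measure follows. It writes a two-line configuration file to a temporary location and runs the query only if WordNet::QueryData and WordNet::Similarity::path load and a WordNet installation is found; the temporary file and the guard around the module loading are choices made for this illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a two-line configuration file: module name first, then trace::1.
my ($fh, $config_path) = tempfile(SUFFIX => '.conf', UNLINK => 1);
print {$fh} "WordNet::Similarity::path\n", "trace::1\n";
close $fh or die "Cannot close $config_path: $!\n";

# Load the modules and open WordNet inside eval, so the script still
# runs (and only writes the file) where they are not available.
my $wn = eval {
    require WordNet::QueryData;
    require WordNet::Similarity::path;
    WordNet::QueryData->new;
};

if ($wn) {
    my $measure = WordNet::Similarity::path->new($wn, $config_path);
    die "Unable to create path object.\n" unless defined $measure;
    my $value = $measure->getRelatedness('car#n#1', 'bus#n#2');
    my ($err, $err_string) = $measure->getError();
    die "$err_string\n" if $err;
    print "relatedness = $value\n";
    print $measure->getTraceString();
}
else {
    print "WordNet modules not available; wrote $config_path only\n";
}
```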

AUTHORS

  Ted Pedersen, University of Minnesota Duluth
  tpederse at d.umn.edu

  Siddharth Patwardhan, University of Utah, Salt Lake City
  sidd at cs.utah.edu

  Jason Michelizzi, University of Minnesota Duluth
  mich0212 at d.umn.edu

  Satanjeev Banerjee, Carnegie Mellon University, Pittsburgh
  banerjee+ at cs.cmu.edu

BUGS

None.

To submit a bug report, go to http://groups.yahoo.com/group/wn-similarity or send e-mail to tpederse at d.umn.edu.

SEE ALSO

perl(1), WordNet::Similarity::jcn(3), WordNet::Similarity::res(3), WordNet::Similarity::lin(3), WordNet::Similarity::lch(3), WordNet::Similarity::hso(3), WordNet::Similarity::lesk(3), WordNet::Similarity::wup(3), WordNet::Similarity::path(3), WordNet::Similarity::random(3), WordNet::Similarity::ICFinder(3), WordNet::Similarity::PathFinder(3), WordNet::QueryData(3)

http://www.cs.utah.edu/~sidd

http://wordnet.princeton.edu

http://www.ai.mit.edu/~jrennie/WordNet

http://groups.yahoo.com/group/wn-similarity

COPYRIGHT

Copyright (c) 2005, Ted Pedersen, Siddharth Patwardhan, Jason Michelizzi and Satanjeev Banerjee

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to

    The Free Software Foundation, Inc.,
    59 Temple Place - Suite 330,
    Boston, MA  02111-1307, USA.

Note: a copy of the GNU General Public License is available on the web at http://www.gnu.org/licenses/gpl.txt and is included in this distribution as GPL.txt.