Ted Pedersen > WordNet-Similarity > UTILS

Download:
WordNet-Similarity-2.05.tar.gz

Annotate this POD

CPAN RT

New  6
Open  2
View/Report Bugs
Source  

NAME ^

UTILS - [documentation] WordNet::Similarity supporting utilities

SYNOPSIS ^

The '/utils' subdirectory of the package contains supporting Perl programs. 'similarity.pl' is a command-line interface to the relatedness modules. A number of Perl programs that generate information content files from various corpora are provided. 'wnDepths.pl' is a program that will generate files with the depths of WordNet synsets and the maximum depths of WordNet taxonomies. As part of the standard install, these are also installed into the system directories, and can be accessed from any working directory if the common system directories (/usr/bin, /usr/local/bin, etc.) are in your path.

DESCRIPTION ^

The '/utils' directory of the package contains a few support Perl programs, that use the WordNet::Similarity modules or generate data files for it. As part of the standard installation these are installed into the system directories (such as /usr/bin or /usr/local/bin) from where they can be easily accessed.

similarity.pl

The similarity.pl program provides a command-line interface to the relatedness modules.

Usage

Usage: similarity.pl [{--type TYPE [--config CONFIGFILE] [--allsenses] [--offsets] [--trace] [--wnpath PATH] [--simpath SIMPATH] {--interact --file FILENAME | WORD1 WORD2} |--help |--version }]

Displays the semantic similarity between the base forms of WORD1 and WORD2 using various similarity measures described in Budanitsky Hirst (2001). The parts of speech of WORD1 and/or WORD2 can be restricted by appending the part of speech (n, v, a, r) to the word. (For eg. car#n will consider only the noun forms of the word 'car' and walk#nv will consider the verb and noun forms of 'walk'). Individual senses of can also be given as input, in the form of word#pos#sense strings (For eg., car#n#1 represents the first sense of the noun 'car').

Options

--type

Switch to select the type of similarity measure to be used while calculating the semantic relatedness. The following strings are defined.

  L<WordNet::Similarity::path>         Simple node-counts (inverted).
  L<WordNet::Similarity::wup>          The Wu Palmer measure.
  L<WordNet::Similarity::lch>          The Leacock Chodorow measure.
  L<WordNet::Similarity::jcn>          The Jiang Conrath measure.
  L<WordNet::Similarity::res>          The Resnik measure.
  L<WordNet::Similarity::lin>          The Lin measure.
  L<WordNet::Similarity::hso>          The Hirst St-Onge measure.
  L<WordNet::Similarity::lesk>         Extended gloss overlap measure.
  L<WordNet::Similarity::vector>       Gloss Vector measure.
  LWordNet::Similarity::vector_pairs>  Gloss Vector measure (pairwise).
  L<WordNet::Similarity::random>       A random-number "measure".
--config

Module-specific configuration file CONFIGFILE. This file contains the configuration that is used by the WordNet::Similarity modules during initialization. The format of this file is specific to each modules and is specified in the module man pages and in the documentation of the WordNet::Similarity package. See config.pod for more details.

--allsenses

Displays the relatedness between every sense pair of the two input words WORD1 and WORD2.

--offsets

Displays all synsets (in the output, including traces) as synset offsets and part of speech, instead of the word#partOfSpeech#senseNumber format used by QueryData. With this option any WordNet synset is displayed as word#partOfSpeech#synsetOffset in the output.

--trace

Switches on 'Trace' mode. Displays as output on STDOUT, the various stages of the processing. This option overrides the trace option in the module configuration file (if specified).

--interact

Starts the interactive mode. Useful for demoes, debugging, and generally messing around with the measures.

--file

Allows the user to specify an input file FILENAME containing pairs of word whose semantic similarity needs to be measured. The file is assumed to be a plain text file with pairs of words separated by newlines, and the words of each pair separated by a space.

--wnpath

Option to specify the path of the WordNet data files as PATH. (Defaults to /usr/local/WordNet-2.0/dict on Unix systems and C:\WordNet\2.0\dict on Windows systems)

--simpath

If the relatedness module to be used, is locally installed, then SIMPATH can be used to indicate the location of the local install of the measure.

--help

Displays this help screen.

--version

Displays version information.

NOTE: The environment variables WNHOME and WNSEARCHDIR, if present, are used to determine the location of the WordNet data files. Use '--wnpath' to override this.

ANOTHER NOTE: For the information-content based measures, similarity.pl without the '--config' option invokes the relatedness modules using the default information content file generated during installation of the modules. If, however, the version of WordNet being used has changed since that time, or for some reason the modules are unable to locate the default information content files, then alternate information content files can be specified only via the configuration file. Utilities to generate information content files have been provided in the package. For the WordNet::Similarity::vector_pairs measure, it is mandatory to provide the location of a word vector data file and this can be only done by using the '--config' option. In short, the '--config' option is REQUIRED for the WordNet::Similarity::vector_pairs measure.

YET ANOTHER NOTE: During any given session, only one of three modes of input can be specified to the program -- command-line input (WORD1 WORD2), file input (--file option) or the interactive input (--interact option). If more than one mode of input is invoked at a given time, only one of those modes will work, according to the following levels of priority: interactive mode (--interact option) has highest priority. file input (--file option) has medium priority. command-line input (WORD1 WORD2) has lowest priority.

Compound words may also be given as input to similarity.pl. They may be specified using underscores for spaces (as in WordNet) or may be enclosed within double quotes.

For example:

 similarity.pl --type WordNet::Similarity::jcn school private_school

 similarity.pl --type WordNet::Similarity::lch "interest rate" bank

Here 'private school' and 'interest rate' are the compound words intended in the two examples, respectively.

ANOTHER NOTE: Using the '--file' option however, does not allow us to use both methods of entering compound words in the input file. The compound words in the input file may be entered only using underscores for spaces (the double quotes option is not available for input via the input file).

The part of speech of the input word(s) may be restricted to one or more parts of speech by appending '#' followed by a combination of one or more of 'n', 'v', 'a' or 'r' (for nouns, verbs, adjectives and adverbs) to the one or both words.

A particular sense of a particular word may also be specified as input in the word#pos#sense format. Here 'pos' is exactly one of 'n', 'v', 'a' or 'r'.

For example:

 similarity.pl --type WordNet::Similarity::jcn school#n child#n

 similarity.pl --type WordNet::Similarity::lesk "interest rate#n" bank#nv

 similarity.pl --type WordNet::Similarity::hso telephone talk#v

 similarity.pl --type WordNet::Similarity::vector_pairs word#n#2 newspaper#v

 similarity.pl --type WordNet::Similarity::random chat#n#1 talk#v#2

- Interpreting the output

In the simplest case interpreting the output is rather straightforward. This is the case when just the semantic relatedness of two words has been requested. The output, in this case, consists of the two words and the relatedness value. However, when the '--allsenses' option or the '--trace' option is specified, the program needs to display in the output, WordNet synsets. In order to do this, we decided to adopt the convention introduced by Jason Rennie in the WordNet::QueryData module to represent the WordNet synsets.According to this convention a synset is represented by

(1)

a representative word from that synset

(2)

its part of speech and

(3)

a number specifying the sense number of the word (in this synset)

For example, consider the synset (teacher, instructor) from the noun data file of WordNet. Here the words 'teacher' as well as 'instructor' are each in their first sense. Using the above convention this synset may be represented by 'teacher#n#1' or by 'instructor#n#1'.

Besides this, if '--offsets' command-line option is used, a small variation of the above convention is used that displays the offset of the synset (in the WordNet data file) instead of the sense number. The above synset could then be represented by 'teacher#n#8562747' or 'instructor#n#8562747', since 8562747 if the offset of this synset in the noun data file of WordNet 1.7.

The first convention was adopted as the default, since synset offsets vary between different versions of WordNet, while sense numbers of words would more or less remain constant.

Typical usage examples

(1)

Suppose you wanted to find the measure of relatedness between 'car' and 'bicycle', using the Jiang-Conrath measure.

    similarity.pl --type WordNet::Similarity::jcn car bicycle
(2)

Suppose you need to find the relatedness of the noun forms of 'comb' and 'hair' using the Leacock-Chodorow measure and also your WordNet database files happen to be located at /WordNet-2.0/dict, then you would have

    similarity.pl --type WordNet::Similarity::lch --wnpath /WordNet-2.0/dict comb#n hair#n

If the --wnpath option is not given, the program looks for the path to the data files in the WNHOME and the WNSEARCHDIR environment variables. If these have also not been specified, then by default the program assumes that the WordNet data files reside in the directory /usr/local/WordNet-2.0/dict on a Unix machine and in C:\WordNet\2.0\dict on a Windows machine.

(3)

An example using a data file as input to the program (using the Jiang-Conrath measure for this example)

    similarity.pl --type WordNet::Similarity::jcn --file testfile
(4)

Displaying relatedness between all senses of the two words along with traces.

    similarity.pl --type WordNet::Similarity::lch --allsenses --trace paper pencil
(5)

Displaying the relatedness between the verb form of 'talk' and all parts of speech of 'speaker', with traces using the extended gloss overlap measure.

 similarity.pl --type WordNet::Similarity::lesk --trace speaker talk#v
(6)

Using a configuration file "/home/sid/lesk.conf" to specify the configuration options to the WordNet::Similarity::lesk module.

    similarity.pl --type WordNet::Similarity::lesk --config /home/sid/lesk.conf duck fowl
(7)

To display version information.

    similarity.pl --version
(8)

To display detailed help.

    similarity.pl --help

Information content programs

Three of the measures provided within the package require information content values of concepts (WordNet synsets) for computing the semantic relatedness of concepts. We provide these measures with frequency counts of WordNet synsets computed from large corpora of text, in files called information content files. A number of programs have been provided in the '/utils' subdirectory to generate information content files from various different corpora of text available.

  L<BNCFreq.pl>        -- from the BNC corpus. 
  L<brownFreq.pl>      -- from the Brown corpus. 
  L<semCorFreq.pl>     -- from SemCor X.X (using the sense tags). 
  L<semCorRawFreq.pl>  -- from SemCor X.X (ignoring the sense tags). 
  L<treebankFreq.pl>   -- from the Treebank corpus.
  L<rawtextFreq.pl>    -- from raw text.

All the six have a similar interface, however there are slight differences in the way the programs are called on the command-line due to the differences in the organization and format of the various corpora. But the following sub-sections give the typical usage and examples of all these programs. Please use the '--help' switch of each of the programs for the exact usage and help.

Usage

 <utility> [{--outfile OUTFILE 
           [--stopfile STOPFILE] [--wnpath WNPATH] 
           [--resnik] [--smooth SCHEME] PATH
           | --help 
           | --version }]

Here <utility> is one of the Perl programs provided, that generates an information content file from a large corpus of text. This program computes the information content of concepts, by counting the frequency of their occurrence in a corpus. PATH specifies the files of the corpus or the root of the directory tree containing the text of the corpus. Each utility has a different way in which the input files may be specified to it. Please use

  <utility> --help

to get the <utility>-specific idiosyncrasies.

Options:

--outfile

Specifies the output file OUTFILE.

--stopfile

STOPFILE is a list of stop listed words that will not be considered in the frequency count.

--wnpath

Option to specify WNPATH as the location of WordNet data files. If this option is not specified, the program tries to determine the path to the WordNet data files using the WNHOME environment variable.

--resnik

To enable the counting of frequencies using the method described by Resnik [3]. This was the method of counting originally used. We implemented a different scheme of counting described in our publication (Patwardhan, Banerjee and Pedersen [9]).

--smooth

Specifies the smoothing to be used on the probabilities computed. SCHEME specifies the type of smoothing to perform. It is a string, which can be only be 'ADD1' as of now. Other smoothing schemes will be added in future releases.

--help

Displays this help screen.

--version

Displays version information.

A sample COMPFILE containing the list of compounds in WordNet 2.0 is present is the '/samples' subdirectory. A utility called compounds.pl has been provided in the '/utils' subdirectory. This utility generates a list of compounds present in your version of WordNet and can be used to generate a file containing the list of compounds in WordNet as follows:

    compounds.pl > compounds.dat

In this case compounds.pl detects the location of the WordNet data files using the WNHOME environment variable. If the WNHOME environment variable has not been set up it tries default locations (C:\Program Files\WordNet\2.0 on Windows and /usr/local/WordNet-2.0 on a Unix system). Another way to specify the location of the WordNet data files is by using the '--wnpath' option in compounds.pl, like so

    compounds.pl --wnpath /usr/local/WordNet-2.0 > compounds.dat

- The utility-specific idiosyncrasies

(a)

BNCFreq.pl -- This utility creates the information content file from the word frequencies counted from the British National Corpus. The data files in the BNC are XML tagged files present in a 2-level directory structure. In a typical BNC install, the data files of the BNC reside in /BNC-world/Texts subpath of the BNC installation. This is the path that needs to be specified to BNCFreq.pl to count the frequency of words:

    BNCFreq.pl [OPTIONS] /home/sid/BNC-world/Texts
(b)

brownFreq.pl -- The version of the Brown corpus that we used contained the data files in the BROWN1 and BROWN2 subdirectories of the Brown corpus installation. Both directories contain the same data formatted a little differently. These files are provided to the brownFreq.pl utility as a list of files (commonly specified by wildcards, as follows):

    brownFreq.pl [OPTIONS] /home/sid/Brown/BROWN1/*.TXT
(c)

semCorRawFreq.pl -- These information content files are computed using SemCor, but ignore the sense tags. SemCor X.X was downloaded from Dr. Rada Mihalcea's website http://www.cs.unt.edu/~rada/software.html and is a sense tagged subset of the Brown Corpus and the Red Badge of Courage. The tagged data files are present in the /brown1/tagfiles subdirectory of the extracted package. These files are provided to the semCorRawFreq.pl utility as a list of files (using wildcards):

    semCorRawFreq.pl [OPTIONS] /home/sid/semcorXX/brown1/tagfiles/*

where XX is the the version of semCor appropriate for the version of WordNet being used.

(d)

semCorFreq.pl -- These information content files are computed from SemCor X.X (using the sense tags). The word frequencies for these have already been computed and are distributed as a part of the standard distribution of WordNet. Thus only the location of the WordNet data files need be specified for this utility (using the WNHOME environment variable or the '--wnpath' option).

    semCorFreq.pl [OPTIONS]       
(e)

treebankFreq.pl -- This utility computes the information content files from the Treebank corpus (only the Wall Street Journal articles). The Wall Street Journal articles are usually present in the /raw/wsj subdirectory of the Treebank installation. Only this needs to be specified when using this utility:

    treebankFreq.pl [OPTIONS] /home/sid/treebank/raw/wsj
(f)

rawtextFreq.pl -- To compute the information content files from raw text, only the raw text file(s) need to be specified as the input. A good source of raw text files is the Gutenberg Project: http://www.gutenberg.org

Some typical examples

(1)

In order to generate the information content file from the BNC, we type the command:

 BNCFreq.pl --outfile infoBNC.dat /home/sid/BNC-world/Texts

Here '/home/sid/BNC-world/Texts' is the path containing the BNC. Ouptut information content file is infoBNC.dat.

(2)

Frequency counts generated from the Brown corpus, using a stop-list.

 brownFreq.pl --outfile infoBrown.dat --stopfile stop.txt /home/sid/Brown/*

Uses the file 'stop.txt' containing stop words -- words that are ignored while counting the frequencies.

(3)

Frequency counts generated from a raw text file, using Resnik counting.

 rawtextFreq.pl --outfile infoRawText.dat --resnik /home/sid/Texts/WorldWar.txt

WorldWar.txt is the raw text file. infoRawText.dat is the output information content file.

(4)

Using a the Treebank corpus (WSJ articles) to generate an information content file with the option for Add-1 smoothing.

 treebankFreq.pl --outfile tbInfo.dat --smooth ADD1 /home/sid/treebank/raw/wsj

'tbInfo.dat' is the output file. '/home/sid/treebank/raw/wsj' is the path to the Wall Street Journal articles of the Treebank corpus. The '--smooth ADD1' requests the program to use Add-1 smoothing of the frequency counts. It adds 1 to all frequency counts to prevent any 0 frequency values.

wordVectors.pl

The WordNet::Similarity::vector_pairs module requires a file containing co-occurrence vectors for all the words in the WordNet glosses. The utility wordVectors.pl has been provided to generate such a database file. This utility generates co-occurrence vectors from the WordNet glosses themselves. Utilities to generate these from other corpora will be provided in future releases of this software.

 Usage: wordVectors.pl [{ [--stopfile STOPLIST]
                         [--wnpath WNPATH] [--noexamples] [--cutoff VALUE]
                         [--rhigh RHIGH] [--rlow RLOW] [--chigh CHIGH] 
                         [--clow CLOW] DBFILE 
                      | --help
                      | --version }]

This program writes out word vectors computed from WordNet glosses in a database file specified by filename DBFILE.

Options:

--stopfile

Option specifying a list of stopwords to not be considered while counting.

--wnpath

WNPATH specifies the path of the WordNet data files. Ordinarily, this path is determined from the $WNHOME environment variable. But this option overides this behavior.

--noexamples

Removes examples from the glosses before processing.

--cutoff

Option used to restrict the dimensions of the word vectors with an tf/idf cutoff. VALUE is the cutoff above which is an acceptable tf/idf value of a word.

--rhigh

RHIGH is the upper frequency cutoff of the words selected to have a word-vector entry in the database.

--rlow

RLOW is the lower frequency cutoff of the words selected to have a word-vector entry in the database.

--chigh

CHIGH is the upper frequency cutoff of words that form the dimensions of the word-vectors.

--clow

CLOW is the lower frequency cutoff of words that form the dimensions of the word-vectors.

--help

Displays help screen.

--version

Displays version information.

A useful utility called 'readDB.pl' has been included in the package, so as to be able to view the word vectors formed and stored in the database.

wnDepths.pl

The lch and wup measures both need files that contain certain types of information about the WordNet taxonomies. The wup measure needs a file that lists the depth of every synset in its taxonomy(-ies), and the lch measure needs a file that lists the maximum depth of each taxonomy. The wnDepths.pl program can generate both of these files.

 Usage: wnDepths.pl [[--wnpath=PATH] [--outfile=FILE] [--depthfile=FILE]
                     [--wps] [--verbose]]
                    | --help | --version]
Options:
--wnpath=PATH

PATH is the path to WordNet. The default is /usr/local/WordNet-2.0/dict on Unix and C:\WordNet\2.0\dict on Windows

--outfile=FILE

File to which the maximum depths of the taxonomies should be output.

--depthfile=FILE

File to which the depth of every synset should be output

--wps

output is in 'word#part_of_speech#sense format instead of offset format

--verbose

be verbose

--help

show this help message

--version

show version information

AUTHORS ^

 Ted Pedersen, University of Minnesota, Duluth
 tpederse at d.umn.edu

 Siddharth Patwardhan, University of Utah
 sidd at cs.utah.edu

 Satanjeev Banerjee, Carnegie-Mellon University
 banerjee+ at cs.cmu.edu

 Jason Michelizzi 

COPYRIGHT ^

Copyright (c) 2005 - 2008, Ted Pedersen, Siddharth Patwardhan, Satanjeev Banerjee, and Jason Michelizzi

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.

Note: a copy of the GNU Free Documentation License is available on the web at http://www.gnu.org/copyleft/fdl.html and is included in this distribution as FDL.txt.

syntax highlighting: