
NAME

wsd.pl - automatically assign a meaning to every word in a text

SYNOPSIS

 wsd.pl --context FILE --format FORMAT [--scheme SCHEME] [--type MEASURE] 
           [--config FILE] [--stoplist FILE] 
           [--window INT] [--contextScore NUM] [--pairScore NUM] 
           [--outfile FILE] [--trace INT] [--forcepos] [--nocompoundify] [--usemono] [--backoff]
                | --help | --version

DESCRIPTION

Disambiguates each word in the context file using the specified relatedness measure (or WordNet::Similarity::lesk if none is specified).

OPTIONS

N.B., the = sign between the option name and the option parameter is optional.

--context=FILE

The input file containing the text to be disambiguated. This "option" is required.

--format=FORMAT

The format of the input file. For all formats there must be one sentence per line, one line per sentence. Valid values are:

raw

The input is raw text. Compounds are identified and punctuation is ignored.

tagged

The input has been part-of-speech tagged with Penn Treebank tags. Compounds are not identified, and untagged words are ignored.

wntagged

The input has been part-of-speech tagged with WordNet tags (n, v, a, r). Compounds are not identified, and untagged words are ignored.

--scheme=SCHEME

The disambiguation scheme to use. Valid values are "normal", "fixed", "sense1", and "random". The default is "normal". In fixed mode, once a word is assigned a sense number, other senses of that word won't be considered when disambiguating words to the right of that context word. For example, if the context is

  dogs run very fast

and 'dogs' has been assigned sense number 1, only sense 1 of dogs will be used in computing relatedness values when disambiguating 'run', 'very', and 'fast'.
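
The difference between the normal and fixed schemes can be illustrated with a minimal Python sketch. This is not the code wsd.pl runs; the function name candidate_senses, the sense inventory, and the data layout are all hypothetical and chosen only to mirror the 'dogs' example above.

```python
def candidate_senses(word, all_senses, assigned, scheme="normal"):
    """Return the sense numbers of a context word that take part in scoring.

    all_senses: dict mapping a word to its list of sense numbers
                (a made-up inventory, not real WordNet data).
    assigned:   dict mapping a word to the sense number already chosen
                for it earlier in the sentence.
    """
    if scheme == "fixed" and word in assigned:
        # Fixed scheme: only the sense already assigned is considered.
        return [assigned[word]]
    # Normal scheme: every sense of the context word is considered.
    return list(all_senses[word])


all_senses = {"dogs": [1, 2, 3]}  # hypothetical sense numbers
assigned = {"dogs": 1}            # 'dogs' was disambiguated first

print(candidate_senses("dogs", all_senses, assigned, "normal"))  # [1, 2, 3]
print(candidate_senses("dogs", all_senses, assigned, "fixed"))   # [1]
```

In this sketch, fixing a sense shrinks the set of senses that later target words such as 'run' are compared against.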

The "sense1" scheme guesses that the correct sense of each word is its first sense in WordNet, because the senses of a word in WordNet are ranked according to frequency: the first sense is more likely than the second, the second is more likely than the third, etc. The "random" scheme picks one of the possible senses of the target word at random.

--type=MEASURE

The relatedness measure to be used. The default is WordNet::Similarity::lesk.

--config=FILE

The name of a configuration file for the specified relatedness measure.

--stoplist=FILE

A file containing regular expressions (as understood by Perl), surrounded by slashes (e.g. /\d+/ removes any word containing a digit [0-9]). Any word in the text to be disambiguated that matches one of the regular expressions in the file is removed. Each regular expression must be on its own line, and any trailing whitespace is ignored.

Care must be taken when crafting a stoplist. For example, it is tempting to use /a/ to remove the word 'a', but that expression would cause every word containing the lowercase letter a to be removed. A better alternative would be /\ba\b/.
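
The pitfall can be reproduced with a short Python sketch. It uses Python's re module as a stand-in for Perl's regular expressions (the two agree on \b for this case), and the helper names load_stoplist and apply_stoplist are hypothetical, not part of wsd.pl. Skipping lines that are not wrapped in slashes is a choice made only for this sketch.

```python
import re


def load_stoplist(lines):
    """Compile stoplist lines of the form /regex/ into patterns."""
    patterns = []
    for line in lines:
        line = line.rstrip()  # trailing whitespace is ignored
        if len(line) >= 2 and line.startswith("/") and line.endswith("/"):
            patterns.append(re.compile(line[1:-1]))
    return patterns


def apply_stoplist(words, patterns):
    """Drop every word that matches at least one stoplist pattern."""
    return [w for w in words if not any(p.search(w) for p in patterns)]


words = ["a", "cat", "sat", "on", "the", "mat"]

# /a/ matches any word containing the letter a.
print(apply_stoplist(words, load_stoplist(["/a/"])))
# ['on', 'the']

# /\ba\b/ matches only the standalone word 'a'.
print(apply_stoplist(words, load_stoplist([r"/\ba\b/"])))
# ['cat', 'sat', 'on', 'the', 'mat']
```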

--window=INTEGER

Defines the size of the window of context. The default is 4. A window size of N means that there will be a total of N words in the context window, including the target word. If N is a (positive) even number, then there will be one more word on the left side of the target word than on the right.

For example, if the window size is 4, then there will be two words on the left side of the target word and one on the right. If the window is 5, then there will be two words on each side of the target word.

The minimum window size is 2. A smaller window would mean that there were no context words in the window.
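
The left/right split described above can be worked out with a small Python sketch. The function name context_window is hypothetical, and clipping the window at the edges of the sentence is an assumption of this sketch; the text above does not say what wsd.pl does when the target word is near the start or end of a line.

```python
def context_window(words, target_index, window=4):
    """Return the slice of words that forms the context window.

    The window holds `window` words in total, including the target.
    For an even size, the left side gets one more word than the right.
    """
    if window < 2:
        raise ValueError("window size must be at least 2")
    left = window // 2          # words to the left of the target
    right = window - 1 - left   # words to the right of the target
    start = max(0, target_index - left)               # assumed clipping
    end = min(len(words), target_index + right + 1)   # assumed clipping
    return words[start:end]


words = ["w0", "w1", "w2", "w3", "w4", "w5"]

print(context_window(words, 3, window=4))  # ['w1', 'w2', 'w3', 'w4']
print(context_window(words, 3, window=5))  # ['w1', 'w2', 'w3', 'w4', 'w5']
```

With the target at w3, a window of 4 takes two words on the left and one on the right, and a window of 5 takes two on each side.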

--contextScore=REAL

The minimum score a sense of the target word must achieve to be selected. If no sense of the target word achieves this score, then no winner will be projected (i.e., it is assumed that there is no best sense or that none of the senses are sufficiently related to the surrounding context). The default is zero.

--pairScore=REAL

The minimum pairwise score between a sense of the target word and the best sense of a context word that will be used in computing the overall score for that sense of the target word. Setting this to be greater than zero (but not too large) will reduce noise. The default is zero.
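
How the two thresholds interact can be sketched in Python. This is a simplified illustration under stated assumptions, not the algorithm as implemented: it assumes the overall score of a sense is the sum of its retained pairwise scores, that a score equal to a threshold passes it, and that the function name best_sense and the sense labels are made up for the example.

```python
def best_sense(pair_scores, pair_score=0.0, context_score=0.0):
    """Pick the best sense of a target word, or None if none qualifies.

    pair_scores: dict mapping each sense of the target word to a list of
                 pairwise relatedness scores, one per context word.
    """
    totals = {}
    for sense, scores in pair_scores.items():
        # Pairwise scores below the pairScore threshold are ignored.
        totals[sense] = sum(s for s in scores if s >= pair_score)
    if not totals:
        return None
    winner = max(totals, key=totals.get)
    # No winner is projected if the best total misses the contextScore.
    if totals[winner] < context_score:
        return None
    return winner, totals[winner]


# Hypothetical relatedness scores for two senses of 'run'.
scores = {"run#v#1": [0.05, 0.40], "run#v#2": [0.10, 0.12]}

print(best_sense(scores, pair_score=0.1))                     # ('run#v#1', 0.4)
print(best_sense(scores, pair_score=0.1, context_score=0.5))  # None
```

In the first call the 0.05 score is treated as noise and dropped; in the second, no sense reaches the context threshold, so no sense is assigned.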

--outfile=FILE

The name of a file to which output should be sent. The file will contain one word and its sense per line.

--trace=INT

Turn tracing on or off. A value of zero turns tracing off; a non-zero value turns it on. The different trace levels can be added together to see the combined traces. The trace levels are:

  1 Show the context window for each pass through the algorithm.

  2 Display winning score for each pass (i.e., for each target word).

  4 Display the non-zero scores for each sense of each target
    word (overrides 2).

  8 Display the non-zero values from the semantic relatedness measures.

 16 Show the zero values as well when combined with either 4 or 8.
    When not used with 4 or 8, this has no effect.

 32 Display traces from the semantic relatedness module.
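
Because each trace level is a distinct power of two, adding levels together is the same as setting bits in a mask. The Python sketch below only illustrates that arithmetic; the names TRACE_LEVELS, combine, and describe_trace are hypothetical and the descriptions are paraphrased from the list above.

```python
TRACE_LEVELS = {
    1: "context window for each pass",
    2: "winning score for each pass",
    4: "non-zero scores for each sense of each target word (overrides 2)",
    8: "non-zero values from the relatedness measures",
    16: "zero values as well, when combined with 4 or 8",
    32: "traces from the relatedness module",
}


def combine(*levels):
    """Combine trace levels into the single integer passed to --trace."""
    total = 0
    for level in levels:
        total |= level  # same as addition for distinct powers of two
    return total


def describe_trace(value):
    """List the trace outputs enabled by a --trace value."""
    return [desc for bit, desc in TRACE_LEVELS.items() if value & bit]


value = combine(1, 4, 16)
print(value)  # 21
for line in describe_trace(value):
    print(line)
```

Here a value of 21 requests the context window, the per-sense scores, and the zero scores as well.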

--forcepos

Turn part of speech coercion on. POS coercion attempts to force other words in the context window to be of the same part of speech as the target word. If the text is POS tagged, the POS tags will be ignored. POS coercion may be useful when using a measure of semantic similarity that only works with noun-noun and verb-verb pairs.

--nocompoundify

Disable compoundifying. By default, AllWords.pm compoundifies raw input text; this option turns that off.

--usemono

If this flag is on, monosemous words (words with only one sense) are assigned their only available sense. By default this flag is off.

--backoff

Use the most frequent sense if the measure cannot assign a sense because no relatedness is found with the surrounding words. This can happen with path-based measures and information-content-based measures.

SEE ALSO

 WordNet::SenseRelate::AllWords

The main web page for SenseRelate is

 http://senserelate.sourceforge.net/

There are several mailing lists for SenseRelate:

 http://lists.sourceforge.net/lists/listinfo/senserelate-users/

 http://lists.sourceforge.net/lists/listinfo/senserelate-news/

 http://lists.sourceforge.net/lists/listinfo/senserelate-developers/

AUTHORS

 Jason Michelizzi 

 Ted Pedersen, University of Minnesota, Duluth
 <tpederse at d.umn.edu>

BUGS

Please report bugs to the senserelate-users mailing list.

COPYRIGHT

Copyright (C) 2004-2008 Jason Michelizzi and Ted Pedersen

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.