<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>Measures</title>
<link rel="stylesheet" href="sim-style.css" type="text/css" />
</head>
<body>
<h1>Measures</h1>
<h2>Path length</h2>
<p>The path measure (path) is a simple node-counting scheme. The relatedness
score is inversely proportional to the number of nodes along the shortest
path between the synsets. The shortest possible path occurs when the two
synsets are the same, in which case the length is 1. Thus, the maximum
relatedness value is 1.
</p>
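<p>A minimal sketch of this scoring in Python, assuming the constant of
proportionality is 1 (which matches the stated maximum of 1). The function
name <code>path_score</code> is hypothetical and is used only for
illustration.
</p>
<pre>
```python
def path_score(node_count):
    """Return 1 / node_count, where node_count is the number of nodes
    on the shortest path between two synsets (node-counting)."""
    # Identical synsets give a path of one node, so 1 is the minimum.
    if 1 > node_count:
        raise ValueError("a path contains at least one node")
    return 1.0 / node_count
```
</pre>
<p>Under these assumptions, identical synsets give 1.0 and a path of four
nodes gives 0.25.
</p>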
<h2>Leacock &amp; Chodorow</h2>
<p>The relatedness measure proposed by Leacock and Chodorow (lch) is
-log (length / (2 * D)), where length is the length of the shortest
path between the two synsets (using node-counting) and D is the
maximum depth of the taxonomy.
</p>
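<p>The formula can be sketched in Python as follows. This assumes the
natural logarithm, which the text above does not specify; the function and
parameter names are hypothetical.
</p>
<pre>
```python
import math

def lch_score(length, max_depth):
    """Return -log(length / (2 * D)).

    length    -- nodes on the shortest path (node-counting, at least 1)
    max_depth -- D, the maximum depth of the taxonomy
    """
    # Assumption: natural log; another base only rescales the score.
    return -math.log(length / (2.0 * max_depth))
```
</pre>
<p>For identical synsets (length 1) in a taxonomy with D = 18, this sketch
returns ln(36), roughly 3.58. Longer paths give smaller scores.
</p>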
<p>The fact that the lch measure takes into account the depth of the
taxonomy in which the synsets are found means that the behavior of
the measure is profoundly affected by the presence or absence of a
unique root node. If there is a unique root node, then there are only
two taxonomies: one for nouns and one for verbs. All nouns, then,
will be in the same taxonomy and all verbs will be in the same taxonomy.
D for the noun taxonomy will be somewhere around 18, depending upon
the version of WordNet, and for verbs, it will be 14. If the root node
is not being used, however, then there are nine different noun
taxonomies and over 560 different verb taxonomies, each with a
different value for D.
</p>
<p>If the root node is not being used, then it is possible for synsets
to belong to more than one taxonomy. For example, the synset containing
turtledove#n#2 belongs to two taxonomies: one rooted at group#n#1 and
one rooted at entity#n#1. In such a case, the relatedness is computed
by finding the least common subsumer (LCS) that results in the shortest path between the
synsets. The value of D, then, is the maximum depth of the taxonomy
in which the LCS is found. If the LCS belongs to more than one taxonomy,
then the taxonomy with the greatest maximum depth is selected (i.e.,
the largest value for D).
</p>
<h2>Wu &amp; Palmer</h2>
<p>The Wu &amp; Palmer measure (wup) calculates relatedness by considering the
depths of the two synsets in the WordNet taxonomies, along with the
depth of the LCS. The formula is
score = 2*depth(lcs) / (depth(s1) + depth(s2)). This means that
0 &lt; score &lt;= 1. The score can never be zero because the depth
of the LCS is never zero (the depth of the root of a taxonomy is one).
The score is one if the two input synsets are the same.
</p>
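<p>A minimal Python sketch of the formula, assuming the three depths have
already been looked up and that the root of a taxonomy has depth one. The
function name <code>wup_score</code> is hypothetical.
</p>
<pre>
```python
def wup_score(depth_lcs, depth_s1, depth_s2):
    """Return 2 * depth(lcs) / (depth(s1) + depth(s2)).

    Depths count nodes from the root, so the root itself has depth 1.
    """
    return 2.0 * depth_lcs / (depth_s1 + depth_s2)
```
</pre>
<p>When both synsets are the same, all three depths are equal and the
sketch returns 1.0. Two synsets at depths 5 and 3 whose LCS is the root
(depth 1) score 0.25.
</p>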
<h2>Resnik</h2>
<p>The relatedness value is equal to the information content (IC) of the
least common subsumer (LCS) of the two synsets, that is, their most
informative subsumer. This means that the value will always be greater
than or equal to zero. The
upper bound on the value is generally quite large and varies depending
upon the size of the corpus used to determine information content values.
To be precise, the upper bound should be ln (N) where N is the number of
words in the corpus.</p>
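<p>The ln(N) bound can be illustrated with a short Python sketch. It assumes
the usual definition IC(c) = -ln(freq(c) / N), which this page does not
spell out, and it maps a zero frequency count to an IC of 0. The function
name is hypothetical.
</p>
<pre>
```python
import math

def information_content(frequency, corpus_size):
    """Return -ln(frequency / corpus_size), or 0 when there is no data."""
    # Assumption: a synset with a zero frequency count gets IC = 0.
    if frequency == 0:
        return 0.0
    return -math.log(frequency / corpus_size)
```
</pre>
<p>With this definition, a concept seen exactly once in a corpus of N words
has IC = ln(N), the largest possible value, and a concept that covers the
whole corpus has IC = 0.
</p>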
<h2>Jiang &amp; Conrath</h2>
<p>The relatedness value returned by the jcn measure is equal to
1 / jcn_distance, where jcn_distance is equal to
IC(synset1) + IC(synset2) - 2 * IC(lcs).
</p>
<p>There are two special cases that need to be handled carefully when
computing relatedness; both of these involve the case when jcn_distance
is zero.
</p>
<p>In the first case, we have IC(synset1) = IC(synset2) = IC(lcs) = 0.
In an ideal world, this would only happen when all three concepts,
viz. synset1, synset2, and lcs, are the root node. However, when a
synset has a frequency count of zero, we also use the value 0 for its
information content, so this case can arise for other synsets as well.
In this first case, we return 0 due to lack of data.
</p>
<p>In the second case, we have IC(synset1) + IC(synset2) = 2 * IC(lcs).
This is almost always found when synset1 = synset2 = lcs (i.e., the
two input synsets are the same). Intuitively this is the case of
maximum relatedness, which would be infinity, but it is impossible
to return infinity. Instead, we find the smallest possible distance
greater than zero and return the multiplicative inverse of that distance.
</p>
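<p>The two special cases can be sketched in Python as follows. How the
smallest possible distance is derived is not described here, so the sketch
takes it as a caller-supplied parameter, <code>min_distance</code>. That
parameter and the function name are hypothetical.
</p>
<pre>
```python
def jcn_score(ic1, ic2, ic_lcs, min_distance):
    """Return 1 / jcn_distance, handling the two zero-distance cases."""
    # First case: all three IC values are zero, so there is no data.
    if ic1 == 0 and ic2 == 0 and ic_lcs == 0:
        return 0.0
    distance = ic1 + ic2 - 2.0 * ic_lcs
    # Second case: zero distance (typically identical synsets).
    # Use the smallest distance greater than zero instead of infinity.
    if distance == 0:
        return 1.0 / min_distance
    return 1.0 / distance
```
</pre>
<p>In this sketch, the first check must come before the distance test,
because all-zero IC values also produce a distance of zero.
</p>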
<h2>Lin</h2>
<p>The relatedness value returned by the lin measure is a number equal
to 2 * IC(lcs) / (IC(synset1) + IC(synset2)), where IC(x) is the
information content of x. One can observe, then, that the relatedness
value will be greater than or equal to zero and less than or equal to
one.
</p>
<p>
If the information content of either synset1 or synset2 is zero,
then zero is returned as the relatedness score, due to lack of data.
Ideally, the information content of a synset would be zero only if
that synset were the root node, but when the frequency of a synset
is zero, we use the value of zero as the information content because
of a lack of better alternatives.
</p>
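<p>A minimal Python sketch of the lin computation, including the zero
information content rule. The function name <code>lin_score</code> is
hypothetical, and the IC values are assumed to be computed elsewhere.
</p>
<pre>
```python
def lin_score(ic1, ic2, ic_lcs):
    """Return 2 * IC(lcs) / (IC(synset1) + IC(synset2))."""
    # Lack of data: either synset has an information content of zero.
    if ic1 == 0 or ic2 == 0:
        return 0.0
    return 2.0 * ic_lcs / (ic1 + ic2)
```
</pre>
<p>Identical synsets with non-zero information content score 1.0, since
all three IC values coincide.
</p>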
<h2>Adapted Lesk (Extended Gloss Overlaps)</h2>
<p>The Extended Gloss Overlaps measure (lesk) works by finding overlaps in the
glosses of the two synsets. The relatedness score is the sum of the
squares of the overlap lengths. For example, a single-word overlap
results in a score of 1. Two single-word overlaps result in a score of
2. A two-word overlap (i.e., two consecutive words) results in a score of
4. A three-word overlap results in a score of 9.
</p>
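<p>The scoring step alone can be sketched in Python as follows. Finding the
overlaps in the glosses is not shown; the sketch assumes the overlap
lengths (in words) have already been found. The function name is
hypothetical.
</p>
<pre>
```python
def lesk_score(overlap_lengths):
    """Return the sum of the squares of the overlap lengths (in words)."""
    return sum(n * n for n in overlap_lengths)
```
</pre>
<p>Squaring means that one two-word overlap (score 4) counts for more than
two separate single-word overlaps (score 2).
</p>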
<h2>Gloss Vector</h2>
<p>The Gloss Vector measure (vector) works by forming second-order co-occurrence
vectors from the glosses or WordNet definitions of concepts. The
relatedness of two concepts is determined as the cosine of the angle
between their gloss vectors. In order to get around the data sparsity
issues presented by extremely short glosses, this measure augments the
glosses of concepts with glosses of adjacent concepts as defined by
WordNet relations.
</p>
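<p>The final comparison step can be sketched in Python as follows, assuming
the two gloss vectors have already been built and are given as equal-length
lists of numbers. The function name is hypothetical, and returning 0 for an
all-zero vector is an assumption of this sketch rather than documented
behavior.
</p>
<pre>
```python
import math

def gloss_vector_cosine(u, v):
    """Return the cosine of the angle between two gloss vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Assumption: an all-zero vector carries no information, so score 0.
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```
</pre>
<p>Vectors pointing in the same direction score close to 1, and vectors with
no dimensions in common score 0.
</p>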
<h2>Gloss Vector (pairwise)</h2>
<p>The Gloss Vector (pairwise) measure (vector_pairs) is very similar to the "regular" Gloss
Vector measure, except in the way it augments the glosses of concepts with
adjacent glosses. The regular Gloss Vector measure first combines the
adjacent glosses to form one large "super-gloss" and creates a single vector
corresponding to each of the two concepts from the two "super-glosses". The
pairwise Gloss Vector measure, on the other hand, forms separate vectors
corresponding to each of the adjacent glosses (it does not form a single
"super-gloss"). For example, separate vectors will be created for the hyponyms, the
holonyms, the meronyms, etc. of the two concepts. The measure then takes the
sum of the individual cosines of the corresponding gloss vectors, i.e., the
cosine of the angle between the hyponym vectors is added to the cosine of the
angle between the holonym vectors, and so on. From empirical studies, we have
found that the regular Gloss Vector measure performs better than the pairwise
Gloss Vector measure.
</p>
<h2>Hirst &amp; St-Onge</h2>
<p>This measure (hso) works by finding lexical chains linking the two word
senses. There are three classes of relations that are considered:
extra-strong, strong, and medium-strong. The maximum relatedness
score is 16.
</p>
<h2>Random</h2>
<p>The relatedness values are simply randomly generated numbers.
This is intended only to be used as a baseline.
</p>
</body>
</html>