Module Version: 0.05

NAME

Text::Similarity::Overlaps - Score the Overlaps Found Between Two Strings Based on Literal Text Matching

SYNOPSIS

          # you can measure the similarity between two input strings : 
          # if you don't normalize the score, you get the number of matching words
          # if you normalize, you get a score between 0 and 1 that is scaled based
          # on the length of the strings

          use Text::Similarity::Overlaps;
 
          # my %options = ('normalize' => 1, 'verbose' => 1);
          my %options = ('normalize' => 0, 'verbose' => 0);
          my $mod = Text::Similarity::Overlaps->new (\%options);
          defined $mod or die "Construction of Text::Similarity::Overlaps failed";

          my $string1 = 'this is a test for getSimilarityStrings';
          my $string2 = 'we can test getSimilarityStrings this day';

          my $score = $mod->getSimilarityStrings ($string1, $string2);
          print "There are $score overlapping words between string1 and string2\n";

          # you may want to measure the similarity of a document
          # sentence by sentence - the example below shows you
          # how - suppose you have two text files file1.txt and
          # file2.txt - each having the same number of sentences.
          # convert those files into multiple files, where each
          # sentence from each file is in a separate file.

          # if file1.txt and file2.txt each have three sentences,
          # filex.txt will become sentx1.txt sentx2.txt sentx3.txt

          # this just calls getSimilarity() for each pair of sentences

          use Text::Similarity::Overlaps;

          # distinct variable names are used here so that this example
          # does not redeclare the variables of the first one
          my %file_options = ('normalize' => 1, 'verbose' => 1,
                              'stoplist' => 'stoplist.txt');
          my $file_mod = Text::Similarity::Overlaps->new (\%file_options);
          defined $file_mod or die "Construction of Text::Similarity::Overlaps failed";

          my @file1s = qw / sent11.txt sent12.txt sent13.txt /;
          my @file2s = qw / sent21.txt sent22.txt sent23.txt /;

          # assumes that both documents have the same number of sentences

          for my $i (0 .. $#file1s) {
                  my $pair_score = $file_mod->getSimilarity ($file1s[$i], $file2s[$i]);
                  print "The similarity of $file1s[$i] and $file2s[$i] is : $pair_score\n";
          }

          # compare the two complete documents
          my $doc_score = $file_mod->getSimilarity ('file1.txt', 'file2.txt');
          print "The similarity of the two files is : $doc_score\n";

DESCRIPTION

This module computes the similarity of two text documents or strings by searching for literal word token overlaps. This just means that it determines how many word tokens are identical between the two strings. Various scores are computed based on the number of shared words and the length of the strings.

At present similarity measurements are made between entire files or strings, and finer granularity is not supported. Files are treated as one long input string, so overlaps can be found across sentence and paragraph boundaries.

Files are first converted into strings by getSimilarity(), then getSimilarityStrings() does the actual processing. It counts the number of overlaps (matching words) and finds the longest common subsequences (phrases) between the two strings. However, only the lesk measure uses the information about phrasal matches; the other measures ignore it.

Text::Similarity::Overlaps returns the F-measure, which is a normalized value between 0 and 1. Normalization can be turned off by specifying --no-normalize or, when calling the module directly, by setting 'normalize' => 0 in the options hash as in the SYNOPSIS. In that case the raw_score is returned, which is simply the number of words that overlap between the two strings.

In addition, Overlaps returns the cosine, E-measure, precision, recall, Dice coefficient, and Lesk scores in the allScores table.

     precision = raw_score / length_file_2
     recall = raw_score / length_file_1
     F-measure = 2 * precision * recall / (precision + recall)
     Dice = 2 * raw_score / (sum of string lengths)
     E-measure = 1 - F-measure
     Cosine = raw_score / sqrt (length_file_1 * length_file_2)
     Lesk = sum of the squares of the length of phrasal matches  
         (normalized by dividing by the product of the string lengths)
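
The formulas above can be sketched in a few lines of Perl. This is only an illustration under stated assumptions, not the module's own code: scores_from_counts is a hypothetical helper, and its inputs (raw_score, raw_lesk, and the two lengths in word tokens) are assumed to have been computed already. The cosine is taken here as raw_score / sqrt (len1 * len2), a form assumed because it yields 1 for two inputs made of the same words, as the worked example in this section reports.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Text::Similarity::Overlaps): derive the
# listed scores from a raw overlap count, a raw lesk score, and the lengths
# of the two inputs measured in word tokens.
sub scores_from_counts {
    my ($raw_score, $raw_lesk, $len1, $len2) = @_;

    # Empty input would mean division by zero, so return nothing.
    return unless $len1 > 0 && $len2 > 0;

    my $precision = $raw_score / $len2;
    my $recall    = $raw_score / $len1;
    my $f = ($precision + $recall) > 0
          ? 2 * $precision * $recall / ($precision + $recall)
          : 0;

    return (
        precision => $precision,
        recall    => $recall,
        f         => $f,
        dice      => 2 * $raw_score / ($len1 + $len2),
        e         => 1 - $f,
        # Assumed form: overlap divided by the geometric mean of the lengths.
        cosine    => $raw_score / sqrt($len1 * $len2),
        lesk      => $raw_lesk / ($len1 * $len2),
    );
}

# Illustrative counts only: 3 shared words, inputs of 6 and 4 words.
my %scores = scores_from_counts(3, 5, 6, 4);
printf "%-9s %.4f\n", $_, $scores{$_} for sort keys %scores;
```

Returning a hash keeps the scores addressable by name; the key names are chosen for this sketch and are not claimed to match the keys of the allScores table.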

The raw_score is simply the number of matching words between the two inputs, without respect to their order. Note that matches are literal and must be exact, so 'cat' and 'cats' do not match. This corresponds to the idea of the intersection between the two strings.
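
One way to picture the raw_score is as a count of shared words in which each occurrence can be matched only once. The sketch below is a simplified illustration, not the module's implementation: raw_overlap is a hypothetical helper, it tokenizes on whitespace only, and the one-for-one treatment of repeated words is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: count the words two strings share, ignoring order.
# Assumes whitespace tokenization and that every occurrence of a repeated
# word can take part in at most one match.
sub raw_overlap {
    my ($s1, $s2) = @_;

    # Tally how many times each word occurs in the first string.
    my %remaining;
    $remaining{$_}++ for split ' ', lc $s1;

    my $count = 0;
    for my $word (split ' ', lc $s2) {
        if ($remaining{$word}) {    # an unmatched occurrence is still available
            $remaining{$word}--;
            $count++;
        }
    }
    return $count;
}

my $same_words = raw_overlap('jim bit the dog', 'the dog bit jim');
my $near_miss  = raw_overlap('the cat sat', 'the cats sat');
print "same words, different order: $same_words\n";    # 4
print "'cat' versus 'cats': $near_miss\n";             # 2
```

The second call shows the literal matching described above: 'cat' and 'cats' are different tokens, so only 'the' and 'sat' are counted.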

None of these measures (except lesk) considers the order of the matches. For those measures, 'jim bit the dog' and 'the dog bit jim' are considered exact matches and attain the highest possible score: a raw_score of 4 if the score is not normalized, and 1 if it is normalized (in which case the F-measure is returned).

lesk is different in that it looks for phrasal matches and scores them more highly. The lesk measure is based on the measure of the same name included in WordNet::Similarity. There it is used to match the overlapping text found in the gloss entries of the lexical database / dictionary WordNet in order to measure semantic relatedness.

The lesk measure finds the length of each overlap and squares it. It then sums those squares and, if the score is normalized, divides the sum by the product of the lengths of the strings. For example:

        the dog bit jim
        jim bit the dog

The raw_score is 4, since the two strings are made up of identical words (just in different orders). The F-measure is equal to 1, as are the Cosine and the Dice Coefficient. In fact, the F-measure and the Dice Coefficient are always equivalent, but both are presented since some users may be more familiar with one formulation than the other.
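
The equivalence of the F-measure and the Dice Coefficient follows directly from the definitions of precision and recall given earlier. As a short derivation, writing s for the raw_score and n_1, n_2 for the lengths of the first and second input (symbols introduced here only for brevity), and assuming s > 0:

```latex
\begin{align*}
P &= \frac{s}{n_2}, \qquad R = \frac{s}{n_1} \\
F &= \frac{2PR}{P + R}
   = \frac{2 \cdot \dfrac{s^2}{n_1 n_2}}{\dfrac{s\,(n_1 + n_2)}{n_1 n_2}}
   = \frac{2s}{n_1 + n_2}
   = \text{Dice}
\end{align*}
```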

The raw_lesk score is 2^2 + 1 + 1 = 6, because 'the dog' is a phrasal match between the strings and thus contributes its length squared to the raw_lesk score. The normalized lesk score is 0.375, which is 6 / (4 * 4), or the raw_lesk score divided by the product of the lengths of the two strings. Note that the normalized lesk score has a maximum value of 1: if each of the two strings has n words, then their maximum overlap is n words, which receives a raw_lesk score of n^2, which is then divided by the product of the string lengths, which is again n^2.
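
The worked example can be reproduced with a small greedy procedure: repeatedly take the longest run of consecutive words that both strings share, add the square of its length, and blank out the matched words. This is a minimal sketch in the spirit of the description above, not the module's actual code. lesk_sketch is a hypothetical helper, and whitespace tokenization and the greedy tie-breaking are assumptions.

```perl
use strict;
use warnings;

# Hypothetical helper: greedy phrasal overlap scoring.
# Returns (raw_score, raw_lesk, normalized lesk).
sub lesk_sketch {
    my ($s1, $s2) = @_;
    my @w1 = split ' ', lc $s1;
    my @w2 = split ' ', lc $s2;
    my ($len1, $len2) = (scalar @w1, scalar @w2);
    my ($raw_score, $raw_lesk) = (0, 0);

    while (1) {
        # Find the longest run of consecutive words present in both arrays.
        my ($best, $best_i, $best_j) = (0, 0, 0);
        for my $i (0 .. $#w1) {
            for my $j (0 .. $#w2) {
                my $k = 0;
                $k++ while $i + $k <= $#w1 && $j + $k <= $#w2
                        && defined($w1[$i + $k]) && defined($w2[$j + $k])
                        && $w1[$i + $k] eq $w2[$j + $k];
                ($best, $best_i, $best_j) = ($k, $i, $j) if $k > $best;
            }
        }
        last if $best == 0;    # nothing left in common

        $raw_score += $best;
        $raw_lesk  += $best ** 2;

        # Blank out the matched words so they cannot be matched again.
        $w1[$best_i + $_] = undef for 0 .. $best - 1;
        $w2[$best_j + $_] = undef for 0 .. $best - 1;
    }

    my $normalized = ($len1 && $len2) ? $raw_lesk / ($len1 * $len2) : 0;
    return ($raw_score, $raw_lesk, $normalized);
}

my ($raw, $lesk, $norm) = lesk_sketch('the dog bit jim', 'jim bit the dog');
print "raw_score=$raw raw_lesk=$lesk normalized=$norm\n";
# prints: raw_score=4 raw_lesk=6 normalized=0.375
```

The search is quadratic in the input lengths on every pass, which is acceptable for short strings but would be slow on long documents.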

There is some cleaning of text performed automatically, which includes removal of most punctuation except embedded apostrophes and underscores. All text is made lower case. This occurs both for file and string input.
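
A rough approximation of this cleaning step is sketched below. The exact rules the module applies are not spelled out here, so the regular expressions are assumptions: clean_text is a hypothetical helper that handles ASCII text only, treats an apostrophe as embedded when it has a word character on both sides, and replaces all other punctuation with spaces.

```perl
use strict;
use warnings;

# Hypothetical helper approximating the cleaning described above.
# Assumes ASCII input; the module's real rules may differ.
sub clean_text {
    my ($text) = @_;
    $text = lc $text;

    # Replace everything except letters, digits, underscores,
    # apostrophes and whitespace with a space.
    $text =~ s/[^a-z0-9_'\s]/ /g;

    # Drop apostrophes that are not embedded between two word characters.
    $text =~ s/(?<![a-z0-9_])'|'(?![a-z0-9_])/ /g;

    # Collapse runs of whitespace and trim both ends.
    $text =~ s/\s+/ /g;
    $text =~ s/^ | $//g;

    return $text;
}

my $cleaned = clean_text("The dog's BONE, it's (very) big_bone!");
print "$cleaned\n";
# prints: the dog's bone it's very big_bone
```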

SEE ALSO

 http://text-similarity.sourceforge.net

AUTHOR

 Ted Pedersen, University of Minnesota, Duluth
 tpederse at d.umn.edu

 Jason Michelizzi

Last modified by : $Id: Overlaps.pm,v 1.24 2008/04/06 03:00:38 tpederse Exp $

COPYRIGHT AND LICENSE

Copyright (C) 2004-2008 by Jason Michelizzi and Ted Pedersen

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
