NAME
    Text::Compare - Language sensitive text comparison

SYNOPSIS
        use Text::Compare;
        # the instant way:
        my $tc = Text::Compare->new( memoize => 1, strip_html => 0 );

        my $sim = $tc->similarity($text_a, $text_b);
        # $sim will be between 0 and 1

        # second way (cache lists):
        my $tc2 = Text::Compare->new( strip_html => 1 );

        # make a language sensitive word hash:
        my %wordhash = $tc2->get_words($some_text);

        $tc2->first_list(\%wordhash);

        foreach my $list (@wordlists) {
           # $list is a hashref
           $tc2->second_list($list);

           print $tc2->similarity();
        }

        # third way (cache texts):
        my $tc3 = Text::Compare->new();

        $tc3->first($some_text);
        $tc3->second($some_other_text);

        print $tc3->similarity;
     
DESCRIPTION
    Text::Compare is an attempt to write a high-speed text comparison tool
    based on vector comparison which uses language-dependent stop words.
    Text::Compare uses Lingua::Identify to find the language of the given
    texts, then uses Lingua::StopWords to get the stop words for that
    language, and finally uses Lingua::Stem to find word stems.
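
    The vector comparison mentioned above can be illustrated with a small,
    self-contained sketch. It is not the module's actual code: the stop
    word list is hard-coded and hypothetical (the module takes it from
    Lingua::StopWords), stemming and language detection are left out, and
    the helper names word_vector and cosine are made up for this example.
    It assumes the comparison is a cosine similarity between word-count
    vectors, as is usual for vector space models.

```perl
use strict;
use warnings;

# Hypothetical, hard-coded English stop words; the module itself
# obtains them from Lingua::StopWords for the detected language.
my %STOP = map { $_ => 1 } qw(a an and the is are of on in to);

# Turn a text into a word => count hash (a sparse term vector).
# No stemming is done here; the module uses Lingua::Stem for that.
sub word_vector {
    my ($text) = @_;
    my %count;
    for my $word (lc($text) =~ /(\w+)/g) {
        next if $STOP{$word};
        $count{$word}++;
    }
    return \%count;
}

# Cosine of the angle between two sparse vectors: 0 means no shared
# words, 1 means identical word distributions.
sub cosine {
    my ($u, $v) = @_;
    my ($dot, $norm_u, $norm_v) = (0, 0, 0);
    $norm_u += $_ ** 2 for values %$u;
    $norm_v += $_ ** 2 for values %$v;
    for my $word (keys %$u) {
        $dot += $u->{$word} * $v->{$word} if exists $v->{$word};
    }
    return 0 unless $norm_u && $norm_v;
    return $dot / sqrt($norm_u * $norm_v);
}

my $first  = word_vector('The cat sat on the mat');
my $second = word_vector('A cat is on the mat');
printf "%.3f\n", cosine($first, $second);    # prints 0.816
```

    After stop word removal the two texts share "cat" and "mat" and differ
    only in "sat", which is why the value is close to, but below, 1.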

METHODS
    new( memoize => <boolean>, strip_html => <boolean> )
        Creates a new Text::Compare object. By default, Text::Compare uses
        Memoize to cache some of the calls. See Memoize for details. If you
        don't want that to happen, initialize it with memoize => 0.
        Furthermore, Text::Compare uses HTML::Strip to strip the HTML found
        in the text. If you are sure that you don't have any HTML in your
        data, or simply don't want to use it, deactivate it with
        strip_html => 0.

    similarity($text_a, $text_b)
        Compares both texts and returns a similarity value between 0 and 1.
        Because Text::Compare handles each text according to its detected
        language, two texts which address the same topic but are in
        different languages might still get relatively high values.
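
    The memoize option of new() refers to Perl's core Memoize module. The
    sketch below shows what that kind of caching does in general. Which
    calls Text::Compare actually memoizes is not stated here, so the
    function count_words is a hypothetical stand-in for an expensive step.

```perl
use strict;
use warnings;
use Memoize;

my $calls = 0;

# Hypothetical stand-in for an expensive step such as word extraction.
sub count_words {
    my ($text) = @_;
    $calls++;
    my @words = split ' ', $text;
    return scalar @words;
}

# Replace count_words with a caching wrapper keyed on its arguments.
memoize('count_words');

my $total = 0;
$total += count_words('to be or not to be') for 1 .. 3;
print "function body ran $calls time(s)\n";    # prints: function body ran 1 time(s)
```

    Caching of this kind only pays off when the same text is compared
    repeatedly; for one-off comparisons, memoize => 0 avoids holding the
    cached results in memory.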

LANGUAGES
    Text::Compare uses the set of languages which is common to
    Lingua::Identify, Lingua::Stem and Lingua::StopWords, namely:

    da
    de
    en
    fr
    it
    no
    pt
    sv

AUTHOR
    Marcus Thiesen, "<marcus@thiesen.org>"

BUGS
    Please report any bugs or feature requests to
    "bug-text-compare@rt.cpan.org", or through the web interface at
    <http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Text-Compare>. I will be
    notified, and then you'll automatically be notified of progress on your
    bug as I make changes.

ACKNOWLEDGEMENTS
    The actual code is heavily based on Search::VectorSpace by Maciej
    Ceglowski.

COPYRIGHT & LICENSE
    Copyright 2005 Marcus Thiesen, All Rights Reserved.

    This program is free software; you can redistribute it and/or modify it
    under the same terms as Perl itself.

CVS
    $Id: Compare.pm,v 1.8 2005/03/04 13:43:30 marcus Exp $