NAME

Text::Fuzzy - partial or fuzzy string matching using edit distances

SYNOPSIS

    use Text::Fuzzy;
    my $tf = Text::Fuzzy->new ('boboon');
    print "Distance is ", $tf->distance ('babboon'), "\n";
    # Prints "Distance is 2"
    my @words = qw/the quick brown fox jumped over the lazy dog/;
    my $nearest = $tf->nearest (\@words);
    print "Nearest array entry is ", $words[$nearest], "\n";
    # Prints "Nearest array entry is brown"

DESCRIPTION

This module calculates the Levenshtein edit distance between words, and does edit-distance-based searching of arrays and files to find the nearest entry. It can handle either byte strings or character strings (strings containing Unicode), treating each Unicode character as a single entity.

It is designed for high performance when searching an array of words or a file for the entry nearest to a particular search term, which it achieves by reducing the number of distance calculations that need to be performed.

It supports either bytewise edit distances or Unicode-based edit distances:

    use utf8;
    my $tf = Text::Fuzzy->new ('あいうえお☺');
    print $tf->distance ('うえお☺'), "\n";
    # prints "2".
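The difference between the two modes comes down to what counts as one unit. As a small illustration that does not use Text::Fuzzy itself, the following sketch uses the core Encode module to compare the lengths of the strings above in characters and in UTF-8 bytes. The variable names are purely illustrative.

```perl
use strict;
use warnings;
use utf8;                    # string literals below are character strings
use Encode qw(encode);

my $long  = 'あいうえお☺';
my $short = 'うえお☺';

# Length in characters: each kana or symbol counts once.
my $char_diff = length ($long) - length ($short);

# Length in bytes: each of these characters occupies three bytes in UTF-8.
my $byte_diff = length (encode ('UTF-8', $long))
              - length (encode ('UTF-8', $short));

print "Difference in characters: $char_diff\n";   # 2
print "Difference in bytes: $byte_diff\n";        # 6
```

Deleting あ and い accounts for the character-based distance of 2; treated as bytes, the same two strings differ by six units.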

The default edit distance is the Levenshtein edit distance, which applies an equal weight of one to additions (cat -> cart), substitutions (cat -> cut), and deletions (carp -> cap). Optionally, the Damerau-Levenshtein edit distance, which additionally allows transpositions (salt -> slat), may be selected using the method "transpositions_ok".
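To make the two distances concrete, here is a minimal pure-Perl sketch of the textbook dynamic-programming calculation. It is not the module's own code: edit_distance is a hypothetical helper written for this illustration, and its transposition handling is the simple "optimal string alignment" variant, which may differ from what the module does on some inputs.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Hypothetical helper for illustration only, not part of Text::Fuzzy.
sub edit_distance {
    my ($from, $to, $allow_transpositions) = @_;
    my @s = split //, $from;
    my @t = split //, $to;
    my $m = @s;
    my $n = @t;
    # $d[$i][$j] is the distance between the first $i characters of
    # $from and the first $j characters of $to.
    my @d;
    $d[$_][0] = $_ for 0 .. $m;
    $d[0][$_] = $_ for 0 .. $n;
    for my $i (1 .. $m) {
        for my $j (1 .. $n) {
            my $cost = $s[$i - 1] eq $t[$j - 1] ? 0 : 1;
            $d[$i][$j] = min (
                $d[$i - 1][$j] + 1,          # deletion
                $d[$i][$j - 1] + 1,          # addition
                $d[$i - 1][$j - 1] + $cost,  # substitution or match
            );
            if ($allow_transpositions && $i > 1 && $j > 1
                && $s[$i - 1] eq $t[$j - 2]
                && $s[$i - 2] eq $t[$j - 1]) {
                # Two adjacent characters swapped: one edit.
                $d[$i][$j] = min ($d[$i][$j], $d[$i - 2][$j - 2] + 1);
            }
        }
    }
    return $d[$m][$n];
}

print "cat -> cart: ", edit_distance ('cat', 'cart'), "\n";     # 1
print "cat -> cut: ", edit_distance ('cat', 'cut'), "\n";       # 1
print "carp -> cap: ", edit_distance ('carp', 'cap'), "\n";     # 1
print "salt -> slat: ", edit_distance ('salt', 'slat'), "\n";   # 2
print "salt -> slat with transpositions: ",
    edit_distance ('salt', 'slat', 1), "\n";                    # 1
```

Without transpositions, salt -> slat costs two substitutions; with them, the swap of "a" and "l" counts as a single edit.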

METHODS

new

    my $tf = Text::Fuzzy->new ('bibbety bobbety boo');

Create a new Text::Fuzzy object from the supplied word.

distance

    my $dist = $tf->distance ($word);

Return the edit distance to $word from the word used to create the object in "new".

nearest

    my $index = $tf->nearest (\@words);

Return the index of the element of the array which is nearest to the argument to "new". If no element is less than the maximum distance away from the word, $index is -1.

    if ($index >= 0) {
        print "Found at $index.\n";
    }

last_distance

    my $last_distance = $tf->last_distance ();

Return the edit distance of the previous match. This is usually called after "nearest" to find out how far the matched entry is from the word used in "new".
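As a sketch of how the two methods might be combined, the following reuses the word list from the synopsis and uses only the methods documented here. It assumes Text::Fuzzy is installed.

```perl
use strict;
use warnings;
use Text::Fuzzy;

my $tf = Text::Fuzzy->new ('boboon');
my @words = qw/the quick brown fox jumped over the lazy dog/;
my $index = $tf->nearest (\@words);
if ($index >= 0) {
    # last_distance refers to the match just found by nearest.
    my $distance = $tf->last_distance ();
    print "Nearest entry is '$words[$index]' at distance $distance.\n";
}
else {
    print "Nothing within the maximum distance.\n";
}
```

Checking that the index is not negative before calling last_distance avoids reporting a distance when nothing matched.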

get_max_distance

    # Get the maximum edit distance.
    print "The max distance is ", $tf->get_max_distance (), "\n";

Get the maximum edit distance of $tf. The default is set to 10.

set_max_distance

    # Set the max distance.
    $tf->set_max_distance (3);

Set the maximum edit distance of $tf. The default is set to 10. If this is called with an undefined value, the maximum edit distance is switched off.
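The effect of the limit on "nearest" can be tried with a sketch like the following, which uses only the methods documented here and the word list from the synopsis. It assumes Text::Fuzzy is installed, and the limit of 1 is an arbitrary choice for illustration.

```perl
use strict;
use warnings;
use Text::Fuzzy;

my $tf = Text::Fuzzy->new ('boboon');
my @words = qw/the quick brown fox jumped over the lazy dog/;

# A tight limit: entries further away than this are not accepted.
$tf->set_max_distance (1);
my $strict = $tf->nearest (\@words);
print "With a maximum of ", $tf->get_max_distance (), ": ",
    ($strict >= 0 ? $words[$strict] : 'no match'), "\n";

# An undefined value switches the limit off.
$tf->set_max_distance (undef);
my $loose = $tf->nearest (\@words);
print "With no maximum: ",
    ($loose >= 0 ? $words[$loose] : 'no match'), "\n";
```

A small maximum is a reasonable choice when distant matches are of no use, for example when suggesting corrections for a misspelt word.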

scan_file

    $tf->scan_file ('/usr/share/dict/words');

Scan a file to find the line which is the nearest match to the word used in "new". This assumes that the file consists of lines of text separated by newlines.

This does not currently support Unicode-encoded files.

transpositions_ok

    $tf->transpositions_ok (1);

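This sets whether the edit distance allows transpositions. Initially transpositions are not allowed, so the Levenshtein edit distance is used; after a call with a true value, transpositions are allowed, giving the Damerau-Levenshtein edit distance.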
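A minimal sketch of the difference, assuming Text::Fuzzy is installed and that the setting applies to later calls to "distance", uses the salt/slat pair from the description:

```perl
use strict;
use warnings;
use Text::Fuzzy;

my $tf = Text::Fuzzy->new ('salt');

# Transpositions are initially not allowed (Levenshtein distance).
my $plain = $tf->distance ('slat');

# Allow swaps of adjacent characters (Damerau-Levenshtein distance).
$tf->transpositions_ok (1);
my $with_swaps = $tf->distance ('slat');

print "Without transpositions: $plain\n";
print "With transpositions: $with_swaps\n";
```

Going by the definitions of the two distances, one would expect the second value to be smaller, since the swap of "a" and "l" counts as one edit rather than two substitutions.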

EXAMPLES

misspelt-web-page.cgi

The file examples/misspelt-web-page.cgi is an example of a CGI script which does something similar to the Apache mod_speling module, offering spelling corrections for mistyped URLs and sending the user to a correct page.

See the file in the distribution for details. See also http://www.lemoda.net/perl/perl-mod-speling/ for how to set up .htaccess to use the script.

spell-check.pl

The file examples/spell-check.pl is a spell checker. It uses a dictionary of words specified by a command-line option "-d":

    spell-check.pl -d /usr/dict/words file1.txt file2.txt

It prints out any words which look like spelling mistakes, using the dictionary.

Because the usual Unix dictionary doesn't have plurals, it uses Lingua::EN::PluralToSingular to convert nouns into singular forms. Unfortunately it still misses past participles and past tenses of verbs.

extract-kana.pl

The file examples/extract-kana.pl extracts the kana entries from "edict", a freely-available Japanese to English electronic dictionary, and does some fuzzy searches on them. It requires a local copy of the file to run. This script demonstrates the use of Unicode searches with Text::Fuzzy.

ACKNOWLEDGEMENTS

The edit distance including transpositions was contributed by Nick Logan (UGEXE). Some of the tests in t/trans.t are taken from the Text::Levenshtein::Damerau::XS module.

AUTHOR

Ben Bullock, <bkb@cpan.org>

COPYRIGHT & LICENCE

This package and associated files are copyright (C) 2012-2013 Ben Bullock.

You can use, copy, modify and redistribute this package and associated files under the Perl Artistic Licence or the GNU General Public Licence.