NAME
Text::Fuzzy::PP - partial or fuzzy string matching using edit distances
(Pure Perl)
SYNOPSIS
use Text::Fuzzy::PP;
my $tf = Text::Fuzzy::PP->new ('boboon');
print "Distance is ", $tf->distance ('babboon'), "\n";
# Prints "Distance is 2"
my @words = qw/the quick brown fox jumped over the lazy dog/;
my $nearest = $tf->nearest (\@words);
print "Nearest array entry is ", $words[$nearest], "\n";
# Prints "Nearest array entry is brown"
DESCRIPTION
This module is a drop in, pure perl, substitute for Text::Fuzzy. All
documentation is taken directly from Text::Fuzzy.
This module calculates the Levenshtein edit distance between words, and
does edit-distance-based searching of arrays and files to find the
nearest entry. It can handle either byte strings or character strings
(strings containing Unicode), treating each Unicode character as a
single entity.
It is designed for high performance in searching for the nearest to a
particular search term over an array of words or a file, by reducing the
number of calculations which needs to be performed.
It supports either bytewise edit distances or Unicode-based edit
distances:
use utf8;
my $tf = Text::Fuzzy::PP->new ('あいうえお☺');
print $tf->distance ('うえお☺'), "\n";
# prints "2".
The default edit distance is the Levenshtein edit distance, which
applies an equal weight of one to additions (`cat' -> `cart'),
substitutions (`cat' -> `cut'), and deletions (`carp' -> `cap').
Optionally, the Damerau-Levenshtein edit distance, which additionally
allows transpositions (`salt' -> `slat') may be selected using the
method transpositions_ok.
METHODS
new
my $tf = Text::Fuzzy::PP->new ('bibbety bobbety boo');
Create a new Text::Fuzzy::PP object from the supplied word.
distance
my $dist = $tf->distance ($word);
Return the edit distance to `$word' from the word used to create the
object in new.
nearest
my $index = $tf->nearest (\@words);
This returns the index of the nearest element in the array to the
argument to new. If none of the elements are less than the maximum
distance away from the word, `$index' is -1.
if ($index >= 0) {
printf "Found at $index, distance was %d.\n",
$tf->last_distance ();
}
Use set_max_distance to alter the maximum distance used.
If there is more than one word with the same distance in `@words', this
returns the first of them.
last_distance
my $last_distance = $tf->last_distance ();
The distance from the previous match closest match. This is used in
conjunction with nearest to find the edit distance to the previous
match.
set_max_distance
# Set the max distance.
$tf->set_max_distance (3);
Set the maximum edit distance of `$tf'. The default maximum distance is
10. Set the maximum distance to a low value to improve the speed of
searches over lists with nearest, or to reject unlikely matches. When
searching for a near match, anything with an edit distance of a value at
least as high as the maximum is rejected without computing the exact
distance. To compute exact distances, call this method with zero or
undefined, the maximum edit distance is switched off, and whatever the
nearest match is is accepted.
get_max_distance
# Get the maximum edit distance.
print "The max distance is ", $tf->get_max_distance (), "\n";
Get the maximum edit distance of `$tf'. The default is set to 10. The
maximum distance may be set with set_max_distance.
scan_file
$tf->scan_file ('/usr/share/dict/words');
Scan a file to find the nearest match to the word used in new. This
assumes that the file contains lines of text separated by newlines and
finds the closest match in the file.
This does not currently support Unicode-encoded files.
transpositions_ok
$tf->transpositions_ok (1);
A true value in the argument changes the type of edit distance used to
allow transpositions, such as `clam' and `calm'. Initially
transpositions are not allowed, giving the Levenshtein edit distance. If
transpositions are used, the edit distance becomes the
Damerau-Levenshtein edit distance. A false value disallows
transpositions:
$tf->transpositions_ok (0);
PRIVATE METHODS
These methods are not expected to be useful for the general user. They
may be useful in benchmarking the module and checking its correctness.
no_alphabet
$tf->no_alphabet (1);
This turns off alphabetizing of the string. Alphabetizing is a filter
used in nearest where the intersection of all the characters in the two
strings is computed, and if the alphabetical difference of the two
strings is greater than the maximum distance, the match is rejected
without applying the dynamic programming algorithm. This increases
speed, because the dynamic programming algorithm is slow.
The alphabetizing should not ever reject anything which is a legitimate
match, and it should make the program run faster in almost every case.
The only envisaged uses of switching this off are checking that the
algorithm is working correctly, and benchmarking performance.
get_trans
my $trans_ok = $tf->get_trans ();
This returns the value set by transpositions_ok.
unicode_length
my $length = $tf->unicode_length ();
This returns the length in characters (not bytes) of the string used in
new. If the string is not marked as Unicode, it returns the undefined
value. In the following, `$l1' should be equal to `$l2'.
use utf8;
my $word = 'ⅅⅆⅇⅈⅉ';
my $l1 = length $word;
my $tf = Text::Fuzzy::PP->new ($word);
my $l2 = $tf->unicode_length ();
ualphabet_rejections
my $rejected = $tf->ualphabet_rejections ();
After running nearest over an array, this returns the number of entries
of the array which were rejected using only the alphabet. Its value is
reset to zero each time nearest is called.
length_rejections
my $rejected = $tf->length_rejections ();
After running nearest over an array, this returns the number of entries
of the array which were rejected because the length difference between
them and the target string was larger than the maximum distance allowed.
ACKNOWLEDGEMENTS
Text::Fuzzy is authored by Ben Bullock (BKB). The levenshtein algorithm,
the documentation, and Text::Fuzzy's tests were taken directly from
Text::Fuzzy.
BUGS
Please report bugs to:
https://rt.cpan.org/Public/Dist/Display.html?Name=Text-Fuzzy-PP
AUTHOR
Nick Logan <ugexe@cpan.org>
LICENSE AND COPYRIGHT
This library is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.