Kyle R. Burton >
String-KeyboardDistance-1.01 >
String::KeyboardDistance

Module Version: 1.00
String::KeyboardDistance - String Comparison Algorithm

use String::KeyboardDistance qw(:all); my $s1 = 'Apple'; my $s2 = 'Wople'; # compute a match probability my $pr = qwerty_keyboard_distance_match('Apple','Wople'); # find the keyboard distance between two strings my $dst = qwerty_keyboard_distance('IBM','HAL'); # find the keyboard distance between two characters $dst = qwerty_char_distance('a','v'); print "maximum distance: $qwerty_max_distance\n";

This module implmements a version of keyboard distance for fuzzy string matching. Keyboard distance is a measure of the physical distance between two keys on a keyboard. For example, 'g' has a distance of 1 from the keys 'r', 't', 'y', 'f', 'h', 'v', 'b', and 'n'. Immediate diagonals (like ''r, 'y', 'v', and 'n') are considered to have a distance of 1 instead of 1.414 to help to prevent horizontal/vertical bias.

A match probability between two strings is computed from the total distances between corresponding characters divided by the length of the longer string multiplied by the maximum distance between the two furthest keys on the keyboard.

The functions in this module use a simple grid of keys. For the qwerty mapping, the grid is similar to the following:

| 0 1 2 3 4 5 6 7 8 9 10 11 12 13 --+---+---+---+---+---+---+---+---+---+---+---+---+---+---+ 0 | ` | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 0 | - | = | | 1 | | q | w | e | r | t | y | u | i | o | p | [ | ] | \ | 2 | | a | s | d | f | g | h | j | k | l | ; | ' | | | 3 | | z | x | c | v | b | n | m | , | . | / | | | | --+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

The grids for both qwerty and dvorak are based on PC style keyboards. Shifted characters have the same coordinates (a and A, 6 and ^).

This module exports no symbols by default. The following functions are available for export through EXPORT_OK:

build_qwerty_map max_qwerty_distance qwerty_char_distance qwerty_keyboard_distance qwerty_keyboard_distance_match build_dvorak_map max_dvorak_distance dvorak_char_distance dvorak_keyboard_distance dvorak_keyboard_distance_match grid_distance

The following varialbes are availalbe for export through EXPORT_OK:

@qwerty_grid $qwerty_map $qwerty_max_distance @dvorak_grid $dvorak_map $dvorak_max_distance

Additionaly, this module supports the following export tags:

:Functions - import all functions :Variables - import all variables :all - import both functions and variables

param: array reference to receive the map [optional] return: the map (array reference)

This function builds a map of each character to its corresponding location on the keyboard. The location is derived by looking at the keyboard as a simple grid in which the location of keys on the keyboard are considereed to be the same as their shifted values. The following keys are considered to have the same location:

1 and ! r and r / and ?

The map is an array, where the index is the value returned by chr() for the character at that location. The value is an array ref describing the location of the character on the keyboard. The first element is the y position, the second is the x, and the third represents the shift value (0 for non-shifted, 1 for shifted). All non-keyable characters, including tabs and spaces, will have undef values in the map, meaning they have no point on the grid. These non-key characterss should be considered to be the maximum distance away from any other character except themselves.

The map is constructed at run-time from the @qwerty_grid package global, and cached in the $qwerty_map package global.

param: array reference to receive the map [optional] return: the map (array reference)

This function is identical to build_qwerty_map, with the exception that it builds a map for dvorak keyboards, and the map is constructed from the @dvorak_grid package global, and cached in the $dvorak_map package global.

return: maximum distance

This function dynamicly computes the maximum distance between keys in the qwerty map. The maximum key distance is stored in the $qwerty_max_distance package global.

return: maximum distance

This function dynamicly computes the maximum distance between keys in the dvorak map. The maximum key distance is stored in the $dvorak_max_distance package global.

param: char 1 param: char 2 return: distance

This function computes the distance between the two characters passed on a qwerty keyboard.

param: char 1 param: char 2 return: distance

This function computes the distance between the two characters passed on a dvorak keyboard.

param: x1 - point 1's x coordinate param: y1 - point 1's y coordinate param: x2 - point 2's x coordinate param: y2 - point 2's y coordinate return: distance between points

This function returns the distance between two points. If the two points have an x distance of 1, and a y distance of 1, then they are considered to be a distance of 1 apart. This is meant to help prevent horizontal/vertical bias in the distancing function. Otherwise we use the following formula:

sqrt( (x1 - x2)**2 + (y1 - y2)**2 );

param: string1 param: string2 return: distance

Returns the sum of the distances between corresponding characters in the two strings. If one string is longer than the other the remaining characters are counted as having the same value as the maximum distance.

param: string1 param: string2 return: probability of match

The probability of a match is:

Pr = 1 - ( D / (L * M) )

Where D is the distance between the two strings, L is the length of the longer string, and M is the maximum character distance.

param: string1 param: string2 return: distance

Returns the sum of the distances between corresponding characters in the two strings. If one string is longer than the other the remaining characters are counted as having the same value as the maximum distance.

param: string1 param: string2 return: probability of match

The probability of a match is computed in the same way as for qwerty_keyboard_distance_match().

The grids are specific to US style PC keyboards. It is possible to substitue and use your own keyboard maps, but it's not well documented. Also, even US keyboards differ, for instance, in where they have the backslash and pipe characters. The grids in this module assume that it's in the last position of the second row.

The space and tab keys are ignored entierly. Whitespace is irrelevant for the most common use of this module (for the author at least), which is matching words or names.

Numbers are distanced from their locations at the top of the keyboard. It should be an option to perform a keyboard distance using only the numeric keypad.

If one string is a substring of the other (abstained vs stained), the matching could be improved by either starting at the substring offset, or by prepending spaces onto the front of the shorter string so it matched the substring offset.

Currently the maps, and the max lengths are computed dynamicly. The module would load faster if these were hard coded data.

It should be possible to speed up the qwerty_char_distance and dvorak_char_distance functions by caching (for example, by using the Memoize module).

Kyle R. Burton krburton@cpan.org kburton@hmsonline.com HMS 625 Ridge Pike Building E Suite 400 Conshohocken, PA 19428

perl(1). String::Approx. Text::DoubleMetaphone. String::Similarity.

syntax highlighting: