
NAME

Lingua-Orthon - Orthographic similarity of string to one or more others by Coltheart's N and related measures

VERSION

This is documentation for Version 0.03 of Lingua::Orthon.

SYNOPSIS

 use Lingua::Orthon 0.03;
 my $orthon = Lingua::Orthon->new();
 my $bool = $orthon->are_orthons('BANG', 'BARN'); # 0
 $bool = $orthon->are_orthons('BANG', 'BONG'); # 1
 my $idx = $orthon->index_diff('BANK', 'BARK'); # 2
 my $count = $orthon->index_identical('BANG', 'BARN'); # 2
 my (@diff) = $orthon->char_diff('BANG', 'BONG'); # (qw/A O/)
 $count = $orthon->onc(
    test => 'BANG',
    sample => [qw/BAND COCO BING RANG BONG SONG/]); # 4
 my $aref = $orthon->list_orthons(
    test => 'BANG',
    sample => [qw/BAND COCO BING RANG BONG SONG/]); # BAND, BING, RANG, BONG
 $count = $orthon->levenshtein('BANG', 'BARN'); # 2
 my $float = $orthon->old(
    test => 'BANG',
    sample => [qw/BAND COCO BING RANG BONG SONG/]); # ~= 1.67

DESCRIPTION

Lingua-Orthon provides measures of the similarity of character strings based on their orthographic identity, as relevant to psycholinguistic research. Case- and mark-sensitivity for determining character equality can be controlled. Wrappers to Levenshtein Distance methods, extended to the OLD-20 metric, are provided for convenience of comparison. No methods are explicitly exported; all methods are called in the object-oriented way.

SUBROUTINES/METHODS

new

 my $ortho = Lingua::Orthon->new();

Constructs/returns class object for accessing other methods.

Optionally, set the argument match_level to an integer value ranging from 0 to 3 to control case- and mark-sensitivity. See set_eq.

are_orthons

 $bool = $orthon->are_orthons('String1', 'String2');

Returns 0 or 1 (Coltheart's Boolean) if two given strings are orthographic neighbours by a 1-mismatch substitution: i.e., the strings are of the same size (are equal in character count) and there is only one discrepancy between them by a single substitution of a character in the same ordinal position (no additions, deletions or transpositions). So BANG and BAND are orthons by this measure (they differ only in the final letter), but BANG and BRANG are not (the letter R is an addition to BANG via BRANG, or a deletion from BRANG to BANG).

Identical strings: If two identical letter strings are given (BANG, BANG), they are defined as not being orthons: the number of index identical characters must be at least one less than the length of the string(s).
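The definition above can be sketched in a few lines of plain Perl. This is only an illustration of the rule (equal length, exactly one differing position), not the module's own code; the helper name is_single_substitution_neighbour is hypothetical, and characters are compared with Perl's eq, so case and marks are respected.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Lingua::Orthon): returns 1 if the two
# strings have the same length and differ at exactly one position, else 0.
sub is_single_substitution_neighbour {
    my ($s1, $s2) = @_;
    return 0 if length($s1) != length($s2);
    my @c1 = split //, $s1;
    my @c2 = split //, $s2;
    # Count the positions at which the characters differ (plain eq/ne).
    my $mismatches = grep { $c1[$_] ne $c2[$_] } 0 .. $#c1;
    return $mismatches == 1 ? 1 : 0;
}

print is_single_substitution_neighbour('BANG', 'BAND'),  "\n";    # 1
print is_single_substitution_neighbour('BANG', 'BRANG'), "\n";    # 0: lengths differ
print is_single_substitution_neighbour('BANG', 'BANG'),  "\n";    # 0: identical strings
```

Identical strings give zero mismatches and so fall out as non-orthons without a special case.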

Case-sensitivity: Whether case counts in determining identity depends on the match_level set in new or set_eq. At levels 1 and 2, identity is defined case-insensitively; e.g., BaNG and bAnd are orthons, while Bang and bang count as identical (and so are not orthons). At the default level (undef or 0) and at level 3, case is respected; e.g., Bang and bang are orthons, but Bang and bing are NOT orthons (they involve substituting both the Bs and the second letters, a and i), whereas BaNG and BiNG, or BaNG and BING, are orthons. (This usefully applies to putting Coltheart's N (the sum of single-substitution orthons a string has within a lexicon) to questions of the featural versus lexical basis of neighbourhood effects.)

See Coltheart et al. (1977) (in REFERENCES). The measure is computationally simple and economical relative to measures based on a wider array of edit types (e.g., Levenshtein Distance). Those measures have greater explanatory power (Yarkoni et al., 2008), but can tax resources on the order of days to compute for a single string relative to a humanly memorable corpus.

index_identical

 $count = $orthon->index_identical('String1', 'String2');

Returns a count: the number of letters that are identical and in the same serial position in two given letter-strings.

For example, given BANG and BARN, 2 is returned for the two index-identical letters, B and A; N is in both strings, but it is ignored as it is the third letter in BANG but the fourth letter in BARN, and so not in the same serial position across the two words.

index_diff

 $posint = $orthon->index_diff('String1', 'String2');

Assuming the two strings are single-substitution orthons, returns the single index (anchored at zero) at which their letters differ. So if the two strings are "bring" and "being", the returned value is 1.

char_diff

 @ari = $orthon->char_diff('String1', 'String2');

Returns a list of the first two characters (letters) that, reading from left to right, differ between two given strings. If the strings are single-substitution orthons, these are the characters that make them so. So if the two strings are "bring" and "being", the returned list is ('r', 'e'), with the order of these characters in the returned list respecting the order of the given strings. The search across the strings terminates as soon as there is a mismatch; otherwise, it continues only for as long as the length of the shortest string.

The identity match (or mismatch) is sensitive to the setting of the equality function per case and marks; see set_eq.
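The left-to-right search described for index_diff and char_diff can be sketched as follows. This is an illustrative stand-in, not the module's implementation: the helper first_mismatch is hypothetical, it returns the index and both characters together, and it compares with plain eq rather than the configurable equality function.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (index, char1, char2) for the first position
# at which the two strings differ, scanning only as far as the shorter
# string; returns an empty list if no mismatch is found in that span.
sub first_mismatch {
    my ($s1, $s2) = @_;
    my $limit = length($s1) < length($s2) ? length($s1) : length($s2);
    for my $i (0 .. $limit - 1) {
        my $c1 = substr($s1, $i, 1);
        my $c2 = substr($s2, $i, 1);
        return ($i, $c1, $c2) if $c1 ne $c2;    # stop at the first mismatch
    }
    return;
}

my ($idx, @chars) = first_mismatch('bring', 'being');
print "$idx: @chars\n";    # 1: r e
```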

onc, coltheart_n

 $int = $orthon->onc(test => CHARSTR, sample => AREF); 

Returns the orthographic neighbourhood count (ONC), a.k.a. Coltheart's N: the number of single-letter substitution orthons a particular string has with respect to a list of strings (or "lexicon") (Coltheart et al., 1977). So bat has two orthons (bad and cat) in the list (bad, bed, cat and day).

list_orthons

 $aref = $orthon->list_orthons(test => CHARSTR, sample => AREF);
 

Returns a reference to an array of single-substitution orthographic neighbours of a given test character-string that are among a given list of sample character-strings. The reference is to an empty array if no orthons are found. The order of items in the returned array follows that in which they appear in the sample.
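As a rough illustration of how the count and the list relate, the following sketch filters a sample with grep and takes the size of the result as the neighbourhood count. The helper is_neighbour is hypothetical and uses plain eq comparison; it stands in for whatever are_orthons does internally.

```perl
use strict;
use warnings;

# Hypothetical helper: same length and exactly one differing position.
sub is_neighbour {
    my ($s1, $s2) = @_;
    return 0 if length($s1) != length($s2);
    my $mismatches = grep { substr($s1, $_, 1) ne substr($s2, $_, 1) }
                     0 .. length($s1) - 1;
    return $mismatches == 1 ? 1 : 0;
}

my $test    = 'bat';
my @sample  = qw/bad bed cat day/;
my @orthons = grep { is_neighbour($test, $_) } @sample;    # keeps sample order
my $onc     = scalar @orthons;                             # Coltheart's N

print "$onc: @orthons\n";    # 2: bad cat
```

Because grep preserves the order of its input, the listed neighbours appear in the order they have in the sample.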

ldist, levenshtein

 $count = $orthon->ldist('String1', 'String2'); # minimal, strings will be lower-cased

Returns the Levenshtein Distance between two given letter strings, wrapping to various Perl modules that more or less implement the Levenshtein algorithm, chosen for efficiency and case-sensitivity. Specifically, if the match level has been set at 1 (to ignore case and diacritics), the method uses Text::Levenshtein::distance (which offers "ignoring diacritics"); otherwise, it uses Text::Levenshtein::XS::distance to ignore case but not marks (given present limitations of this module). The required case- and mark-sensitivity are set in the new or set_eq methods. By default, the match is made case- and mark-sensitively (by canned Perl eq).

old

 $mean = $orthon->old(test => CHARSTR, sample => AREF, lim => INT);

Returns the mean orthographic Levenshtein distance (OLD) over the lim smallest edit distances between a given test string and the strings in a sample list. Based on Yarkoni et al. (2008), where, with the value of lim set to 20, the measure substantially contributes to prediction of performance in word recognition tasks. Here, if lim is not defined, not numeric, or greater than the size of the sample, then it is set by default to the size of the sample.

Levenshtein distance is calculated per the method ldist, wrapping to external modules with respect to the conditions of string equality set in new or set_eq. Different settings lead to different speed of calculation. The slowest calculation (by far) occurs if match_level => 1 so that case- and mark-insensitive matching occurs; this relies on the pure Perl implementation in Text::Levenshtein with its argument ignore_diacritics => 1. The fastest calculation (the default) occurs by setting match_level => 3, when exact characters are matched, e.g., B in the test-string and b in a sample-string at the same index across them are taken as unequal and so will count as a substitution. This relies on the C-implementation in Text::Levenshtein::XS. Ignore case but not marks with match_level => 2.
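To make the arithmetic concrete, here is a self-contained sketch of the OLD calculation using a textbook dynamic-programming Levenshtein distance in place of the external modules. The function names lev and mean_old are hypothetical, matching is by plain eq, and treating a lim below 1 like an undefined lim is an assumption of this sketch, not documented behaviour.

```perl
use strict;
use warnings;
use List::Util qw(min sum);

# Textbook Levenshtein distance (insert, delete, substitute), two-row DP.
sub lev {
    my ($s, $t) = @_;
    my @s = split //, $s;
    my @t = split //, $t;
    my @prev = (0 .. scalar(@t));
    for my $i (1 .. scalar(@s)) {
        my @cur = ($i);
        for my $j (1 .. scalar(@t)) {
            my $cost = $s[$i - 1] eq $t[$j - 1] ? 0 : 1;
            $cur[$j] = min($prev[$j] + 1, $cur[$j - 1] + 1, $prev[$j - 1] + $cost);
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Mean of the $lim smallest distances from $test to each sample string.
sub mean_old {
    my ($test, $sample, $lim) = @_;
    my @dists = sort { $a <=> $b } map { lev($test, $_) } @$sample;
    return if !@dists;
    # Fall back to the whole sample when lim is unusable (assumption: also for lim < 1).
    $lim = @dists
        if !defined $lim || $lim !~ /^\d+$/ || $lim < 1 || $lim > @dists;
    return sum(@dists[0 .. $lim - 1]) / $lim;
}

my @sample = qw/BAND COCO BING RANG BONG SONG/;
printf "%.2f\n", mean_old('BANG', \@sample);       # 1.67
printf "%.2f\n", mean_old('BANG', \@sample, 3);    # 1.00
```

For BANG against this sample the six distances are 1, 4, 1, 1, 1 and 2, so the mean over all of them is 10/6, and the mean over the three smallest is 1.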

set_eq

 $orthon->set_eq(match_level => INT); # undef, 0, 1, 2 or 3

Sets the string-matching level used in the above methods. This is called implicitly in new when given a match_level, or with the default value of 0. This is adopted and slightly adapted from how Text::Levenshtein controls for case/diacritic-sensitive matching.

match_level = undef, 0

Match with respect to case and diacritics: same as 3 but simply by Perl's eq. So, e.g., éclair and eclair would be taken as non-identical, just as would Eclair and eclair.

This is the fastest option. The higher levels, as follow, use the eq() function in Unicode::Collate.

match_level = 1

Match ignoring case and diacritics: "ber" to "BéZ" involves 1 edit (from "r" to "Z" only)

match_level = 2

Match ignoring case but respect diacritics: "ber" to "BéZ" involves 2 edits (the "er" to "éZ")

match_level = 3

Match with respect to case and diacritics: "ber" to "BéZ" involves 3 edits (of all its letters)

So, for example, if the test string is "abbé", it could be picked up as having the single-substitution orthographic neighbour "able" if the match level is 1, but not if it is 0, 2 or 3.
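The three collation-based levels can be illustrated directly with Unicode::Collate, whose level option selects how much of a character's identity (base letter, marks, case) takes part in a comparison. The sketch below assumes that match_level values 1 to 3 correspond to the collator's level values 1 to 3; that mapping, and the helper mismatches_at_level, are assumptions made for illustration rather than a description of the module's internals.

```perl
use strict;
use warnings;
use utf8;                 # the source contains a literal accented character
use Unicode::Collate;

# Hypothetical helper: counts same-index character mismatches between two
# equal-length strings at a given Unicode::Collate comparison level.
sub mismatches_at_level {
    my ($s1, $s2, $level) = @_;
    my $collator = Unicode::Collate->new(level => $level);
    my @c1 = split //, $s1;
    my @c2 = split //, $s2;
    my $count = grep { !$collator->eq($c1[$_], $c2[$_]) } 0 .. $#c1;
    return $count;
}

for my $level (1 .. 3) {
    printf "level %d: %d\n", $level, mismatches_at_level('ber', 'BéZ', $level);
}
# level 1: 1
# level 2: 2
# level 3: 3
```

At level 1 the same helper reports a single mismatch between "abbé" and "able", which is why those two strings can count as single-substitution neighbours only at that level.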

DIAGNOSTICS

Invalid value '...' given as a match level

Argument match_level in new() or set_eq() needs to be an integer in range 0 .. 3, or undefined.

Need a single character string to test for orthons

Argument test for calculating ONC and OLD, and listing orthons, needs to be defined and not empty.

Need a single character string to test for orthons

Argument sample should reference an array of character-strings when calculating ONC and OLD, and listing orthons.

REFERENCES

Coltheart, M., Davelaar, E., Jonasson, J. T., & Besner, D. (1977). Access to the internal lexicon. In S. Dornic (Ed.), Attention and performance (Vol. 6, pp. 535-555). London, UK: Academic.

Yarkoni, T., Balota, D. A., & Yap, M. (2008). Moving beyond Coltheart's N: A new measure of orthographic similarity. Psychonomic Bulletin and Review, 15, 971-979. doi: 10.3758/PBR.15.5.971.

DEPENDENCIES

List::AllUtils

Number::Misc

Statistics::Lite

String::Util

Text::Levenshtein

Text::Levenshtein::XS

Unicode::Collate

AUTHOR

Roderick Garton, <rgarton at cpan.org>

SEE ALSO

String::LCSS_XS

String::Similarity

Text::Abbrev

BUGS AND LIMITATIONS

Please report any bugs or feature requests to bug-Lingua-Orthon-0.03 at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Lingua-Orthon-0.03. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

SUPPORT

You can find documentation for this module with the perldoc command.

    perldoc Lingua::Orthon


LICENSE AND COPYRIGHT

Copyright 2011-2018 Roderick Garton.

This program is free software; you can redistribute it and/or modify it under the terms of either: the GNU General Public License as published by the Free Software Foundation; or the Artistic License.

See http://dev.perl.org/licenses/ for more information.