Contributed enhancement by Tani Hosokawa: not a bug,
but an optimization.
The original version did an inefficient repeated linear search over text that can't possibly match.
The new version precaches the locations of keywords.
Comparing 100 semi-randomly generated, fairly similar documents of about 500 words each results in an approx. 90% speed increase,
and the gain grows as the documents get larger.
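The precaching idea can be sketched as follows. This is only an illustration of the general technique, not the code that was contributed: the names index_positions and candidate_starts are hypothetical, and splitting text into words is assumed to have been done already.

```perl
use strict;
use warnings;

# Hypothetical helper: map each word to the list of positions where it occurs.
sub index_positions {
    my (@words) = @_;
    my %positions;
    for my $i (0 .. $#words) {
        push @{ $positions{ $words[$i] } }, $i;
    }
    return \%positions;
}

# Hypothetical helper: positions in the indexed text where a match that
# begins with $word could start. A word that never occurs yields an empty
# list, so no scan of the text is needed for it.
sub candidate_starts {
    my ($index, $word) = @_;
    return @{ $index->{$word} // [] };
}

my @doc2   = qw(the cat sat on the mat);
my $index  = index_positions(@doc2);
my @starts = candidate_starts($index, 'the');   # positions 0 and 4
my @none   = candidate_starts($index, 'dog');   # empty list
```

Building the index costs one pass over the text; after that, each lookup avoids rescanning words that cannot match.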
Made various documentation/typo fixes as suggested by Alex Becker.
Found in CPAN bug list.
This release includes changes contributed by Myroslava Dzikovska that provide the full set of similarity scores programmatically.
She modified the interface so that the getSimilarity function returns a pair ($score,
%allScores) where %allScores is a hash of all possible scores that it computes.
She made it so that in scalar context it will only return $score,
so it is fully backwards compatible with the older versions.
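One common way to get this kind of context-dependent return in Perl is the built-in wantarray. The sketch below assumes that approach; similarity_sketch is a hypothetical stand-in with made-up values, not the module's actual getSimilarity.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a similarity function; the scores are made up.
sub similarity_sketch {
    my %all_scores = (precision => 0.5, recall => 0.25);
    my $score = 2 * $all_scores{precision} * $all_scores{recall}
              / ($all_scores{precision} + $all_scores{recall});

    # List context: the score followed by the key/value pairs of all scores.
    # Scalar context: just the score, as older callers expect.
    return wantarray ? ($score, %all_scores) : $score;
}

my $only_score    = similarity_sketch();   # scalar context
my ($score, %all) = similarity_sketch();   # list context
```

Existing callers that assign the result to a scalar keep working unchanged, while new callers can capture the extra scores.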
She also changed the printing to go to STDERR,
to make it easier to use the code in filter scripts that depend on STDIN/STDOUT.
This release also includes changes contributed by Nathan Glen to allow test cases to pass on Windows.
The single quotes used previously caused arguments to the script not to be passed correctly,
leading to test failures.
The single quotes have been changed to double quotes.
Added Dice coefficient to Overlaps.pm output. Dice is equivalent to F-measure, but formulated slightly differently, so it could be useful for catching errors.
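The equivalence can be checked with a small sketch. It assumes precision and recall are the overlap count divided by the length of each text; f_measure and dice are hypothetical helper names, not functions from Overlaps.pm.

```perl
use strict;
use warnings;

# F-measure as the harmonic mean of precision and recall (assumed definitions:
# overlap divided by the length of one text or the other).
sub f_measure {
    my ($overlap, $len1, $len2) = @_;
    return 0 if $overlap == 0;
    my $precision = $overlap / $len2;
    my $recall    = $overlap / $len1;
    return 2 * $precision * $recall / ($precision + $recall);
}

# Dice coefficient computed directly from the counts.
sub dice {
    my ($overlap, $len1, $len2) = @_;
    return 0 if $len1 + $len2 == 0;
    return 2 * $overlap / ($len1 + $len2);
}

# With 3 overlapping words and texts of 6 and 4 words, both come to 0.6.
my $f = f_measure(3, 6, 4);
my $d = dice(3, 6, 4);
```

Because the two values are reached by different arithmetic, a disagreement between them would point to a bug in one of the computations.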
Modified the Overlaps method to provide a lesk text matching score, which is the sum of the squared lengths of all phrasal matches (optionally normalized by the product of the lengths of the strings). It provides both Raw lesk and lesk (the normalized form) when run in verbose mode.
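The arithmetic can be sketched as follows, assuming the phrasal matches have already been found and that lengths are counted in words. The names raw_lesk and lesk are hypothetical helpers for illustration, not the module's functions.

```perl
use strict;
use warnings;

# Raw score: sum of the squared lengths of the phrasal matches.
sub raw_lesk {
    my (@match_lengths) = @_;
    my $sum = 0;
    $sum += $_ ** 2 for @match_lengths;
    return $sum;
}

# Normalized score: raw score divided by the product of the string lengths.
sub lesk {
    my ($len1, $len2, @match_lengths) = @_;
    return 0 unless $len1 && $len2;
    return raw_lesk(@match_lengths) / ($len1 * $len2);
}

# Matches of 3 words and 1 word between strings of 6 and 5 words:
# raw = 3*3 + 1*1 = 10, normalized = 10 / 30.
my $raw  = raw_lesk(3, 1);
my $norm = lesk(6, 5, 3, 1);
```

Squaring the lengths rewards one long match more than several short ones that cover the same number of words.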
Reorganized some documentation to make it clearer that Overlaps is just one possible way of measuring similarity, and that other methods can and should be added.
Renamed text_compare.pl as the more natural and fitting text_similarity.pl.
Made it possible for users to input strings directly via text_compare.pl and getSimilarityStrings. Previously it was only possible to directly measure the similarity of files, but now strings can be measured.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.