
Text::DeDuper - near-duplicate detection module

use Text::DeDuper;
$deduper = Text::DeDuper->new();
$deduper->add_doc("doc1", $doc1text);
$deduper->add_doc("doc2", $doc2text);
@similar_docs = $deduper->find_similar($doc3text);
...
# delete near duplicates from an array of texts
$deduper = Text::DeDuper->new();
foreach $text (@texts)
{
    # skip texts that resemble one already kept
    next if $deduper->find_similar($text);
    $deduper->add_doc($i++, $text);
    push @no_near_duplicates, $text;
}

This module uses the resemblance measure proposed by Andrei Z. Broder et al. (http://www.ra.ethz.ch/CDstore/www6/Technical/Paper205/Paper205.html) to detect similar (near-duplicate) documents based on their text.
Note of caution: the module only works correctly with languages whose texts can be tokenised into words by detecting sequences of alphabetic characters. It may therefore give poor results for languages such as Chinese.

$deduper = Text::DeDuper->new(<attribute-value-pairs>);
Creates a new DeDuper instance. Supported attributes are described below, in the Attributes section.
$deduper->add_doc($document_id, $document_text);
Add a new document to the DeDuper's database. The $document_id must be unique for each document.
$deduper->find_similar($document_text);
Returns a (possibly empty) array of the IDs of documents in the DeDuper's database that are similar to $document_text. Because an empty array is false in boolean context, this gives a simple test for whether a near-duplicate document is in the database:
if ($deduper->find_similar($document_text))
{
    print "at least one near duplicate found";
}
$deduper->clean()
Removes all documents from DeDuper's database.

Attributes can be set using the constructor:
$deduper = Text::DeDuper->new(
    ngram_size => 4,
    encoding   => 'iso-8859-1'
);
... or using the object methods:
$deduper->ngram_size(4);
$deduper->encoding('iso-8859-1');
The object methods can also be used for retrieving the values of the attributes:
$ngram_size = $deduper->ngram_size();
@stoplist = $deduper->stoplist();
The character encoding of the processed texts. It must be set to the correct value so that alphabetic characters can be detected. Accepted values are those supported by the Encode module (see Encode::Supported).
default: 'utf8'
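To illustrate why the encoding has to match, here is a minimal sketch that decodes a byte string with the core Encode module and then extracts runs of alphabetic characters. The helper name words_from_bytes and the \p{Alphabetic} pattern are assumptions made for this example and are not necessarily how Text::DeDuper tokenises internally. With the wrong encoding, the accented letter is replaced by a substitution character and the word is cut short.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not part of Text::DeDuper): decode raw bytes,
# then return the lowercased runs of alphabetic characters.
sub words_from_bytes {
    my ($encoding, $bytes) = @_;
    my $text = decode($encoding, $bytes);    # bytes -> Perl characters
    return map { lc $_ } $text =~ /(\p{Alphabetic}+)/g;
}

# "cafe au lait" with an e-acute, encoded as ISO-8859-1 (single byte 0xE9)
my $bytes = "caf\xE9 au lait";

my @right = words_from_bytes('iso-8859-1', $bytes);
my @wrong = words_from_bytes('ascii',      $bytes);

binmode STDOUT, ':encoding(UTF-8)';
print scalar(@right), " words, first has ", length($right[0]), " letters\n";  # 3 words, first has 4 letters
print scalar(@wrong), " words, first has ", length($wrong[0]), " letters\n";  # 3 words, first has 3 letters
```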
The similarity threshold defines how similar two documents must be to be considered near duplicates. Similarity values range from 0 to 1. A value of 1 indicates that the documents are exactly the same; a value of 0 means that the documents do not share any n-gram.
Two documents will have a similarity value below the default threshold unless they share a significant part of their text.
default: 0.2
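The similarity value described above corresponds to the resemblance measure: the number of n-grams two documents share divided by the number of distinct n-grams in either of them. The following is a minimal sketch of that calculation; the function name resemblance and the choice to return 0 for two empty inputs are assumptions of this example, not the module's documented behaviour.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Text::DeDuper): resemblance of two
# n-gram lists, |A intersect B| / |A union B|, treating each list as a set.
sub resemblance {
    my ($ngrams_a, $ngrams_b) = @_;
    my %in_a = map { ($_ => 1) } @$ngrams_a;
    my %in_b = map { ($_ => 1) } @$ngrams_b;
    my $shared = grep { $in_b{$_} } keys %in_a;              # size of intersection
    my $union  = keys(%in_a) + keys(%in_b) - $shared;        # size of union
    return $union ? $shared / $union : 0;                    # assumption: empty vs empty gives 0
}

my @a = ('she sells sea', 'sells sea shells', 'sea shells on');
my @b = ('sells sea shells', 'sea shells on', 'shells on the');
my @c = ('peter piper picked');

printf "%.2f\n", resemblance(\@a, \@b);   # 2 shared of 4 distinct: 0.50
printf "%.2f\n", resemblance(\@a, \@a);   # identical sets: 1.00
printf "%.2f\n", resemblance(\@a, \@c);   # no shared n-gram: 0.00
```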
Document similarity is based on how many n-grams the documents have in common. An n-gram is a sequence of n immediately consecutive words. For example, the text
she sells sea shells on the sea shore
contains the following 5-grams:
she sells sea shells on
sells sea shells on the
sea shells on the sea
shells on the sea shore
This attribute specifies the value of n (the size of n-gram).
default: 5
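The 5-grams listed above can be reproduced with a short sliding-window loop. This is only a sketch: the helper name word_ngrams is hypothetical, and the words are obtained here by a plain whitespace split rather than by the module's own tokeniser.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Text::DeDuper): every run of $n
# consecutive words, each joined by single spaces.
sub word_ngrams {
    my ($n, @words) = @_;
    my @ngrams;
    for my $start (0 .. @words - $n) {       # empty range if fewer than $n words
        push @ngrams, join(' ', @words[$start .. $start + $n - 1]);
    }
    return @ngrams;
}

my @words  = split ' ', 'she sells sea shells on the sea shore';
my @ngrams = word_ngrams(5, @words);
print "$_\n" for @ngrams;
```

A text with fewer than n words yields no n-grams at all in this sketch.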
The stoplist is a list of very frequent words for a given language (for English, e.g. a, the, is, ...). It is a good idea to remove the stoplist words from texts before similarity is computed, because two documents are quite likely to share n-grams of frequent words even if they are not similar at all.
The stoplist can be specified either as a list of words or as the name of a file in which the words are stored one per line:
$deduper->stoplist('a', 'the', 'is', @next_stopwords);
$deduper->stoplist('/path/to/english_stoplist.txt');
Do not worry if you do not have a stoplist for your language. DeDuper will do a pretty good job even without one.
default: empty
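As a sketch of the filtering step described above, the following removes stoplist words from a word list before any n-grams are built. The helper name remove_stopwords and the case-insensitive comparison are assumptions of this example; the module's own handling may differ.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Text::DeDuper): drop every word that
# appears in the stoplist, comparing case-insensitively.
sub remove_stopwords {
    my ($stoplist, @words) = @_;
    my %is_stopword = map { (lc($_) => 1) } @$stoplist;
    return grep { !$is_stopword{ lc $_ } } @words;
}

my @stoplist = ('a', 'the', 'is', 'on');
my @words    = split ' ', 'The cat is on the mat';
my @filtered = remove_stopwords(\@stoplist, @words);

print "@filtered\n";   # cat mat
```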

Encode: for decoding texts in various character encodings into Perl's internal form.
Digest::MD4: for hashing n-grams, as an optimisation.
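The idea behind hashing n-grams can be sketched as follows: each n-gram is replaced by a fixed-size digest, so stored keys stay short no matter how long the words are. This example deliberately swaps in the core Digest::MD5 module so that it runs without extra installs; it is not the MD4 digest named above, and the helper ngram_key is hypothetical. How Text::DeDuper stores and compares its digests internally is not described here.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5);
use Encode qw(encode);

# Hypothetical helper (not part of Text::DeDuper): map an n-gram to a
# fixed-size 16-byte binary key. Digest functions expect bytes, so the
# character string is encoded first.
sub ngram_key {
    my ($ngram) = @_;
    return md5(encode('UTF-8', $ngram));
}

my %seen;
$seen{ ngram_key('she sells sea shells on') } = 1;

my $key_length = length ngram_key('sells sea shells on the');
my $known      = exists $seen{ ngram_key('she sells sea shells on') };
my $unknown    = exists $seen{ ngram_key('sells sea shells on the') };

print "$key_length\n";                      # 16
print $known   ? "seen\n" : "new\n";        # seen
print $unknown ? "seen\n" : "new\n";        # new
```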

Please report any bugs or feature requests to bug-Text-DeDuper@rt.cpan.org, or through the web interface at http://rt.cpan.org. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

Encode, Encode::Supported, Digest::MD4
http://www.ra.ethz.ch/CDstore/www6/Technical/Paper205/Paper205.html
Contains, among other things, the definition of the resemblance measure.

Jan Pomikalek, <xpomikal@fi.muni.cz>

Copyright 2006 Jan Pomikalek, All Rights Reserved.
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.