Text::DeDuper - near duplicates detection module
    use Text::DeDuper;

    $deduper = new Text::DeDuper();
    $deduper->add_doc("doc1", $doc1text);
    $deduper->add_doc("doc2", $doc2text);

    @similar_docs = $deduper->find_similar($doc3text);

    ...

    # delete near duplicates from an array of texts
    $deduper = new Text::DeDuper();
    foreach $text (@texts) {
        next if $deduper->find_similar($text);

        $deduper->add_doc($i++, $text);
        push @no_near_duplicates, $text;
    }
This module uses the resemblance measure proposed by Andrei Z. Broder et al. (http://www.ra.ethz.ch/CDstore/www6/Technical/Paper205/Paper205.html) to detect similar (near-duplicate) documents based on their text.
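As a rough illustration of the resemblance measure, the sketch below computes it as the number of word n-grams two texts share divided by the number of distinct n-grams found in either text. The helper names `shingle_set` and `resemblance` and the `\p{Alpha}+` tokeniser are assumptions made for this example. They are not part of Text::DeDuper's API, and the module's internals may differ.

```perl
use strict;
use warnings;

# Hypothetical helper: the set of word n-grams of a text, as a hash ref.
sub shingle_set {
    my ($text, $n) = @_;
    my @words = $text =~ /(\p{Alpha}+)/g;    # assumed tokeniser
    my %set;
    for my $i (0 .. @words - $n) {
        $set{ join ' ', @words[$i .. $i + $n - 1] } = 1;
    }
    return \%set;
}

# Hypothetical helper: |A intersect B| / |A union B| over the n-gram sets.
sub resemblance {
    my ($text_a, $text_b, $n) = @_;
    my $set_a  = shingle_set($text_a, $n);
    my $set_b  = shingle_set($text_b, $n);
    my $shared = grep { exists $set_b->{$_} } keys %$set_a;
    my $union  = keys(%$set_a) + keys(%$set_b) - $shared;
    return $union ? $shared / $union : 0;
}

# The two texts share 5 of their 9 distinct 2-grams.
my $score = resemblance(
    'she sells sea shells on the sea shore',
    'she sells sea shells by the sea shore',
    2,
);
printf "resemblance: %.3f\n", $score;
```

A score computed this way is 1 for texts with identical n-gram sets and 0 for texts that share no n-gram.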
Note of caution: The module only works correctly with languages whose texts can be tokenised into words by detecting sequences of alphabetic characters. It may therefore give poor results for languages such as Chinese.
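The limitation can be seen with a small sketch. It assumes a tokeniser that treats every maximal run of alphabetic characters as one word; the `tokenise` helper is hypothetical and only approximates what the module might do.

```perl
use strict;
use warnings;

# Assumed tokeniser: a word is a maximal run of alphabetic characters.
sub tokenise {
    my ($text) = @_;
    return $text =~ /(\p{Alpha}+)/g;
}

my @english = tokenise('she sells sea shells');

# A seven-character Chinese sentence, written with escapes so the example
# does not depend on the encoding of this file.
my @chinese = tokenise("\x{6211}\x{7231}\x{5317}\x{4EAC}\x{5929}\x{5B89}\x{95E8}");

printf "English tokens: %d\n", scalar @english;
printf "Chinese tokens: %d\n", scalar @chinese;
```

Because the Chinese sentence contains no spaces and every ideograph counts as alphabetic, the whole sentence comes out as a single token, which leaves nothing to build word n-grams from.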
$deduper = new Text::DeDuper(<attribute-value-pairs>);
Create a new DeDuper instance. Supported attributes are described below, in the Attributes section.
$deduper->add_doc($document_id, $document_text);
Add a new document to the DeDuper's database. The $document_id must be unique for each document.
$deduper->find_similar($document_text);
Returns a (possibly empty) array of the IDs of documents in the DeDuper's database that are similar to $document_text. This makes it easy to test whether a near-duplicate document is already in the database:
if ($deduper->find_similar($document_text)) { print "at least one near duplicate found"; }
$deduper->clean()
Removes all documents from DeDuper's database.
Attributes can be set using the constructor:
$deduper = new Text::DeDuper( ngram_size => 4, encoding => 'iso-8859-1' );
... or using the object methods:
    $deduper->ngram_size(4);
    $deduper->encoding('iso-8859-1');
The object methods can also be used for retrieving the values of the attributes:
    $ngram_size = $deduper->ngram_size();
    @stoplist = $deduper->stoplist();
The character encoding of the processed texts. It must be set to the correct value so that alphabetic characters can be detected. Accepted values are those supported by the Encode module (see Encode::Supported).
default: 'utf8'
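The following sketch shows why the encoding matters, using only the core Encode module. The sample bytes and the `\p{Alpha}+` tokenising pattern are assumptions chosen for illustration; how Text::DeDuper applies the encoding internally is not shown here.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The word "cafe" with an acute accent on the e, as raw bytes.
my $latin1_bytes = "caf\xE9";        # ISO-8859-1: one byte for e-acute
my $utf8_bytes   = "caf\xC3\xA9";    # UTF-8: two bytes for e-acute

# Decoding with the right encoding yields the same character string.
my $from_latin1 = decode('iso-8859-1', $latin1_bytes);
my $from_utf8   = decode('UTF-8', $utf8_bytes);

# Tokenise by runs of alphabetic characters (an assumed tokeniser).
my @decoded_words = $from_utf8  =~ /(\p{Alpha}+)/g;
my @raw_words     = $utf8_bytes =~ /(\p{Alpha}+)/g;

print "same word after decoding\n" if $from_latin1 eq $from_utf8;
print "undecoded bytes tokenise differently\n"
    if $raw_words[0] ne $decoded_words[0];
```

Once decoded, the same word from differently encoded documents compares equal, whereas tokenising undecoded bytes cuts the word in the middle of the multi-byte character.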
The similarity threshold defines how similar two documents must be to be considered near duplicates. Similarity values range from 0 to 1. A value of 1 indicates that the documents are exactly the same. A value of 0, on the other hand, means that the documents do not share any n-gram.
Any two documents will have a similarity value below the default threshold unless they share a significant part of their text.
default: 0.2
Document similarity is based on how many n-grams the documents have in common. An n-gram is a sequence of n immediately consecutive words. For example, the text
she sells sea shells on the sea shore
contains the following 5-grams:
    she sells sea shells on
    sells sea shells on the
    sea shells on the sea
    shells on the sea shore
This attribute specifies the value of n (the size of n-gram).
default: 5
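The 5-grams listed above can be reproduced with a short sketch. The `word_ngrams` helper and its `\p{Alpha}+` tokeniser are assumptions for this example, not functions provided by Text::DeDuper.

```perl
use strict;
use warnings;

# Hypothetical helper: all sequences of $n consecutive words in $text.
sub word_ngrams {
    my ($text, $n) = @_;
    my @words = $text =~ /(\p{Alpha}+)/g;    # assumed tokeniser
    return map { join ' ', @words[$_ .. $_ + $n - 1] } 0 .. @words - $n;
}

my @ngrams = word_ngrams('she sells sea shells on the sea shore', 5);
print "$_\n" for @ngrams;
```

A text of w words yields w - n + 1 n-grams, so the eight-word example gives four 5-grams, and a text shorter than n words gives none.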
The stoplist is a list of very frequent words for a given language (for English, e.g. a, the, is, ...). It is a good idea to remove the stoplist words from texts before similarity is computed, because two documents are quite likely to share n-grams of frequent words even if they are not similar at all.
The stoplist can be specified either as a list of words or as the name of a file in which the words are stored one per line:
    $deduper->stoplist('a', 'the', 'is', @next_stopwords);
    $deduper->stoplist('/path/to/english_stoplist.txt');
Do not worry if you do not have a stoplist for your language. DeDuper does a pretty good job even without one.
default: empty
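The effect of a stoplist can be illustrated with a small sketch. The stopword list, the two sample sentences and the helpers `ngram_set` and `shared_ngrams` are made up for this example, and 2-grams are used to keep it short. The module's own filtering may work differently.

```perl
use strict;
use warnings;

# A tiny, made-up English stoplist.
my %stopword = map { $_ => 1 } qw(a the is of in on it);

# Hypothetical helper: the set of n-grams over a list of words.
sub ngram_set {
    my ($n, @words) = @_;
    my %set;
    $set{ join ' ', @words[$_ .. $_ + $n - 1] } = 1 for 0 .. @words - $n;
    return \%set;
}

# Hypothetical helper: sorted list of n-grams two word lists share.
sub shared_ngrams {
    my ($n, $words_a, $words_b) = @_;
    my $set_a  = ngram_set($n, @$words_a);
    my $set_b  = ngram_set($n, @$words_b);
    my @shared = grep { exists $set_b->{$_} } keys %$set_a;
    return sort @shared;
}

my @words_a = split ' ', 'it is on the top of the hill';
my @words_b = split ' ', 'it is on the list of the day';

# Shared 2-grams with and without stoplist filtering.
my @before = shared_ngrams(2, \@words_a, \@words_b);
my @after  = shared_ngrams(
    2,
    [ grep { !$stopword{$_} } @words_a ],
    [ grep { !$stopword{$_} } @words_b ],
);

printf "shared 2-grams before filtering: %d\n", scalar @before;
printf "shared 2-grams after filtering:  %d\n", scalar @after;
```

The two unrelated sentences share four 2-grams made of frequent words only; after the stoplist words are removed, they share none.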
Encode: for decoding texts in various character encodings into Perl's internal form.
Digest::MD4: for hashing n-grams as an optimisation.
Please report any bugs or feature requests to bug-Text-DeDuper@rt.cpan.org, or through the web interface at http://rt.cpan.org. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.
Encode, Encode::Supported, Digest::MD4
http://www.ra.ethz.ch/CDstore/www6/Technical/Paper205/Paper205.html
Contains, among other things, the definition of the resemblance measure.
Jan Pomikalek, <xpomikal@fi.muni.cz>
Copyright 2006 Jan Pomikalek, All Rights Reserved.
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.