HTML::Similarity - Calculate the structural similarity between two HTML documents
    use HTML::Similarity;

    my $hs = HTML::Similarity->new;

    my $a = "<html><body></body></html>";
    my $b = "<html><body><h1>HOMEPAGE</h1><h2>Details</h2></body></html>";

    my $score = $hs->calculate_similarity($a, $b);
    print "Similarity: $score\n";
This module is a small tool for calculating the structural similarity between any two HTML documents. The underlying algorithm is simple: it serializes each HTML tree into an array of node tag names, then finds the longest common subsequence (LCS) of the two arrays.
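The similarity is measured with the formula (2 * LCS length) / (treeA's length + treeB's length), where each tree's length is the number of tag names in its serialized array. The score ranges from 0 (no tags in common) to 1 (identical tag sequences).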
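The scoring described above can be sketched in plain Perl. This is a minimal illustration under stated assumptions, not the module's actual implementation: the helper names tag_sequence, lcs_length, and structural_similarity are hypothetical, a regular expression stands in for real HTML parsing, and the LCS length is computed with a textbook dynamic-programming table rather than Algorithm::LCS.

```perl
use strict;
use warnings;

# Hypothetical helper: list opening-tag names in document order.
# Assumption: a regex replaces a real parser, so this only suits
# simple, well-formed markup.
sub tag_sequence {
    my ($html) = @_;
    return map { lc($_) } $html =~ /<([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>/g;
}

# Length of the longest common subsequence of two array refs,
# using a row-by-row dynamic-programming table.
sub lcs_length {
    my ($x, $y) = @_;
    my @prev = (0) x (scalar(@$y) + 1);
    for my $i (1 .. scalar(@$x)) {
        my @curr = (0);
        for my $j (1 .. scalar(@$y)) {
            if ($x->[$i - 1] eq $y->[$j - 1]) {
                $curr[$j] = $prev[$j - 1] + 1;
            }
            else {
                $curr[$j] = $prev[$j] > $curr[$j - 1] ? $prev[$j] : $curr[$j - 1];
            }
        }
        @prev = @curr;
    }
    return $prev[-1];
}

# (2 * LCS length) / (length of A + length of B)
sub structural_similarity {
    my ($html_a, $html_b) = @_;
    my @ta = tag_sequence($html_a);
    my @tb = tag_sequence($html_b);
    my $total = @ta + @tb;
    return 0 if $total == 0;    # assumption: two tagless inputs score 0
    return 2 * lcs_length(\@ta, \@tb) / $total;
}

my $doc_a = "<html><body></body></html>";
my $doc_b = "<html><body><h1>HOMEPAGE</h1><h2>Details</h2></body></html>";
printf "Similarity: %.3f\n", structural_similarity($doc_a, $doc_b);
```

In this sketch the two documents serialize to (html, body) and (html, body, h1, h2), so the LCS length is 2 and the score is 2 * 2 / (2 + 4), printed as 0.667. The module itself may report a different value for the same input, since a real parser can add implied nodes to the tree.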
Structural similarity can be useful for web page classification and clustering.
HTML::DOM, Algorithm::LCS
Copyright (c) 2011 Yung-chung Lin.
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
To install HTML::Similarity, copy and paste the appropriate command into your terminal.
cpanm:
    cpanm HTML::Similarity
CPAN shell:
    perl -MCPAN -e shell
    install HTML::Similarity