HTML-ContentExtractor

version 0.02

Web pages often surround the body of an article with clutter (such as ads, unnecessary images and extraneous links) that distracts the reader from the actual content. This module reduces that noise and thereby identifies the content-rich regions of a page.

A web page is first parsed by an HTML parser, which corrects the markup and builds a DOM (Document Object Model) tree. The module then walks the tree depth-first, identifying and removing noise nodes so that only the main content remains. Three kinds of nodes are removed: useless nodes (script, style, etc.); container nodes (table, div, etc.) whose link/text ratio exceeds a threshold, where the link/text ratio is the number of links divided by the number of non-linked words; and nodes that contain any string from a predefined spam string list.
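
The pruning rules described above can be illustrated with a short, self-contained sketch. It is written in Python using only the standard library and is not the module's own Perl code. Every name in it (Node, TreeBuilder, counts, prune, extract), the default threshold of 0.25, the choice to prune children before judging their parent, and the choice to apply the spam check only to container nodes are assumptions made for illustration.

```python
from html.parser import HTMLParser

SKIP_TAGS = {"script", "style"}    # "useless" nodes, dropped outright
CONTAINER_TAGS = {"table", "div"}  # containers tested against the threshold
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}  # never have children


class Node:
    """A minimal DOM node; children are Node objects or text strings."""
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a simple tree from (possibly sloppy) HTML."""
    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.children.append(data)


def counts(node, in_link=False):
    """Return (number of links, number of non-linked words) in a subtree."""
    links = words = 0
    for child in node.children:
        if isinstance(child, str):
            if not in_link:
                words += len(child.split())
        else:
            is_link = child.tag == "a"
            sub_links, sub_words = counts(child, in_link or is_link)
            links += sub_links + (1 if is_link else 0)
            words += sub_words
    return links, words


def text_of(node):
    """Concatenate all text found below a node."""
    return " ".join(c if isinstance(c, str) else text_of(c)
                    for c in node.children)


def prune(node, threshold, spam):
    """Depth-first removal of noise nodes (children are pruned first)."""
    kept = []
    for child in node.children:
        if isinstance(child, str):
            kept.append(child)
            continue
        if child.tag in SKIP_TAGS:
            continue
        prune(child, threshold, spam)
        if child.tag in CONTAINER_TAGS:
            links, words = counts(child)
            if links / max(words, 1) > threshold:  # assumed ratio definition
                continue
            text = text_of(child)
            if any(s in text for s in spam):       # assumed: containers only
                continue
        kept.append(child)
    node.children = kept


def extract(html, threshold=0.25, spam=()):
    """Return the text left after pruning, with whitespace normalised."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    prune(builder.root, threshold, spam)
    return " ".join(text_of(builder.root).split())
```

Calling extract(html, spam=("Copyright",)) on a page would drop script and style elements, any div or table made up mostly of links, and any div or table containing the string "Copyright". Pruning children before their parent means an enclosing container is judged only on what survives inside it, which keeps a single noisy block from discarding the whole page.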

INSTALLATION

To install this module, run the following commands:

    perl Makefile.PL
    make
    make test
    make install


SUPPORT AND DOCUMENTATION

After installing, you can find documentation for this module with the perldoc command.

    perldoc HTML::ContentExtractor

You can also look for information at:

    Search CPAN:
        http://search.cpan.org/dist/HTML-ContentExtractor

    CPAN Request Tracker:
        http://rt.cpan.org/NoAuth/Bugs.html?Dist=HTML-ContentExtractor

    AnnoCPAN, annotated CPAN documentation:
        http://annocpan.org/dist/HTML-ContentExtractor

    CPAN Ratings:
        http://cpanratings.perl.org/d/HTML-ContentExtractor

COPYRIGHT AND LICENCE

Copyright (C) 2007 Zhang Jun

This program is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.