HTML-ContentExtractor

version 0.02

Web pages often surround the body of an article with clutter (such as ads, unnecessary images and extraneous links) that distracts the reader from the actual content. This module reduces that noise and thereby identifies the content-rich regions of a page.

A web page is first parsed by an HTML parser, which corrects the markup and creates a DOM (Document Object Model) tree. The DOM tree is then navigated with a depth-first traversal, during which noise nodes are identified and removed, so that only the main content remains. Three kinds of nodes are removed: useless nodes (script, style, etc.); container nodes (table, div, etc.) whose link/text ratio is higher than a threshold, where the link/text ratio is the ratio of the number of links to the number of non-linked words; and nodes that contain any string from the predefined spam string list.
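
The three removal rules described above can be sketched as follows. This is a minimal, language-neutral illustration in Python using only the standard library; it is not the module's Perl implementation. The tag sets, the threshold value, the spam string, the post-order pruning and all function names are assumptions made for the example.

```python
from html.parser import HTMLParser

# All of the values below are assumptions for this sketch, not the module's defaults.
SKIP_TAGS = {"script", "style"}            # "useless" nodes, always removed
CONTAINER_TAGS = {"table", "div"}          # containers checked for link/text ratio
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}
SPAM_STRINGS = ["sponsored links"]         # hypothetical spam string list
THRESHOLD = 0.25                           # hypothetical link/text ratio threshold


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []                 # Node instances or text strings


class TreeBuilder(HTMLParser):
    """Builds a very small DOM-like tree from HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.children.append(data)


def counts(node, in_link=False):
    """Return (number of links, number of non-linked words) in a subtree."""
    links = 1 if node.tag == "a" else 0
    in_link = in_link or node.tag == "a"
    words = 0
    for child in node.children:
        if isinstance(child, str):
            if not in_link:
                words += len(child.split())
        else:
            sub_links, sub_words = counts(child, in_link)
            links += sub_links
            words += sub_words
    return links, words


def text_of(node):
    return " ".join(c if isinstance(c, str) else text_of(c) for c in node.children)


def is_noise(node):
    if node.tag in SKIP_TAGS:
        return True
    if node.tag in CONTAINER_TAGS:
        links, words = counts(node)
        if links > THRESHOLD * max(words, 1):   # ratio above threshold
            return True
    text = text_of(node).lower()
    return any(spam in text for spam in SPAM_STRINGS)


def prune(node):
    """Depth-first traversal; children are pruned before their parent is judged."""
    kept = []
    for child in node.children:
        if isinstance(child, str):
            kept.append(child)
            continue
        prune(child)
        if not is_noise(child):
            kept.append(child)
    node.children = kept


def extract_text(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    prune(builder.root)
    return " ".join(text_of(builder.root).split())
```

Calling extract_text() on a page that has a navigation div made up only of links, a script block and an article div would be expected to keep only the article text. Pruning children before judging their parent is a design choice of this sketch: it keeps an outer container from being discarded because of a noisy inner block that has already been removed.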

INSTALLATION

To install this module, run the following commands:

    perl Makefile.PL
    make
    make test
    make install


SUPPORT AND DOCUMENTATION

After installing, you can find documentation for this module with the perldoc command.

    perldoc HTML::ContentExtractor

You can also look for information at:

    Search CPAN:
        http://search.cpan.org/dist/HTML-ContentExtractor

    CPAN Request Tracker:
        http://rt.cpan.org/NoAuth/Bugs.html?Dist=HTML-ContentExtractor

    AnnoCPAN, annotated CPAN documentation:
        http://annocpan.org/dist/HTML-ContentExtractor

    CPAN Ratings:
        http://cpanratings.perl.org/d/HTML-ContentExtractor

COPYRIGHT AND LICENCE

Copyright (C) 2007 Zhang Jun

This program is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.
