NAME

ContentExtractorDriver.pl - Driver for HTML Content Extractor

SYNOPSIS

  perl ContentExtractorDriver.pl <input file> <output file> <Ratio type>

DESCRIPTION

ContentExtractorDriver.pl attempts to extract the main content from an HTML document. It strips tags, scripts, and boilerplate text by locating the region of the document that has the highest ratio of words to tags.
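
The idea of looking for a word-dense region can be illustrated with a minimal sketch. This is not the code shipped in this distribution. The function name extract_densest_region and the tag_penalty parameter are hypothetical, and the scoring is a simple stand-in for the ratio: each word counts +1, each tag counts -tag_penalty, and a maximum-sum subarray scan (Kadane's algorithm) picks the best contiguous run of tokens. The regex-based tokenizer is also an assumption and will not cope with every malformed page.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration; not part of HTML-Content-Extractor.
# Returns the words of the contiguous token run whose score is highest,
# where a word scores +1 and a tag scores -$tag_penalty (assumed default: 1).
sub extract_densest_region {
    my ($html, $tag_penalty) = @_;
    $tag_penalty = 1 unless defined $tag_penalty;

    # Drop script and style elements together with their bodies.
    $html =~ s{<(script|style)\b[^>]*>.*?</\1\s*>}{ }gis;

    # Split the document into tokens: either a whole tag or a single word.
    my @tokens = $html =~ /(<[^>]*>|[^<\s]+)/g;
    return '' unless @tokens;

    # Kadane's algorithm: track the best-scoring contiguous run of tokens.
    my ($best_sum, $best_start, $best_end) = (0, 0, -1);
    my ($cur_sum, $cur_start) = (0, 0);
    for my $i (0 .. $#tokens) {
        my $score = $tokens[$i] =~ /^</ ? -$tag_penalty : 1;
        if ($cur_sum <= 0) {
            # A non-positive running sum cannot help; start a new run here.
            $cur_sum   = 0;
            $cur_start = $i;
        }
        $cur_sum += $score;
        if ($cur_sum > $best_sum) {
            ($best_sum, $best_start, $best_end) = ($cur_sum, $cur_start, $i);
        }
    }
    return '' if $best_end < $best_start;    # no words found at all

    # Keep only the words inside the winning run.
    return join ' ', grep { !/^</ } @tokens[$best_start .. $best_end];
}
```

Called on a page whose navigation links are wrapped in many tags and whose article text sits in a single paragraph, the sketch returns the paragraph text, because short link labels never outweigh the tags around them. Raising tag_penalty makes the selection stricter about crossing markup.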

AUTHOR

Jean Tavernier (jj.tavernier@gmail.com)

COPYRIGHT

Copyright 2005 Jean Tavernier. All rights reserved.

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

SEE ALSO

HTML::Content::ContentExtractor(3).