Module Version: 0.12 (latest release: HTML-Content-Extractor-0.17)

NAME

HTML::Content::Extractor - Extracts the main text of a publication from an HTML page, along with the main media content associated with that text

SYNOPSIS

 use HTML::Content::Extractor;
 
 # $html holds the HTML source of the page to analyze
 my $obj = HTML::Content::Extractor->new();
 $obj->analyze($html);
 
 my $main_text   = $obj->get_main_text();
 my $main_images = $obj->get_main_images();
 
 print $main_text, "\n\n";
 
 print "Images:\n";
 foreach my $url (@$main_images) {
     print $url, "\n";
 }

DESCRIPTION

This module analyzes an HTML document and extracts its main text (for example, the body of an article on a news site) together with all images related to that text.

METHODS

new

 my $obj = HTML::Content::Extractor->new();

Creates the object and prepares the internal structures for subsequent HTML parsing and analysis.

analyze

 $obj->analyze($html);

Builds a document tree from the HTML in $html and analyzes it.
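As a minimal sketch of how a document might be passed to analyze, the helper below reads a saved page from disk and returns its main text. The name text_from_file and the file path argument are hypothetical and not part of the module. Reading the file as raw bytes is also an assumption, since this page does not say whether analyze expects bytes or decoded characters.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of HTML::Content::Extractor:
# read a saved HTML page and return its main text, using only
# the documented analyze() and get_main_text() calls.
sub text_from_file {
    my ($extractor, $path) = @_;

    # Assumption: the document is handed to analyze() as raw bytes.
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $html = do { local $/; <$fh> };    # slurp the whole file
    close $fh;

    $extractor->analyze($html);
    return $extractor->get_main_text();
}

# Usage: perl script.pl page.html
if (@ARGV) {
    require HTML::Content::Extractor;
    my $obj = HTML::Content::Extractor->new();
    print text_from_file($obj, $ARGV[0]), "\n";
}
```

The extractor object is passed in rather than created inside the helper, so one object can be supplied by the caller and the helper stays independent of how it was constructed.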

get_main_text

 # UTF-8 on (the default)
 my $main_text = $obj->get_main_text(1);
 # UTF-8 off
 $main_text = $obj->get_main_text(0);

Returns the main text as plain text.

get_main_images

 # UTF-8 on (the default)
 my $main_images = $obj->get_main_images(1);
 # UTF-8 off
 $main_images = $obj->get_main_images(0);

Returns an array reference containing the image URLs.
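Because the result is an array reference, it can be post-processed with ordinary list operations. The sketch below drops empty and repeated entries while keeping first-seen order. The helper name unique_images is hypothetical, the sample URLs are made up for illustration, and whether get_main_images can actually return duplicates is not stated here.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the module: remove undefined, empty
# and repeated entries from an array reference of URLs, preserving order.
sub unique_images {
    my ($images) = @_;
    my %seen;
    return [
        grep { defined($_) && length($_) && !$seen{$_}++ }
        @{ $images || [] }
    ];
}

# Made-up sample data standing in for $obj->get_main_images()
my $sample = [
    'http://example.com/a.png',
    'http://example.com/b.png',
    'http://example.com/a.png',
];
print "$_\n" for @{ unique_images($sample) };
```

With an analyzed object, the same helper would be called as unique_images($obj->get_main_images()).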

DESTROY

 undef $obj;

Frees all internal structures (the HTML tree and others).

AUTHOR

Alexander Borisov <lex.borisov@gmail.com>

COPYRIGHT AND LICENSE

This software is copyright (c) 2013 by Alexander Borisov.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.
