HTML::Feature - Extract Feature Sentences From HTML Documents
    use HTML::Feature;

    # simple usage
    my $feature = HTML::Feature->new;
    my $result  = $feature->parse("http://www.perl.com");

    print "Title:",         $result->title, "\n";
    print "Description:",   $result->desc,  "\n";
    print "Featured Text:", $result->text,  "\n";

    # you can set some engine modules serially.
    # ( if one module can't extract text, it calls to next module )
    my $feature = HTML::Feature->new(
        engines => [
            'HTML::Feature::Engine::LDRFullFeed',
            'HTML::Feature::Engine::GoogleADSection',
            'HTML::Feature::Engine::TagStructure',
        ],
    );
    my $result = $feature->parse($url);

    # And you can set your custom engine module in arbitrary place.
    my $feature = HTML::Feature->new(
        engines => [
            'Your::Custom::Engine::Module'
        ],
    );
This module extracts blocks of feature sentences from an HTML document.
As of version 3.0, it provides three engines:
1. LDRFullFeed
   Uses wedata's database, which is compatible with LDR Full Feed. See http://wedata.net/help/about (Japanese only).

2. GoogleADSection
   Extracts text marked by the 'Google AD Section' HTML comment tags.

3. TagStructure
   The default engine. It guesses and extracts feature sentences from the HTML tag structure.

Unlike other modules that perform similar tasks, this module by default extracts blocks without morphological analysis, using simple statistical processing instead. Because of this, HTML::Feature::Engine::TagStructure has an advantage over similar modules: it can be applied to documents in any language.
Instantiates a new HTML::Feature object. Takes the following parameters:
    my $f = HTML::Feature->new(%param);

    my $f = HTML::Feature->new(
        engines => [
            class_name1,
            class_name2,    # backend engine module (default: 'TagStructure')
            class_name3
        ],
        user_agent   => 'my-agent-name',        # LWP::UserAgent->agent (default: 'libwww-perl/#.##')
        http_proxy   => 'http://proxy:3128',    # http proxy server (default: '')
        timeout      => 10,                     # set the timeout value in seconds. (default: 180)
        not_decode   => 1,                      # if this value is 1, HTML::Feature does not decode the HTML document (default: '')
        not_encode   => 1,                      # if this value is 1, HTML::Feature does not encode the result value (default: '')
        element_flag => 1,                      # flag of HTML::Element object as returned value (default: '')
    );
Specifies the class name of the engine that you want to use.
HTML::Feature is designed to work with several different engines. To customize its behavior, specify your own engine class in this parameter.
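The serial fallback between engines can be illustrated with a small standalone sketch. It does not use HTML::Feature or its real engine interface: the two `My::Engine::*` classes, their `extract` method, and the `extract_with_fallback` helper are all hypothetical names invented for this example. It only shows the general idea of trying each engine in order and keeping the first non-empty result.

```perl
use strict;
use warnings;

# Hypothetical engines for illustration only. Real HTML::Feature engines
# have their own interface, which is not reproduced here.
package My::Engine::AlwaysEmpty;
sub new     { return bless {}, shift }
sub extract { return; }    # pretends it could not extract anything

package My::Engine::FirstParagraph;
sub new { return bless {}, shift }

sub extract {
    my ( $self, $html ) = @_;

    # Naive rule: use the content of the first <p> element, if any.
    return $html =~ m{<p[^>]*>(.*?)</p>}is ? $1 : undef;
}

package main;

# Try each engine class in order and return the first non-empty result
# together with the name of the class that produced it.
sub extract_with_fallback {
    my ( $html, @engine_classes ) = @_;
    for my $class (@engine_classes) {
        my $text = $class->new->extract($html);
        return ( $text, $class ) if defined $text && length $text;
    }
    return;    # no engine produced any text
}

my ( $text, $used ) = extract_with_fallback(
    '<html><body><p>Hello, feature text.</p></body></html>',
    'My::Engine::AlwaysEmpty',
    'My::Engine::FirstParagraph',
);
print "$used extracted: $text\n";
```

Here the first engine yields nothing, so the second one is consulted. Ordering the list from most specific to most general mirrors the engine order shown in the synopsis.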
    my $result = $f->parse($url);
    # or
    my $result = $f->parse($html_ref, [$url]);
    # or
    my $result = $f->parse($http_response);
Parses the given argument. The argument can be a URL, a string of HTML (which must be passed as a scalar reference), or an HTTP::Response object. HTML::Feature detects the argument type and delegates to the appropriate method (see below).
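As a rough sketch of how the three argument types can be told apart, the following standalone snippet classifies an argument using only core Perl. It is not the module's actual implementation; the `classify_parse_argument` function and the labels it returns are assumptions made for illustration.

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed);

# Classify an argument along the lines described for parse().
# The returned labels are illustrative, not HTML::Feature method names.
sub classify_parse_argument {
    my ($arg) = @_;

    # An object that is (or inherits from) HTTP::Response
    return 'response' if blessed($arg) && $arg->isa('HTTP::Response');

    # HTML must be passed as a reference to a scalar
    return 'html' if ref($arg) eq 'SCALAR';

    # Any plain defined string is treated as a URL
    return 'url' if defined $arg && !ref $arg;

    die "Unsupported argument to parse\n";
}

my $html = '<html><body><p>text</p></body></html>';
print classify_parse_argument('http://www.perl.com'), "\n";    # prints "url"
print classify_parse_argument( \$html ), "\n";                  # prints "html"
```

Requiring a scalar reference for HTML is what makes the check unambiguous: a plain string is always a URL, so the HTML content never has to be inspected to decide which path to take.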
Parses a URL. This method uses LWP::UserAgent to fetch the given URL.
Parses a string containing HTML. If you use 'HTML::Feature::Engine::LDRFullFeed', you must also pass $url.
Parses an HTTP::Response object.
Accessor method that returns the HTML::Feature::FrontParser object.
Accessor method that returns the HTML::Feature::Engine object.
Takeshi Miki <firstname.lastname@example.org>