NAME

HTML::ListScraper - generic web page scraping support

VERSION

Version 0.08

SYNOPSIS

 use HTML::ListScraper;

 my $scraper = HTML::ListScraper->new( api_version => 3,
                                       marked_sections => 1 );
 # set up $scraper options...

 $scraper->parse($html);
 $scraper->eof;

 my @seq = $scraper->find_sequences;
 my $seq = shift @seq;
 if ($seq) { # is-a HTML::ListScraper::Sequence
     foreach my $inst ($seq->instances) { # is-a HTML::ListScraper::Instance
         foreach my $tag ($inst->tags) { # is-a HTML::ListScraper::Tag
             print "<", $tag->name, ">\n";
             print $tag->text, "\n";
         }
     }
 }

DESCRIPTION

While Perl has good support for, and is often used for, extracting machine-friendly data from HTML pages, most scripts written for that task are ad-hoc: they parse just one site's HTML and depend on superficial, transient details of its structure, and are therefore brittle and labor-intensive to maintain. This module tries to support more generic scraping for a class of pages: those whose most important part is a list of links.

HTML::ListScraper is a subclass of HTML::Parser, building on its ability to convert an octet stream - whether strictly valid HTML or something just vaguely similar to it - to tags and text. HTML parsing works the same as with HTML::Parser, except you don't need to register your own HTML event handlers.

Once the document is parsed, call find_sequences to find out which tags in the document repeat, one after the other, more than once (attributes, text and comments are ignored for this comparison). Since there'll probably be quite a lot of such sequences, HTML::ListScraper tries to find the "longest one repeating most often"; specifically, it maximizes log(number of non-overlapping runs)*log(number of tags in the sequence). More than one sequence can reach that maximum, which is why the method returns an array (and the array can also be empty - see below). Your application can then iterate over the returned structure to find items of interest.
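
The score described above can be tried out by hand. The following is only a sketch of the formula as stated in this section, not the module's own code; the helper name sequence_score and the candidate counts are invented for illustration.

```perl
use strict;
use warnings;

# Score from the description: log(runs) * log(tags).
# sequence_score is a hypothetical helper, not part of HTML::ListScraper.
sub sequence_score {
    my ($runs, $tags) = @_;
    return log($runs) * log($tags);
}

# Hypothetical candidates; the counts are invented for illustration.
my @candidates = (
    { name => 'short, frequent', runs => 20, tags => 2  },
    { name => 'balanced',        runs => 8,  tags => 6  },
    { name => 'long, rare',      runs => 2,  tags => 30 },
);

$_->{score} = sequence_score($_->{runs}, $_->{tags}) for @candidates;

# Highest score first.
my ($best) = sort { $b->{score} <=> $a->{score} } @candidates;

printf "%-16s %.3f\n", $_->{name}, $_->{score} for @candidates;
print "best: $best->{name}\n";
```

With these made-up numbers the "balanced" candidate wins: the product of logarithms favors sequences that are both reasonably long and reasonably frequent over ones that are extreme in only one respect.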

This module includes a script, scrape, that displays the sequences found by HTML::ListScraper, so that you can see which items your application needs - and if they aren't there, you can try to tweak HTML::ListScraper's settings with the various scrape switches to make it find more.

HTML::ListScraper methods are as follows:

new

HTML::ListScraper's constructor. Passes all its parameters to the superclass and registers HTML::Parser's event handlers start, text and end.

min_count

Numeric threshold for the frequency of found sequences - get_sequences returns only those which repeat at least min_count times. Call without arguments to get the current value, with an argument to set it. Default (as well as the minimal allowed value) is 2.

shapeless

By default, get_sequences returns only "well-shaped" sequences, in which every opening tag is followed by the appropriate closing tag, with an exception for those tags whose closing tag is optional - e.g. <div><br></div> is well-shaped but neither <div><br> nor <br></div> is. Tags which don't need a closing tag are those identified by is_unclosed_tag. Closing tags are paired with the nearest opening tag with the same name which hasn't been paired yet. A well-shaped sequence is basically an HTML fragment - like a tree, except it doesn't have to have a single root.

Well-shaped sequences should be fine when processing valid HTML, but since this module doesn't restrict itself to valid HTML, that isn't always good enough. Setting shapeless to a true value removes this filtering and makes all sequences eligible.
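
The pairing rule for well-shaped sequences can be sketched in a few lines. This is one possible reading of the description above, not the module's implementation: the function name is_well_shaped is hypothetical, the %UNCLOSED table is a small illustrative subset standing in for is_unclosed_tag, and tag names follow the HTML::ListScraper::Tag convention (lowercase, closing tags prefixed with '/').

```perl
use strict;
use warnings;

# Illustrative subset of tags whose closing tag is optional or absent
# in HTML 4.01; the module's real list is longer.
my %UNCLOSED = map { $_ => 1 } qw(br hr img input li p);

# Hypothetical helper: true if every closing tag pairs with an earlier
# opening tag and every unpaired opening tag may legally stay open.
sub is_well_shaped {
    my @tags = @_;
    my @open;    # opening tags not yet paired
    for my $tag (@tags) {
        if ($tag =~ m{^/(.+)$}) {
            my $name = $1;
            # Pair with the nearest unpaired opening tag of the same name.
            my $i = $#open;
            $i-- while $i >= 0 && $open[$i] ne $name;
            return 0 if $i < 0;    # closing tag with nothing to close
            splice @open, $i, 1;
        }
        else {
            push @open, $tag;
        }
    }
    # Leftover opening tags are acceptable only if they need no closing tag.
    my $dangling = grep { !$UNCLOSED{$_} } @open;
    return $dangling ? 0 : 1;
}

for my $seq ([qw(div br /div)], [qw(div br)], [qw(br /div)]) {
    printf "%-12s %s\n", "@$seq",
        is_well_shaped(@$seq) ? 'well-shaped' : 'not well-shaped';
}
```

The three sequences are the ones named in the text: the first is reported as well-shaped, the other two are not. How overlapping pairs such as a b /a /b should be treated is not specified here; this sketch accepts them, which is an assumption.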

is_unclosed_tag

Test for tag names with an optional closing tag. Takes a tag name, returns true for tags declared in the HTML 4.01 Transitional DTD as having either an optional closing tag or none. Note that overriding this method in a subclass won't change HTML::ListScraper behavior - it delegates to a real implementation deep in this module's guts, which are not documented here.

get_all_tags

Accessor for the document's tag sequence maintained by HTML::ListScraper, used mainly for debugging. Takes no arguments, returns an array (array reference if called in a scalar context) of HTML::ListScraper::Tag objects.

get_sequences

The core of HTML::ListScraper. Takes no arguments, returns an array of HTML::ListScraper::Sequence objects. The sequences are sorted by length (shortest first).

"Sequences" with just 1 tag and sequences which don't repeat are never returned; depending on the value of min_count and shapeless, get_sequences may also ignore other ones (see min_count and shapeless).

find_sequences

A generalization of get_sequences. Like get_sequences, find_sequences takes no arguments and returns an array of HTML::ListScraper::Sequence objects - the same sequences, in fact, as get_sequences, but with potentially more instances. In addition to the exact matches, find_sequences tries to find "approximate" instance matches, that is, tag sequences with a non-zero but low edit distance from the exact sequence.

The alignment uses Algorithm::NeedlemanWunsch (q.v.) in its local mode, with fixed scores whose particular values hopefully don't matter much (see the source of HTML::ListScraper::Sweep if you're really interested in them). Approximate instances are sought between the exact ones, from the most similar to a cut-off point of low similarity.

Found approximate instances are identified by the HTML::ListScraper::Instance::match value "approx"; their score is available as the value of HTML::ListScraper::Instance::score. That value isn't always defined, though: if the shapeless flag isn't set, approximate tag sequences are made to look like valid HTML fragments by removing unpaired tags. Since that obviously damages the score, no score is returned for such cut-up instances.
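
To make "a non-zero but low edit distance" concrete, here is a plain Levenshtein distance over tag names. It is a deliberately simpler stand-in for the local Needleman-Wunsch alignment the module uses, with unit costs chosen for illustration; tag_edit_distance is a hypothetical helper, and the module's actual scores live in HTML::ListScraper::Sweep.

```perl
use strict;
use warnings;

# Hypothetical helper: Levenshtein distance between two tag-name lists
# (insert, delete and substitute each cost 1). Not the module's scoring.
sub tag_edit_distance {
    my ($x, $y) = @_;    # array references of tag names
    my @prev = (0 .. scalar @$y);
    for my $i (1 .. scalar @$x) {
        my @cur = ($i);
        for my $j (1 .. scalar @$y) {
            my $cost = $x->[$i - 1] eq $y->[$j - 1] ? 0 : 1;
            my $best = $prev[$j - 1] + $cost;                     # substitute or match
            $best = $prev[$j] + 1    if $prev[$j] + 1 < $best;    # delete from x
            $best = $cur[$j - 1] + 1 if $cur[$j - 1] + 1 < $best; # insert into x
            push @cur, $best;
        }
        @prev = @cur;
    }
    return $prev[-1];
}

my @exact  = qw(li a /a /li);
my @nearby = qw(li a img /a /li);       # one extra tag
my @other  = qw(li span /span /li);     # two different tags

print "exact vs nearby: ", tag_edit_distance(\@exact, \@nearby), "\n";
print "exact vs other:  ", tag_edit_distance(\@exact, \@other),  "\n";
```

An instance that differs from the exact sequence by one stray tag has distance 1 and is the kind of candidate an approximate match is meant to catch.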

get_known_sequence

When the "longest sequence repeating most often" found by HTML::ListScraper isn't quite the sought one, you can specify exactly which one you want by calling get_known_sequence instead of get_sequences. get_known_sequence takes a list of tag names spelled using the same convention as HTML::ListScraper::Tag, i.e. in lowercase, without angle brackets and with closing tags having '/' as the first character. If the parsed document doesn't contain the specified sequence, get_known_sequence returns undef. Otherwise, it returns an instance of HTML::ListScraper::Sequence.
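
The tag-name convention is easy to get wrong when copying from page source, so a small converter can help. The sketch below uses a crude regular expression rather than a real parser, and tag_names is a hypothetical helper, not part of the module.

```perl
use strict;
use warnings;

# Hypothetical helper: turn an HTML snippet into tag names in the
# convention described above (lowercase, no angle brackets, closing
# tags prefixed with '/'). Attributes and text are dropped.
sub tag_names {
    my ($markup) = @_;
    my @names;
    while ($markup =~ m{<\s*(/?)\s*([A-Za-z][A-Za-z0-9]*)[^>]*>}g) {
        push @names, $1 . lc $2;
    }
    return @names;
}

my @names = tag_names('<LI><A HREF="x">text</A></LI>');
print "@names\n";

# The resulting list is what get_known_sequence expects, e.g.
#   my $seq = $scraper->get_known_sequence(@names);
```

For the snippet shown, the list is li, a, /a, /li.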

find_known_sequence

A generalization of get_known_sequence. Like get_known_sequence, find_known_sequence takes a list of tag names, but it finds both exact and approximate matches for it. If the parsed document doesn't contain at least one tag sequence that matches at least approximately, find_known_sequence returns undef. Otherwise, it returns an instance of HTML::ListScraper::Sequence.

on_start

Attribute start handler. Registered with signature self, tagname, attr, although the only attribute preserved by HTML::ListScraper is href. For ultimate flexibility in preprocessing the input HTML, you can override this method in a subclass, but do call the base version at least conditionally. Note that if you just want to ignore some tags, there are simpler ways, e.g. HTML::Parser::ignore_tags.

on_text

Text handler. Registered with signature self, dtext. For ultimate flexibility in preprocessing the input HTML, you can override this method in a subclass, but do call the base version at least conditionally.

on_end

Attribute end handler. Registered with signature self, tagname. For ultimate flexibility in preprocessing the input HTML, you can override this method in a subclass.

BUGS

Requires too much configuration.

AUTHOR

Vaclav Barta, <vbar@comp.cz>

COPYRIGHT & LICENSE

Copyright 2007-2015 Vaclav Barta, all rights reserved.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.