Name
WWW::PDAScraper - Class for scraping PDA-friendly content from websites
Synopsis
use WWW::PDAScraper;
my $scraper = WWW::PDAScraper->new( qw( NewScientist Yahoo::Entertainment ) );
$scraper->scrape();
or
use WWW::PDAScraper;
my $scraper = WWW::PDAScraper->new;
$scraper->scrape( qw( NewScientist Yahoo::Entertainment ) );
or
perl -MWWW::PDAScraper -e "scrape qw( NewScientist Yahoo::Entertainment )"
Description
Having written various kludgey scripts to download PDA-friendly content
from different websites, I decided to try to write a generalised
solution which would
* parse out the section of a news page which contains the links we want
* munge those links into the URL for the print-friendly version, if
possible
* download those pages and make an index page for them
Moving the pages to your PDA is outside the scope of this module: the
open-source browser and "distiller", Plucker, from http://plkr.org/ is
recommended. Just get it to read the index.html file with a depth of 1
from disk, using a URL like file:///path/to/index.html
The Sub-modules
WWW::PDAScraper takes the rules for scraping a particular website from a
separate sub-module, e.g. "WWW::PDAScraper::Yahoo::Entertainment::TV"
contains the rules for scraping the Yahoo TV News website:
package WWW::PDAScraper::Yahoo::Entertainment::TV;

# WWW::PDAScraper.pm rules for scraping the
# Yahoo TV website

sub config {
    return {
        name       => 'Yahoo TV',
        start_from => 'http://news.yahoo.com/i/763',
        chunk_spec => [ "_tag", "div", "id", "indexstories" ],
        url_regex  => [ '$', '&printer=1' ]
    };
}

1;
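The url_regex pair above looks like the pattern and replacement for a
substitution applied to each harvested link; a minimal sketch of how
that transformation presumably works (apply_url_regex is a hypothetical
helper for illustration, not part of the module):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical illustration: apply a url_regex pair [ PATTERN, REPLACEMENT ]
# to turn an article URL into its print-friendly form, as the Yahoo TV
# rules above suggest (appending "&printer=1" at the end of the URL).
sub apply_url_regex {
    my ( $url, $url_regex ) = @_;
    my ( $pattern, $replacement ) = @$url_regex;
    ( my $print_url = $url ) =~ s/$pattern/$replacement/;
    return $print_url;
}

my $print_url = apply_url_regex(
    'http://news.yahoo.com/s/nm/tv_dc',
    [ '$', '&printer=1' ],
);
print "$print_url\n";    # http://news.yahoo.com/s/nm/tv_dc&printer=1
```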
A more or less random selection of modules is included, as well as a
full set for Yahoo, to demonstrate a logical set of modules in
categories.
Creating a new sub-module ought to be relatively simple; see the
template provided, WWW::PDAScraper::Template.pm. You need "name" and
"start_from", then either "chunk_spec" or "url_spec", then optionally a
"url_regex" for transformation into the print-friendly URL.
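For instance, a new sub-module might look like the following sketch; the
package name, site name, URL and selectors here are all invented for
illustration, using the keys described above:

```perl
package WWW::PDAScraper::Example::News;

# Hypothetical rules for an imaginary "Example News" site --
# every value below is made up, following WWW::PDAScraper::Template.pm
sub config {
    return {
        name       => 'Example News',
        start_from => 'http://news.example.com/headlines',
        # find the <div id="headlines"> block and harvest its links
        chunk_spec => [ "_tag", "div", "id", "headlines" ],
        # turn story.html into story-print.html
        url_regex  => [ '\.html$', '-print.html' ],
    };
}

1;
```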
Then either move your new module to the same location as the other ones
on your system, or make sure it's available to your script with a line
like "use lib '/path/to/local/modules/PDAScraper/'".
Usage
WWW::PDAScraper ought to be very simple to run, assuming you have the
right sub-module(s).
It has only two main methods, new() and scrape(), and two supplementary
ones: one for assigning a proxy server to the user-agent and one for
overriding the default download location.
Either object-oriented, loading the sub-module(s) as part of "new":
use WWW::PDAScraper;
my $scraper = WWW::PDAScraper->new( qw( NewScientist Yahoo::Entertainment ) );
$scraper->scrape();
or object-oriented, loading the sub-module(s) as part of each call to
scrape():
use WWW::PDAScraper;
my $scraper = WWW::PDAScraper->new;
$scraper->scrape( qw( NewScientist Yahoo::Entertainment ) );
$scraper->scrape( qw( SomethingElse ) );
or procedural:
use WWW::PDAScraper;
scrape qw( NewScientist Yahoo::Entertainment );
or from the command line:
perl -MWWW::PDAScraper -e "scrape qw( NewScientist Yahoo::Entertainment )"
The only extras involved would be adding a proxy to the user-agent
and/or overriding the default download location of $ENV{'HOME'}/scrape/.
Object-oriented:
use WWW::PDAScraper;
my $scraper = WWW::PDAScraper->new;
$scraper->proxy('http://your.proxy.server:port/');
$scraper->download_location("/path/to/folder/");
Procedural:
use WWW::PDAScraper;
proxy('http://your.proxy.server:port/');
download_location("/path/to/folder/");
I wish I didn't need this code
In the days of modern web publishing, I shouldn't need to create this
code. All websites should make themselves PDA-friendly by the use of
client detection or smart CSS or XML. But they don't.
Bugs
The websites will certainly change, and at that time the sub-modules
will stop working. There's no way around that.
Obviously it would be useful if there were a developer/user community
which contributed new modules and updated the old ones.
See Also
HTML::Element, for the syntax of "chunk_spec" in sub-modules.
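As far as I can tell, the four-element "chunk_spec" is simply the
argument list for HTML::Element's look_down() method; a sketch, assuming
HTML::TreeBuilder (from the HTML-Tree distribution) is installed:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;    # builds a tree of HTML::Element nodes

my $tree = HTML::TreeBuilder->new_from_content(<<'HTML');
<html><body>
  <div id="indexstories"><a href="/story1">Story one</a></div>
  <div id="sidebar">ignored</div>
</body></html>
HTML

# A chunk_spec from a sub-module, passed straight to look_down():
# match a <div> element whose id attribute is "indexstories"
my @chunk_spec = ( "_tag", "div", "id", "indexstories" );
my $chunk = $tree->look_down(@chunk_spec);
my $text  = $chunk ? $chunk->as_text : '';
print "$text\n";    # Story one
```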
To do
The user-agent should really be part of the object, I guess. That would
be neater.
And it should actually use WWW::Robot instead of LWP so it doesn't
hammer servers.
And we could either add arbitrary numbers of regexes for fixing up the
pages of sites which don't have a print-friendly version of the page, or
add a second level of parsing to find the print-friendly link, for sites
where there's no logical relationship between the regular link and the
print-friendly one.
Author
John Horner
CPAN ID: CODYP
bounce@johnhorner.nu
http://pdascraper.johnhorner.nu/
Copyright
This program is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.