
NAME

webreaper -- download a web page and its links

SYNOPSIS

        webreaper [OPTIONS] URL

DESCRIPTION

THIS IS ALPHA SOFTWARE

The webreaper program downloads web sites. It creates a directory, named after the host of the URL given on the command line, in the current working directory, and can optionally create a tar or zip archive of it.

Getting around web site misfeatures

This script has many features that make it look like a normal, interactive web browser. You can set values for some features, or use the defaults, listed later.

Set the user-agent string with the -a switch. Some web sites refuse to work with certain browsers because they want you to use Internet Explorer. While webreaper is not subject to JavaScript checks (except for ones that try to redirect you), some servers attempt that sort of browser detection behind the scenes.

Set the referer [sic] string. Some sites limit what you can see based on how they think you got to the address (i.e. they want you to click on a certain link). The script automatically sets the referer strings for links it finds in web pages, but you can set the referer for the first link (the one you specify on the command line) with the -r switch.
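For example, to make the first request look as though you clicked through from the site's front page (the URLs here are placeholders, not real pages):

```shell
# Hypothetical invocation: start the crawl at a downloads page while
# sending the site's front page as the referer for that first request.
webreaper -r http://www.example.com/ http://www.example.com/downloads.html
```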

Basic browser features

For websites that use a login and password, use the -u and -p switches. This feature is still a bit broken because it sends the authorization string for every address.
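A sketch of a password-protected fetch; the username, password, and URL are made up for illustration:

```shell
# Hypothetical invocation: supply basic-auth credentials with -u and -p.
# Note the caveat above: the authorization string is sent for every address.
webreaper -u alice -p secret http://www.example.com/members/
```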

Script features

Watch the action by turning on verbose messages with the -v switch. If you run this script from another script, cron, or some other automated method, you probably want no output, so do not use -v. You can also set the WEBREAPER_VERBOSE environment variable.
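Either form below turns on progress messages; both invocations are hypothetical:

```shell
# Two equivalent ways to enable verbose output.
webreaper -v http://www.example.com
WEBREAPER_VERBOSE=1 webreaper http://www.example.com
```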

To get even more output, use the -d switch to turn on debugging output. You can also set the WEBREAPER_DEBUG variable.

To collect everything you download into a single file, create an archive with the -t switch (a tarball) or the -z switch (a zip archive).

The script limits its traversal to URLs below the starting URL. This may change in the future.
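The restriction above amounts to a prefix check on each discovered URL. A minimal sketch of the idea, in shell; this is an illustration only, not the script's actual code, and the starting URL is made up:

```shell
# Sketch of webreaper's traversal rule: only URLs at or below the
# starting URL are followed.  A simple prefix test captures the idea.
start="http://www.example.com/docs/"

in_scope() {
  case "$1" in
    "$start"*) return 0 ;;   # URL begins with the starting URL: follow it
    *)         return 1 ;;   # anything else is out of scope: skip it
  esac
}

in_scope "http://www.example.com/docs/page.html" && echo "follow"
in_scope "http://www.example.com/other/"         || echo "skip"
```

A plain prefix test like this is why the trailing slash on the starting URL matters: without it, sibling paths that merely share a name prefix would slip through.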

Command line switches

-a USER_AGENT

set the user agent string

-e

list of file extensions to store (not yet implemented)

-E

list of file extensions to skip (not yet implemented)

-d

turn on debugging output

-D DIRECTORY

use this directory for downloads

-f

store all files in the same directory (flat)

-h HOST1[,HOST2...]

allowed hosts, comma-separated

-n NUMBER

stop after requesting NUMBER resources, whether or not webreaper stored them

-N NUMBER

stop after storing NUMBER resources

-r REFERER_URL

referer for the first URL

-p PASSWORD

password for basic auth

-s SECONDS

sleep between requests

-t

create tar archive

-u USERNAME

username for basic auth

-v

verbose output

-z

create a zip archive

Examples

scrape a site, with a randomized pause between requests

        webreaper -s 10 http://www.example.com

make a tar archive

        webreaper -t http://www.example.com

make a zip archive

        webreaper -z http://www.example.com

make a tar and a zip archive

        webreaper -t -z http://www.example.com

set the user agent string

        webreaper -a "Mozilla 19.2 (Sony PlayStation)" http://www.example.com

stop after making 10 requests or storing 5 files, whichever comes first

        webreaper -N 5 -n 10 http://www.example.com

Environment variables

WEBREAPER_DEBUG

Show debugging output (implies verbose output). This is the same as the -d switch.

WEBREAPER_VERBOSE

Show progress information. This is the same as the -v switch.

WEBREAPER_DIR

Store downloads in this directory. The script uses the current working directory if this directory does not exist. This is the same as the -D switch.

Wish list

limit directory level
limit content types, file names to store
specify a set of patterns to ignore
do conditional GETs
Tk or curses interface?
create an error log, report, or something
download stats (clock time, storage space, etc)
multiple levels of verbosity for output
read items from a config file
allow user to add/delete allowed domains during runtime
ensure that path names are safe (i.e. no ..)

SEE ALSO

lwp-rget (comes with LWP)

SOURCE AVAILABILITY

This source is part of a SourceForge project which always has the latest sources in CVS, as well as all of the previous releases.

        http://sourceforge.net/projects/brian-d-foy/

If, for some reason, I disappear from the world, one of the other members of the project can shepherd this module appropriately.

AUTHOR

brian d foy, <bdfoy@cpan.org>

COPYRIGHT

Copyright 2003-4, brian d foy, All rights reserved.

You may use this program under the same terms as Perl itself.