WWW::SimpleRobot - a simple web robot for recursively following links on web pages.
    use WWW::SimpleRobot;
    my $robot = WWW::SimpleRobot->new(
        URLS            => [ 'http://www.perl.org/' ],
        FOLLOW_REGEX    => "^http://www.perl.org/",
        DEPTH           => 1,
        TRAVERSAL       => 'depth',
        VISIT_CALLBACK  => sub {
            my ( $url, $depth, $html, $links ) = @_;
            print STDERR "Visiting $url\n";
            print STDERR "Depth = $depth\n";
            print STDERR "HTML = $html\n";
            print STDERR "Links = @$links\n";
        },
        BROKEN_LINK_CALLBACK => sub {
            my ( $url, $linked_from, $depth ) = @_;
            print STDERR "$url looks like a broken link on $linked_from\n";
            print STDERR "Depth = $depth\n";
        },
    );
    $robot->traverse;
    my @urls  = @{ $robot->urls };
    my @pages = @{ $robot->pages };
    for my $page ( @pages )
    {
        my $url               = $page->{url};
        my $depth             = $page->{depth};
        my $modification_time = $page->{modification_time};
    }
A simple Perl module for writing web robots. For a more elaborate interface, see WWW::Robot. This module uses LWP::Simple to fetch pages and HTML::LinkExtor to extract the links from them; only the href attributes of anchor tags are extracted. Each extracted link is checked against the FOLLOW_REGEX regex to see if it should be followed, and a HEAD request is made to candidate links to check that they are 'text/html' pages.
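To make that cycle concrete, here is a minimal sketch of the same extract, filter, and HEAD-check steps done by hand with LWP::Simple and HTML::LinkExtor. The start URL and the regex are illustrative stand-ins for the URLS and FOLLOW_REGEX options; this is a sketch of the approach, not the module's internals verbatim.

    use strict;
    use warnings;
    use LWP::Simple qw( get head );
    use HTML::LinkExtor;

    my $page_url     = 'http://www.perl.org/';       # stands in for URLS
    my $follow_regex = qr{^http://www\.perl\.org/};  # stands in for FOLLOW_REGEX

    my $html = get( $page_url ) or die "can't get $page_url\n";

    # Collect only the href attributes of anchor tags; the base URL
    # argument makes HTML::LinkExtor resolve relative links for us.
    my @links;
    my $extor = HTML::LinkExtor->new(
        sub {
            my ( $tag, %attr ) = @_;
            push @links, $attr{href} if $tag eq 'a' and $attr{href};
        },
        $page_url,
    );
    $extor->parse( $html );

    for my $link ( @links )
    {
        next unless $link =~ $follow_regex;      # the FOLLOW_REGEX check
        my ( $content_type ) = head( $link );    # the HEAD request
        next unless $content_type and $content_type eq 'text/html';
        print "would follow $link\n";
    }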
This robot doesn't respect the Robot Exclusion Protocol (http://info.webcrawler.com/mak/projects/robots/norobots.html) (naughty robot!), and it does no exception handling if it can't get a page: it just ignores the failure and goes on to the next page!
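If you want a politer robot, one workaround (an assumption about how you might wrap this module, not a feature it provides) is to filter your start URLs through WWW::RobotRules yourself before passing them in:

    use strict;
    use warnings;
    use LWP::Simple qw( get );
    use WWW::RobotRules;

    # 'MyRobot/1.0' is an illustrative User-Agent name
    my $rules = WWW::RobotRules->new( 'MyRobot/1.0' );

    my $robots_url = 'http://www.perl.org/robots.txt';
    my $robots_txt = get( $robots_url );
    $rules->parse( $robots_url, $robots_txt ) if defined $robots_txt;

    # Only hand WWW::SimpleRobot the URLs the rules allow
    my @start_urls = grep { $rules->allowed( $_ ) } ( 'http://www.perl.org/' );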
Ave Wrigley <Ave.Wrigley@itn.co.uk>
Copyright (c) 2001 Ave Wrigley. All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
To install WWW::SimpleRobot, copy and paste the appropriate command into your terminal.
cpanm
cpanm WWW::SimpleRobot
CPAN shell
perl -MCPAN -e shell
install WWW::SimpleRobot
For more information on module installation, please visit the detailed CPAN module installation guide.