
Module Version: 0.07

NAME

WWW::SimpleRobot - a simple web robot for recursively following links on web pages.

SYNOPSIS

    use WWW::SimpleRobot;
    my $robot = WWW::SimpleRobot->new(
        URLS            => [ 'http://www.perl.org/' ],
        FOLLOW_REGEX    => '^http://www\.perl\.org/',
        DEPTH           => 1,
        TRAVERSAL       => 'depth',
        VISIT_CALLBACK  => sub {
            my ( $url, $depth, $html, $links ) = @_;
            print STDERR "Visiting $url\n";
            print STDERR "Depth = $depth\n";
            print STDERR "HTML = $html\n";
            print STDERR "Links = @$links\n";
        },
        BROKEN_LINK_CALLBACK => sub {
            my ( $url, $linked_from, $depth ) = @_;
            print STDERR "$url looks like a broken link on $linked_from\n";
            print STDERR "Depth = $depth\n";
        },
    );
    $robot->traverse;
    my @urls = @{$robot->urls};
    my @pages = @{$robot->pages};
    for my $page ( @pages )
    {
        my $url = $page->{url};
        my $depth = $page->{depth};
        my $modification_time = $page->{modification_time};
    }

DESCRIPTION

    A simple Perl module for crawling web pages. For a more elaborate
    interface, see WWW::Robot. This version uses LWP::Simple to fetch pages
    and HTML::LinkExtor to extract the links from them. Only the href
    attributes of anchor tags are extracted. Extracted links are checked
    against the FOLLOW_REGEX regex to decide whether they should be followed.
    A HEAD request is made to these links to check that they are 'text/html'
    pages.
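
The extraction and filtering steps described above can be sketched roughly as follows. This is not the module's own code: followable_links is a hypothetical helper, the sample HTML and the www.example.org URL are made up for illustration, and the HEAD check is left out because it needs network access.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Hypothetical helper (not part of WWW::SimpleRobot): return the absolute
# href of every anchor tag in $html that matches $follow_regex.
sub followable_links {
    my ( $html, $base, $follow_regex ) = @_;
    my @links;
    my $extor = HTML::LinkExtor->new(
        sub {
            my ( $tag, %attr ) = @_;
            # Keep only the href attributes of anchor tags.
            return unless $tag eq 'a' and defined $attr{href};
            push @links, "$attr{href}";
        },
        $base,    # relative links are made absolute against this URL
    );
    $extor->parse( $html );
    $extor->eof;
    return grep { /$follow_regex/ } @links;
}

# Made-up sample page: one relative anchor, one image, one off-site anchor.
my $html = <<'HTML';
<a href="/about.html">About</a>
<img src="/logo.png">
<a href="http://www.example.org/">Elsewhere</a>
HTML

my @follow = followable_links(
    $html,
    'http://www.perl.org/',
    qr{^http://www\.perl\.org/},
);
print "$_\n" for @follow;
```

The image link is dropped because it is not an anchor tag, and the off-site anchor is dropped because it does not match the regex, so only the relative link, made absolute, should remain.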

BUGS

    This robot doesn't respect the Robot Exclusion Protocol
    (http://info.webcrawler.com/mak/projects/robots/norobots.html) (naughty
    robot!), and it does no exception handling when it can't fetch a page: it
    just ignores the page and goes on to the next one.

AUTHOR

Ave Wrigley <Ave.Wrigley@itn.co.uk>

COPYRIGHT

Copyright (c) 2001 Ave Wrigley. All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
