NAME

WWW::Spyder - a simple non-persistent web crawler.

VERSION

0.24

SYNOPSIS

A web spider that returns plain text, HTML, and other information for each page crawled, and that can decide which pages to fetch and parse by comparing supplied terms against link text and page content.

 use WWW::Spyder;
 # Supply your own LWP::UserAgent-compatible agent.
 use WWW::Mechanize;

 my $start_url = "http://my-great-domain.com/";
 my $mech = WWW::Mechanize->new(agent => "PreferredAgent/0.01");

 my $spyder = WWW::Spyder->new(
                               report_broken_links => 1,
                               seed                => $start_url,
                               sleep_base          => 5,
                               UA                  => $mech
                );
 while ( my $page = $spyder->crawl ) {
     # do something with the page...
 }

METHODS

  • $spyder->new()

    Construct a new spyder object. Without at least seed() set or go_to_seed() turned on, the spyder isn't ready to crawl.

     $spyder = WWW::Spyder->new( shift || die "Gimme a URL!\n" );
        # ...or...
     $spyder = WWW::Spyder->new( %options );

    Options include: sleep_base (in seconds), exit_on (a hash of methods and settings), report_broken_links, image_checking (verifies the images pointed to by <img src=...> tags), disable_cnap (disables the courtesy nap when verbose output is enabled), and UA (you can pass in an instantiated LWP::UserAgent-compatible object via UA, i.e. UA => $ua_obj). Examples below.
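
    For example, a more fully configured constructor call might look like this (the option names are those listed above; the values are arbitrary):

        my $spyder = WWW::Spyder->new(
            seed                => 'http://my-great-domain.com/',
            sleep_base          => 10,
            report_broken_links => 1,
            image_checking      => 1,
            disable_cnap        => 1,
            exit_on             => { pages => 100, time => '30min' },
        );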

  • $spyder->seed($url)

    Adds a URL (or URLs) to the top of the queues for crawling. If the spyder is constructed with a single scalar argument, that is considered the seed.

  • $spyder->bell([bool])

    This will print a bell ("\a") to STDERR on every successfully crawled page. It might seem annoying, but it is an excellent way to know your spyder is behaving and working. A true value turns it on; right now it can't be turned off.

  • $spyder->spyder_time([bool])

    Returns raw seconds since the spyder was created if given a true value, otherwise returns "D day(s) HH:MM:SS."
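
    For example:

        print $spyder->spyder_time(1), "\n";   # raw seconds, e.g. 2537
        print $spyder->spyder_time,    "\n";   # e.g. "0 day(s) 00:42:17"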

  • $spyder->terms([list of terms to match])

    The more terms, the more the spyder is going to grasp at. If you give a straight list of strings, they will be turned into very open regexes. E.g.: "king" would match "sulking" and "kinglet" but not "King." It is case sensitive right now. If you want more specific matching or different behavior, pass your own regexes instead of strings.

        $spyder->terms( qr/\bkings?\b/i, qr/\bqueens?\b/i );

    terms() is only settable once right now, then it's a done deal.

  • $spyder->spyder_data()

    A comma-formatted number of kilobytes retrieved so far. Although it's implemented as a set/get routine, don't give it an argument.

  • $spyder->slept()

    Returns the total number of seconds the spyder has slept while running. Useful for getting accurate page/time counts (spyder performance) by discounting the added courtesy naps.
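
    For example, a hypothetical throughput calculation (it assumes you keep your own page counter, $page_count, while crawling):

        my $active_seconds = $spyder->spyder_time(1) - $spyder->slept;
        printf "%.2f pages/second\n", $page_count / $active_seconds
            if $active_seconds;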

  • $spyder->UA->...

    The user agent. It should be an LWP::UserAgent or a well-behaved subclass like WWW::Mechanize. Here are the initialized values you might want to tweak:

        $spyder->UA->timeout(30);
        $spyder->UA->max_size(250_000);
        $spyder->UA->agent('Mozilla/5.0');

    Changing the agent name can hurt your spyder because some servers won't return content unless it's requested by a "browser" they recognize.

    You should probably add your email with from() as well.

        $spyder->UA->from('bluefintuna@fish.net');

  • $spyder->cookie_file([local_file])

    Cookies live in $ENV{HOME}/spyderCookie by default, but you can set your own file if you prefer, or save different cookie files for different spyders.
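
    For example (the path is arbitrary):

        $spyder->cookie_file("$ENV{HOME}/my_spyder_cookies");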

  • $spyder->get_broken_links

    Returns a reference to a list of broken link URLs if report_broken_links was enabled in the constructor.
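
    For example, assuming report_broken_links was turned on:

        my $broken = $spyder->get_broken_links;
        print "Broken link: $_\n" for @{ $broken || [] };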

  • $spyder->go_to_seed

  • $spyder->queue_count

  • $spyder->show_attributes

  • $spyder->spydered

  • $spyder->crawl

    Returns (and removes) a Spyder page object from the queue of spydered pages.

Spyder::Page methods

  • $page->title

  • $page->text

  • $page->raw

  • $page->url

  • $page->domain

  • $page->link_name

  • $page->link

  • $page->description

  • $page->pages_enQs

Weird courteous behavior

Courtesy didn't use to be weird, but that's another story. You will probably notice that the courtesy routines force a sleep when a recently seen domain is the only choice for a new link. The sleep is partially randomized; this is to keep the spyder from being recognized in weblogs as a robot.

The web and courtesy

Please, I beg of thee, exercise the most courtesy you can. Don't let impatience get in the way. Bandwidth and server traffic are $MONEY for real. The web is an extremely disorganized and corrupted database at the root but companies and individuals pay to keep it available. The less pain you cause by banging away on a webserver with a web agent, the more welcome the next web agent will be.

Update: Google seems to be excluding generic LWP agents now. See, I told you so. A single parallel robot can really hammer a major server, even someone with as big a farm and as much bandwidth as Google.

VERBOSITY

  • $spyder->verbosity([1-6]) -OR-

  • $WWW::Spyder::VERBOSITY = ...

    Set it from 1 to 6 right now to get varying amounts of extra info to STDOUT. It's an uneven scale and will be straightened out pretty soon. If kids have a preference for sending the info to STDERR, I'll do that. I might anyway.

SAMPLE USAGE

See "spyder-mini-bio" in this distribution

It's an extremely simple but fairly cool pseudo bio-researcher.

Simple continually crawling spyder:

In the following code snippet:

 use WWW::Spyder;

 my $spyder = WWW::Spyder->new( shift || die "Give me a URL!\n" );

 while ( my $page = $spyder->crawl ) {

    print '-'x70,"\n";
    print "Spydering: ", $page->title, "\n";
    print "      URL: ", $page->url, "\n";
    print "     Desc: ", $page->description || 'n/a', "\n";
    print '-'x70,"\n";
    while ( my $link = $page->next_link ) {
        printf "%22s ->> %s\n",
            length($link->name) > 22
                ? substr($link->name, 0, 19) . '...'
                : $link->name,
            length($link) > 43
                ? substr($link, 0, 40) . '...'
                : $link;
    }
 }

as long as unique URLs are being found in the pages crawled, the spyder will never stop.

Each "crawl" returns a page object which gives the following methods to get information about the page.

  • $page->links

    URLs found on the page.

  • $page->title

    Page's <TITLE> Title </TITLE> if there is one.

  • $page->text

    The parsed plain text out of the page. Uses HTML::Parser and tries to ignore non-readable stuff like comments and scripts.

  • $page->url

  • $page->domain

  • $page->raw

    The content returned by the server. Should be HTML.

  • $page->description

    The META description of the page if there is one.

  • $page->links

    Returns a list of the URLs in the page. Note: next_link() will shift the available list of links() each time it's called.

  • $link = $page->next_link

    next_link() destructively returns the next URI-ish object in the page. They are objects with three accessors.

  • $link->url

    This is also overloaded so that interpolating "$link" will get the URL just as the method does.

  • $link->name

  • $link->domain
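
A minimal sketch of walking a page's links with these accessors:

 while ( my $link = $page->next_link ) {
     print "Found $link\n";                 # overloaded: interpolates as the URL
     print "   name: ", $link->name,   "\n";
     print " domain: ", $link->domain, "\n";
 }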

Spyder that will give up the ghost...

The following spyder is initialized to stop crawling when either of its conditions is met: 10 minutes pass or 300 pages are crawled.

 use WWW::Spyder;

 my $url = shift || die "Please give me a URL to start!\n";

 my $spyder = WWW::Spyder->new
      (seed        => $url,
       sleep_base  => 10,
       exit_on     => { pages => 300,
                        time  => '10min', },);

 while ( my $page = $spyder->crawl ) {

    print '-'x70,"\n";
    print "Spydering: ", $page->title, "\n";
    print "      URL: ", $page->url, "\n";
    print "     Desc: ", $page->description || '', "\n";
    print '-'x70,"\n";
    while ( my $link = $page->next_link ) {
        printf "%22s ->> %s\n",
            length($link->name) > 22
                ? substr($link->name, 0, 19) . '...'
                : $link->name,
            length($link) > 43
                ? substr($link, 0, 40) . '...'
                : $link;
    }
 }

Primitive page reader

 use WWW::Spyder;
 use Text::Wrap;

 my $url = shift || die "Please give me a URL to start!\n";
 @ARGV or die "Please also give me a search term.\n";
 my $spyder = WWW::Spyder->new;
 $spyder->seed($url);
 $spyder->terms(@ARGV);

 while ( my $page = $spyder->crawl ) {
     print '-'x70,"\n * ";
     print $page->title, "\n";
     print '-'x70,"\n";
     print wrap('','', $page->text);
     sleep 60;
 }

TIPS

If you are going to do anything important with it, implement some signal blocking to prevent accidental problems and tie your gathered information to a DB_File or some such.
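
One possible sketch of both ideas, assuming results are collected into a hash keyed by URL (the filename and the choice of signals are arbitrary):

 use DB_File;
 use Fcntl qw(O_CREAT O_RDWR);
 use POSIX qw(:signal_h);

 tie my %results, 'DB_File', 'spydered.db', O_CREAT|O_RDWR, 0644, $DB_HASH;

 my $sigset = POSIX::SigSet->new(SIGINT, SIGTERM);

 while ( my $page = $spyder->crawl ) {
     # Block interrupts while writing so a Ctrl-C can't leave the file half-written.
     sigprocmask(SIG_BLOCK, $sigset);
     $results{ $page->url } = $page->text;
     sigprocmask(SIG_UNBLOCK, $sigset);
 }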

You might want to call POSIX::nice(40). It should top the nice off at your system's max and keep your spyder from interfering with your system.

You might want to set $| = 1.
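
For example:

 use POSIX ();
 POSIX::nice(40);   # clipped to the system's maximum niceness; keeps the spyder low priority
 $| = 1;            # unbuffer STDOUT so progress output appears immediately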

PRIVATE METHODS

are private but hack away if you're inclined

TO DO

Spyder is conceived to live in a future namespace as a servant class for a complex web research agent with simple interfaces to pre-designed grammars for research reports; or self-designed grammars/reports (might be implemented via Parse::FastDescent if that lazy-bones Conway would just find another 5 hours in the paltry 32 hour day he's presently working).

I'd like the thing to be able to parse RTF, PDF, and perhaps even resource sections of image files but that isn't on the radar right now.

The tests should work differently. Currently they ask for outside resources without checking if there is either an open way to do it or if the user approves of it. Bad form all around.

TO DOABLE BY 1.0

Add 2-4 sample scripts that are a bit more useful.

There are many functions that should be under the programmer's control and not buried in the spyder. They will emerge soon. I'd like to put in hooks to allow the user to keep(), toss(), or exclude() URLs, link names, and domains while crawling.

Clean up some redundant, sloppy, and weird code. Probably change or remove the AUTOLOAD.

Put in a go_to_seed() method and a subclass, ::Seed, with rules to construct query URLs by search engine. It would be the autostart or the fallback for perpetual spyders that run out of links. It would hit a given or default search engine with the Spyder's terms as the query. Obviously this would only work with terms() defined.

Implement auto-exclusion for failure vs. success rates on names as well as domains (maybe URI suffixes too).

Turn the length of the courtesy queue into the breadth/depth setting? Make it automatically adjusting...?

Consistently found link names are excluded from term strength sorting? E.g.: "privacy policy," "read more," "copyright..."

Fix some image tag parsing problems and add area tag parsing.

Configuration for user:password by domain.

::Page objects become reusable so that a spyder only needs one.

::Enqueue objects become indexed so they are nixable from anywhere.

Expand exit_on routines to size, slept time, dwindling success ratio, and maybe more.

Make methods to set "skepticism" and "effort" which will influence the way the terms are used to keep, order, and toss URLs.

BE WARNED

This module already does some extremely useful things but it's in its infancy and it is conceived to live in a different namespace and perhaps become more private as a subservient part of a parent class. This may never happen but it's the idea. So don't put this into production code yet. I am endeavoring to keep its interface constant either way. That said, it could change completely.

Also!

This module saves cookies to the user's home. There will be more control over cookies in the future, but that's how it is right now. They live in $ENV{HOME}/spyderCookie.

Anche!

Robot Rules aren't respected. Spyder endeavors to be polite as far as server hits are concerned, but doesn't take "no" for an answer right now. I want to add this, and not just by domain, but by page settings.

UNDOCUMENTED FEATURES

A.k.a. Bugs. Don't be ridiculous! Bugs in my code?!

There is a bug that is causing image src tags to be retrieved as links, I think, but I haven't tracked it down yet. I also think the plain text parsing has some problems which will be remedied shortly.

If you are building more than one spyder in the same script they are going to share the same exit_on parameters because it's a self-installing method. This will not always be so.

See Bugs file for more open and past issues.

Let me know if you find any others. If you find one that is platform specific, please send patch code/suggestion because I might not have any idea how to fix it.

WHY Spyder?

I didn't want to use the more appropriate Spider because I think there is a better one out there somewhere in the zeitgeist and the namespace future of Spyder is uncertain. It may end up a semi-private part of a bigger family. And I may be King of Kenya someday. One's got to dream.

If you like Spyder, have feedback, wishlist usage, better algorithms/implementations for any part of it, please let me know!

THANKS TO

Most all y'all. Especially Lincoln Stein, Gisle Aas, The Conway, Raphael Manfredi, Gurusamy Sarathy, and plenty of others.

COMPARE WITH (PROBABLY PREFER)

WWW::Robot, LWP::UserAgent, WWW::SimpleRobot, WWW::RobotRules, LWP::RobotUA, and other kith and kin.

LICENCE AND COPYRIGHT

Copyright (c) 2001-2008, Ashley Pond V <ashley@cpan.org>. All rights reserved.

This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself. See perlartistic.

DISCLAIMER OF WARRANTY

BECAUSE THIS SOFTWARE IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY FOR THE SOFTWARE, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE SOFTWARE "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE SOFTWARE IS WITH YOU. SHOULD THE SOFTWARE PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR, OR CORRECTION.

IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR REDISTRIBUTE THE SOFTWARE AS PERMITTED BY THE ABOVE LICENCE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE SOFTWARE (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE SOFTWARE TO OPERATE WITH ANY OTHER SOFTWARE), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.