NAME

WWW::Spyder - a simple non-persistent web crawler.

VERSION

0.24

SYNOPSIS

A web spider that returns plain text, HTML, and other information per page crawled and can determine what pages to get and parse based on supplied terms compared to the text in links as well as page content.

 use WWW::Spyder;
 # Supply your own LWP::UserAgent-compatible agent.
 use WWW::Mechanize;

 my $start_url = "http://my-great-domain.com/";
 my $mech = WWW::Mechanize->new(agent => "PreferredAgent/0.01");

 my $spyder = WWW::Spyder->new(
                               report_broken_links => 1,
                               seed                => $start_url,
                               sleep_base          => 5,
                               UA                  => $mech
                );
 while ( my $page = $spyder->crawl ) {
     # do something with the page...
 }

METHODS

Spyder::Page methods

Weird courteous behavior

Courtesy didn't use to be weird, but that's another story. You will probably notice that the courtesy routines force a sleep when a recently seen domain is the only choice for a new link. The sleep is partially randomized so that the spyder is less likely to be recognized as a robot in web server logs.
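The partially randomized sleep can be pictured with a small sketch. This is not the module's internal code: the helper name courtesy_delay and the choice of a random extra of up to half the base are assumptions made purely for illustration, with $sleep_base standing in for the sleep_base option shown in the synopsis.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of WWW::Spyder: returns a delay in seconds
# made of a fixed base plus a random extra of up to half the base.
sub courtesy_delay {
    my ($sleep_base) = @_;
    my $jitter = rand( $sleep_base / 2 );    # 0 <= jitter < base/2
    return $sleep_base + $jitter;
}

my $delay = courtesy_delay(5);
printf "Sleeping %.2f seconds before the next request\n", $delay;

# In a real crawler you would now call sleep($delay); note that the
# built-in sleep() only honors whole seconds.
```

Because the extra wait differs on every call, requests to the same domain do not arrive at a fixed interval.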

The web and courtesy

Please, I beg of thee, exercise the most courtesy you can. Don't let impatience get in the way. Bandwidth and server traffic are $MONEY for real. The web is an extremely disorganized and corrupted database at the root but companies and individuals pay to keep it available. The less pain you cause by banging away on a webserver with a web agent, the more welcome the next web agent will be.

Update: Google seems to be excluding generic LWP agents now. See, I told you so. A single parallel robot can really hammer a major server, even someone with as big a farm and as much bandwidth as Google.

VERBOSITY

SAMPLE USAGE

See "spyder-mini-bio" in this distribution

It's an extremely simple, but fairly cool pseudo bio-researcher.

Simple continually crawling spyder:

In the following code snippet:

 use WWW::Spyder;

 my $spyder = WWW::Spyder->new( shift || die "Give me a URL!\n" );

 while ( my $page = $spyder->crawl ) {

    print '-'x70,"\n";
    print "Spydering: ", $page->title, "\n";
    print "      URL: ", $page->url, "\n";
    print "     Desc: ", $page->description || 'n/a', "\n";
    print '-'x70,"\n";
    while ( my $link = $page->next_link ) {
        printf "%22s ->> %s\n",
        length($link->name) > 22 ?
            substr($link->name,0,19).'...' : $link->name,
            length($link) > 43 ?
                substr($link,0,40).'...' : $link;
    }
 }

as long as unique URLs are being found in the pages crawled, the spyder will never stop.

Each call to "crawl" returns a page object. Its methods, such as title, url, description, text, and next_link in these samples, give information about the page.

Spyder that will give up the ghost...

The following spyder is initialized to stop crawling when either of its conditions is met: 10 minutes pass or 300 pages are crawled.

 use WWW::Spyder;

 my $url = shift || die "Please give me a URL to start!\n";

 my $spyder = WWW::Spyder->new
      (seed        => $url,
       sleep_base  => 10,
       exit_on     => { pages => 300,
                        time  => '10min', },);

 while ( my $page = $spyder->crawl ) {

    print '-'x70,"\n";
    print "Spydering: ", $page->title, "\n";
    print "      URL: ", $page->url, "\n";
    print "     Desc: ", $page->description || '', "\n";
    print '-'x70,"\n";
    while ( my $link = $page->next_link ) {
        printf "%22s ->> %s\n",
        length($link->name) > 22 ?
            substr($link->name,0,19).'...' : $link->name,
            length($link) > 43 ?
                substr($link,0,40).'...' : $link;
    }
 }

Primitive page reader

 use WWW::Spyder;
 use Text::Wrap;

 my $url = shift || die "Please give me a URL to start!\n";
 @ARGV or die "Please also give me a search term.\n";
 my $spyder = WWW::Spyder->new;
 $spyder->seed($url);
 $spyder->terms(@ARGV);

 while ( my $page = $spyder->crawl ) {
     print '-'x70,"\n * ";
     print $page->title, "\n";
     print '-'x70,"\n";
     print wrap('','', $page->text);
     sleep 60;
 }

TIPS

If you are going to do anything important with it, implement some signal blocking to prevent accidental problems and tie your gathered information to a DB_File or some such.
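One way this tip could look in practice is sketched below. It makes several assumptions: SDBM_File (a core module) stands in for DB_File, which can be tied in the same way; the signal is deferred with a handler rather than blocked outright; and the list of pages is hypothetical sample data in place of a real $spyder->crawl loop, so the sketch runs without network access.

```perl
use strict;
use warnings;
use Fcntl;                       # O_RDWR, O_CREAT
use SDBM_File;                   # core stand-in for DB_File
use File::Temp qw(tempdir);

# Assumption: results go into a throwaway directory for this demo.
my $dir = tempdir( CLEANUP => 1 );

tie my %seen, 'SDBM_File', "$dir/spyder_results", O_RDWR | O_CREAT, 0666
    or die "Cannot tie results file: $!";

# Let Ctrl-C request a clean stop instead of killing the script mid-write.
my $stop = 0;
local $SIG{INT} = sub { $stop = 1 };

# Hypothetical sample data standing in for pages returned by crawl().
my @pages = (
    [ 'http://example.com/'  => 'Home' ],
    [ 'http://example.com/a' => 'Page A' ],
);

for my $page (@pages) {
    last if $stop;               # finish the current page, then leave
    my ( $url, $title ) = @$page;
    $seen{$url} = $title;        # stored on disk through the tie
}

print "Stored ", scalar( keys %seen ), " pages\n";
untie %seen;
```

Since the handler only sets a flag, an interrupt cannot cut a write to the tied hash in half; the loop simply ends at the next iteration.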

You might want to call POSIX::nice(40). It should top the nice off at your system's max and prevent your spyder from interfering with your system.

You might want to set $| = 1 so that output is flushed as soon as it is printed.
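Both of these tips fit in a few lines at the top of a script. The following is only a sketch; it assumes a platform where nice() is available, and the warning text is made up for illustration.

```perl
use strict;
use warnings;
use POSIX ();

# Ask for the lowest scheduling priority. The system clamps the increment
# to its own maximum; POSIX::nice returns undef on failure.
defined( POSIX::nice(40) )
    or warn "Could not lower priority: $!";

# Turn on autoflush for the currently selected handle (STDOUT by default)
# so progress lines show up as each page is crawled.
$| = 1;

print "Crawler running at low priority with unbuffered output\n";
```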

PRIVATE METHODS

are private but hack away if you're inclined

TO DO

Spyder is conceived to live in a future namespace as a servant class for a complex web research agent with simple interfaces to pre-designed grammars for research reports; or self-designed grammars/reports (might be implemented via Parse::FastDescent if that lazy-bones Conway would just find another 5 hours in the paltry 32 hour day he's presently working).

I'd like the thing to be able to parse RTF, PDF, and perhaps even resource sections of image files but that isn't on the radar right now.

The tests should work differently. Currently they ask for outside resources without checking if there is either an open way to do it or if the user approves of it. Bad form all around.

TO DOABLE BY 1.0

Add 2-4 sample scripts that are a bit more useful.

There are many functions that should be under the programmer's control and not buried in the spyder. They will emerge soon. I'd like to put in hooks to allow the user to keep(), toss(), or exclude() URLs, link names, and domains while crawling.

Clean up some redundant, sloppy, and weird code. Probably change or remove the AUTOLOAD.

Put in a go_to_seed() method and a subclass, ::Seed, with rules to construct query URLs by search engine. It would be the autostart or the fallback for perpetual spyders that run out of links. It would hit a given or default search engine with the Spyder's terms as the query. Obviously this would only work with terms() defined.

Implement auto-exclusion for failure vs. success rates on names as well as domains (maybe URI suffixes too).

Turn length of courtesy queue into the breadth/depth setting? make it automatically adjusting...?

Consistently found link names are excluded from term strength sorting? Eg: "privacy policy," "read more," "copyright..."

Fix some image tag parsing problems and add area tag parsing.

Configuration for user:password by domain.

::Page objects become reusable so that a spyder only needs one.

::Enqueue objects become indexed so they are nixable from anywhere.

Expand exit_on routines to size, slept time, dwindling success ratio, and maybe more.

Make methods to set "skepticism" and "effort" which will influence the way the terms are used to keep, order, and toss URLs.

BE WARNED

This module already does some extremely useful things but it's in its infancy and it is conceived to live in a different namespace and perhaps become more private as a subservient part of a parent class. This may never happen but it's the idea. So don't put this into production code yet. I am endeavoring to keep its interface constant either way. That said, it could change completely.

Also!

This module saves cookies to the user's home. There will be more control over cookies in the future, but that's how it is right now. They live in $ENV{HOME}/spyderCookie.
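Until that control exists, a script can locate and clear the cookie file itself. The sketch below relies only on the path given above; the helper names spyder_cookie_path and clear_spyder_cookies are hypothetical and not part of the module, and the optional argument for an alternate home directory is an assumption added for convenience.

```perl
use strict;
use warnings;
use File::Spec;

# Build the cookie file location described above: $ENV{HOME}/spyderCookie.
# An alternate home directory may be passed in (hypothetical convenience).
sub spyder_cookie_path {
    my ($home) = @_;
    $home = $ENV{HOME} unless defined $home;
    die "No home directory known\n" unless defined $home;
    return File::Spec->catfile( $home, 'spyderCookie' );
}

# Hypothetical helper: remove the cookie file if it exists.
# Returns 1 when a file was deleted and 0 when there was nothing to do.
sub clear_spyder_cookies {
    my $path = spyder_cookie_path(@_);
    return 0 unless -e $path;
    unlink $path or die "Cannot remove $path: $!";
    return 1;
}

print "Cookie file: ", spyder_cookie_path(), "\n" if defined $ENV{HOME};
```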

Anche!

Robot Rules aren't respected. Spyder endeavors to be polite as far as server hits are concerned, but doesn't take "no" for an answer right now. I want to add this, and not just by domain, but by page settings.
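In the meantime, a script can check robot rules on its own before acting on a URL. The sketch below uses WWW::RobotRules, one of the modules listed under COMPARE WITH; it is not something Spyder does for you. The agent name and domain are taken from the synopsis, and the robots.txt content is made up and inlined so the example runs without network access.

```perl
use strict;
use warnings;
use WWW::RobotRules;

# The agent name is an assumption; use whatever your UA reports.
my $rules = WWW::RobotRules->new('PreferredAgent/0.01');

# Normally this text is fetched from the site; hypothetical content is
# inlined here instead.
my $robots_url = 'http://my-great-domain.com/robots.txt';
my $robots_txt = <<'END';
User-agent: *
Disallow: /private/
END

$rules->parse( $robots_url, $robots_txt );

for my $url ( 'http://my-great-domain.com/index.html',
              'http://my-great-domain.com/private/notes.html' ) {
    my $verdict = $rules->allowed($url) ? 'allowed' : 'disallowed';
    print "$url is $verdict\n";
}
```

Note that allowed() returns -1, which is true in Perl, for hosts whose robots.txt has not been parsed, so each host's rules need to be loaded before the check means anything.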

UNDOCUMENTED FEATURES

A.k.a. Bugs. Don't be ridiculous! Bugs in my code?!

I think there is a bug that causes image src attributes to be retrieved as links, but I haven't tracked it down yet. I also think the plain text parsing has some problems which will be remedied shortly.

If you are building more than one spyder in the same script they are going to share the same exit_on parameters because it's a self-installing method. This will not always be so.

See Bugs file for more open and past issues.

Let me know if you find any others. If you find one that is platform specific, please send patch code/suggestion because I might not have any idea how to fix it.

WHY Spyder?

I didn't want to use the more appropriate Spider because I think there is a better one out there somewhere in the zeitgeist and the namespace future of Spyder is uncertain. It may end up a semi-private part of a bigger family. And I may be King of Kenya someday. One's got to dream.

If you like Spyder, have feedback, wishlist usage, better algorithms/implementations for any part of it, please let me know!

THANKS TO

Most all y'all. Especially Lincoln Stein, Gisle Aas, The Conway, Raphael Manfredi, Gurusamy Sarathy, and plenty of others.

COMPARE WITH (PROBABLY PREFER)

WWW::Robot, LWP::UserAgent, WWW::SimpleRobot, WWW::RobotRules, LWP::RobotUA, and other kith and kin.

LICENCE AND COPYRIGHT

Copyright (c) 2001-2008, Ashley Pond V <ashley@cpan.org>. All rights reserved.

This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself. See perlartistic.

DISCLAIMER OF WARRANTY

BECAUSE THIS SOFTWARE IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY FOR THE SOFTWARE, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE SOFTWARE "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE SOFTWARE IS WITH YOU. SHOULD THE SOFTWARE PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR, OR CORRECTION.

IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR REDISTRIBUTE THE SOFTWARE AS PERMITTED BY THE ABOVE LICENCE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE SOFTWARE (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE SOFTWARE TO OPERATE WITH ANY OTHER SOFTWARE), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.