NAME

WWW::Spider - flexible Internet spider for fetching and analyzing websites

VERSION

This document describes WWW::Spider version 0.01_10

SYNOPSIS

 #configuration
 my $spider=new WWW::Spider;
 $spider=new WWW::Spider({UASTRING=>"mybot"});
 
 print $spider->uastring;
 $spider->uastring('New UserAgent String');
 $spider->user_agent(new LWP::UserAgent);
 
 #basic stuff
 print $spider->get_page_response('http://search.cpan.org/')->content;
 print $spider->get_page_content('http://search.cpan.org/');
 $spider->get_links_from('http://google.com/'); #get array of URLs
 
 #registering hooks
 
 #crawling

DESCRIPTION

WWW::Spider is a customizable Internet spider intended to be used for fetching and analyzing websites. Features include:

  • basic methods for high-level HTML handling

  • a customizable mechanism for retrieving pages

  • callbacks for events such as fetched pages and errors

  • caching

  • thread-safe operation, with optional multithreaded operation for speed

  • a high-level implementation of a 'graph' of either pages or sites (as defined by the callback) that can be analyzed

FUNCTIONS

PARAMETERS

Parameter getting and setting functions.

new WWW::Spider([%params])

Constructor for WWW::Spider

Arguments include:

  • UASTRING

    The useragent string to be used. The default is "WWW::Spider"

  • USER_AGENT

    The LWP::UserAgent to use. If this is specified, the UASTRING argument is ignored.
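
    For example, a spider might be constructed in either of these ways (a minimal sketch; the useragent string and timeout value are arbitrary):

     use WWW::Spider;
     use LWP::UserAgent;

     # custom useragent string only
     my $spider=new WWW::Spider({UASTRING=>'mybot/1.0'});

     # or supply a preconfigured LWP::UserAgent (UASTRING would then be ignored)
     my $ua=LWP::UserAgent->new(timeout=>10);
     $spider=new WWW::Spider({USER_AGENT=>$ua});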

->user_agent [LWP::UserAgent]

Returns/sets the user agent being used by this object.

->uastring [STRING]

Returns/sets the user agent string being used by this object.

GENERAL

These functions could be implemented anywhere - nothing about what they do is specific to WWW::Spider. They are mainly convenience functions for the rest of the code.

->get_page_content URL

Returns the contents of the page at URL.

->get_page_response URL

Returns the HTTP::Response object corresponding to URL.
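
As a sketch of typical use, the response can be checked before its body is used (the URL is only an example):

 my $response=$spider->get_page_response('http://search.cpan.org/');
 if($response->is_success) {
     print $response->content;
 } else {
     warn "fetch failed: ".$response->status_line."\n";
 }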

SPIDER

These functions implement the spider functionality.

->crawl URL MAX_DEPTH

Crawls URL to the specified maximum depth. This is implemented as a breadth-first search.

The default value for MAX_DEPTH is 0.
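
A minimal sketch of a crawl, assuming that hooks doing the actual work have already been registered (see CALLBACKS AND HOOKS below); the URL and depths are illustrative:

 # follow links up to two levels away from the starting page
 $spider->crawl('http://search.cpan.org/', 2);

 # a MAX_DEPTH of 0 handles only the starting page itself
 $spider->crawl('http://search.cpan.org/', 0);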

->handle_url URL

The same as crawl(URL,0).

->crawl_content STRING [$MAX_DEPTH] [$SOURCE]

Treats STRING as if it were encountered during a crawl, with a remaining maximum depth of MAX_DEPTH. The crawl is implemented as a breadth-first search using Thread::Queue.

The default value for MAX_DEPTH is 0.

The assumption is made that handlers have already been called on this page (otherwise, implementation would be impossible).
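
For example, content fetched by other means could be fed into a crawl like this (a sketch; the URL and depth are illustrative):

 my $content=$spider->get_page_content('http://search.cpan.org/');
 # treat $content as a page found at that URL and follow its links one level deep
 $spider->crawl_content($content, 1, 'http://search.cpan.org/');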

->handle_response HTTP::RESPONSE

Handles the HTTP response, calling the appropriate hooks, without crawling any other pages.

->get_links_from URL

Returns a list of URLs linked to from URL.

->get_links_from_content STRING [$SOURCE]

Returns a list of URLs linked to in STRING. When a URL is discovered that is not complete, it is completed by assuming that it was found on SOURCE. If no source page is specified, incomplete URLs are treated as if they were linked to from http://localhost/.

SOURCE must be a valid and complete URL.
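
A short sketch of link extraction ($content stands for an already-fetched page body, and the URLs are illustrative):

 # absolute URLs found on a live page
 my @links=$spider->get_links_from('http://search.cpan.org/');

 # links found in an already-fetched string, resolved against its source page
 my @more=$spider->get_links_from_content($content, 'http://search.cpan.org/');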

CALLBACKS AND HOOKS

All hook registration and deletion functions are considered atomic. If five hooks have been registered and then all of them are deleted in one operation, there will be no page for which fewer than five but more than zero of those hooks are called (unless more hooks are added afterward).

The legal hook strings are:

  • handle-page

    Called whenever a crawlable page is reached.

    Arguments: CONTENT, URL

    Return:

  • handle-response

    Called on any HTTP response, successful, crawlable, or otherwise.

    Arguments:

    Return:

  • handle-failure

    Called on any failed HTTP response.

    Arguments:

    Return:

Functions for handling callbacks are:

->call_hooks HOOK-STRING, @ARGS

Calls all of the registered HOOK-STRING callbacks with @ARGS. This function returns a list of all of the return values (in some unspecified order) which are to be handled appropriately by the caller.
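
As a sketch, the registered page hooks could be invoked by hand like this ($content and $url are illustrative variables holding the documented arguments):

 # call every registered handle-page hook with the documented arguments
 my @results=$spider->call_hooks('handle-page', $content, $url);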

->register_hook HOOK-STRING, SUB, [{OPTIONS}]

Registers a subroutine to be run on HOOK-STRING. Has no return value. Valid options are:

  • FORK

    Set to a non-zero value if you want this hook to be run in a separate thread. This means that, among other things, the return value will not have the same effect (or even a well-defined effect).
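
A minimal sketch of registering hooks before a crawl (the handler bodies are illustrative):

 # report every page as it is handled
 $spider->register_hook('handle-page', sub {
     my ($content, $url)=@_;
     printf "fetched %s (%d bytes)\n", $url, length $content;
 });

 # run a slower handler in its own thread
 $spider->register_hook('handle-page', sub {
     my ($content, $url)=@_;
     # long-running analysis could go here
 }, {FORK=>1});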

->get_hooks [HOOK-STRING]

Returns all hooks corresponding to HOOK-STRING. If HOOK-STRING is not given, returns all hooks.

->clear_hooks [HOOK-STRING]

Removes all hooks corresponding to HOOK-STRING. If HOOK-STRING is not given, it deletes all hooks.
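
A short sketch of hook inspection and removal:

 my @page_hooks=$spider->get_hooks('handle-page');  # hooks for one event
 my @all_hooks=$spider->get_hooks();                # every registered hook

 $spider->clear_hooks('handle-page');  # remove only the page hooks
 $spider->clear_hooks();               # remove every registered hook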

BUGS AND LIMITATIONS

  • Hooks are not yet fully implemented.

  • Hook list modifications are not atomic.

MODULE DEPENDENCIES

WWW::Spider depends on several other modules that allow it to get and parse HTML code. Currently used are:

  • Carp

  • LWP::UserAgent

  • HTTP::Request

  • Thread::Queue

  • Thread::Resource::RWLock

Other modules will likely be added to this list in the future. Candidates are:

  • HTML::*

SEE ALSO

  • WWW::Robot

    Another web crawler, with rather different capabilities.

  • WWW::Spider::Graph

    Implementation of a graph based on WWW::Spider.

  • WWW::Spider::Hooklist

    A thread-safe list of hooks.

AUTHOR

WWW::Spider is written and maintained by Scott Lawrence (bytbox@gmail.com)

COPYRIGHT AND LICENSE

Copyright 2009 Scott Lawrence, all rights reserved.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.