SWISH::Prog::Aggregator::Spider - web aggregator
 use SWISH::Prog::Aggregator::Spider;
 my $spider = SWISH::Prog::Aggregator::Spider->new(
     indexer => SWISH::Prog::Indexer->new
 );

 $spider->indexer->start;
 $spider->crawl( 'http://swish-e.org/' );
 $spider->indexer->finish;
SWISH::Prog::Aggregator::Spider is a web crawler similar to the spider.pl script in the Swish-e 2.4 distribution. Internally, SWISH::Prog::Aggregator::Spider uses LWP::RobotUA to do the hard work. See SWISH::Prog::Aggregator::Spider::UA.
All params have their own get/set methods too. They include:
Get/set the user-agent string reported by the user agent.
Get/set the email string reported by the user agent.
Flag as to whether each URI's content should be fingerprinted and compared. Useful if the same content is available under multiple URIs and you only want to index it once.
Get/set the SWISH::Prog::Cache-derived object used to track which URIs have been fetched already.
If use_md5() is true, this SWISH::Prog::Cache-derived object tracks the URI fingerprints.
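The fingerprinting idea can be illustrated with a minimal, self-contained sketch. It assumes an MD5 hex digest of the fetched body is used as the fingerprint and uses a plain hash in place of a SWISH::Prog::Cache-derived object; the helper name is_new_content() and the example URIs are hypothetical and not part of this module.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical stand-in for a fingerprint cache: a plain hash keyed by digest.
my %seen_fingerprint;

# Returns 1 the first time a given body is seen, 0 for any later duplicate.
sub is_new_content {
    my ($content) = @_;
    my $fingerprint = md5_hex($content);
    return 0 if $seen_fingerprint{$fingerprint}++;
    return 1;
}

# Hypothetical pages: two URIs serve the same body.
my %pages = (
    'http://example.com/'           => '<html>home</html>',
    'http://example.com/index.html' => '<html>home</html>',
    'http://example.com/about'      => '<html>about</html>',
);

for my $uri ( sort keys %pages ) {
    my $status = is_new_content( $pages{$uri} ) ? 'index' : 'skip duplicate';
    print "$uri: $status\n";
}
```

Keying the cache on the digest rather than the URI is what lets the same content reached through different URIs be indexed only once.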
Get/set the SWISH::Prog::Queue-derived object for tracking which URIs still need to be fetched.
Get/set the SWISH::Prog::Aggregator::Spider::UA object.
How many levels of links to follow. NOTE: This value describes the number of links from the first argument passed to crawl.
Default is unlimited depth.
This optional key will set the max minutes to spider. Spidering for this host will stop after max_time seconds, and move on to the next server, if any. The default is to not limit by time.
This optional key sets the max number of files to spider before aborting. The default is to not limit by number of files. This is the number of requests made to the remote server, not the total number of files to index (see max_indexed). This count is displayed at the end of indexing as
This feature can (and perhaps should) be used to prevent runaway spidering when crawling a web site where dynamic content may generate an endless supply of unique URLs.
This optional key sets the max size of a file read from the web server. This defaults to 5,000,000 bytes. If the size is exceeded the resource is truncated per LWP::UserAgent.
Set max_size to zero for unlimited size.
This optional parameter will skip any URIs that do not report having been modified since date. The Last-Modified HTTP header is used to determine modification time.
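As a rough illustration of a Last-Modified comparison, the sketch below parses an RFC 1123 date with the core Time::Piece module and compares it to a cutoff. It makes several assumptions that the text above does not confirm: the cutoff is held as epoch seconds, and a response with a missing or unparseable header is kept rather than skipped. The helper names are hypothetical.

```perl
use strict;
use warnings;
use Time::Piece;

# Parse an HTTP date such as "Wed, 21 Oct 2015 07:28:00 GMT" into epoch
# seconds. Returns undef when the header is missing or cannot be parsed.
sub http_date_to_epoch {
    my ($header) = @_;
    return undef unless defined $header;
    my $t = eval {
        Time::Piece->strptime( $header, '%a, %d %b %Y %H:%M:%S GMT' );
    };
    return defined $t ? $t->epoch : undef;
}

# Hypothetical helper: 1 when the document should be fetched, 0 to skip it.
# Assumption: documents without a usable Last-Modified header are kept.
sub modified_after {
    my ( $last_modified_header, $since_epoch ) = @_;
    my $mtime = http_date_to_epoch($last_modified_header);
    return 1 unless defined $mtime;
    return $mtime > $since_epoch ? 1 : 0;
}

my $since = http_date_to_epoch('Thu, 01 Jan 2015 00:00:00 GMT');
for my $header (
    'Wed, 21 Oct 2015 07:28:00 GMT',
    'Mon, 01 Dec 2014 12:00:00 GMT'
    )
{
    my $action = modified_after( $header, $since ) ? 'fetch' : 'skip';
    print "$header => $action\n";
}
```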
This optional parameter will enable keep alive requests. This can dramatically speed up spidering and reduce the load on the server being spidered. The default is to not use keep alives, although enabling it will probably be the right thing to do.
To get the most out of keep alives, you may want to set up your web server to allow a lot of requests per single connection (e.g., MaxKeepAliveRequests on Apache). Apache's default is 100, which should be good.
When a connection is not closed the spider does not wait the "delay" time when making the next request. In other words, there is no delay in requesting documents while the connection is open.
Note: you must have at least libwww-perl-5.53_90 installed to use this feature.
Get/set the number of seconds to wait between making requests. Default is 5 seconds (a very friendly delay).
Get/set the number of seconds to wait before considering the remote server unresponsive. The default is 10.
CODE reference to fetch username/password credentials when necessary. See also
Number of seconds to wait before skipping manual prompt for username/password.
password pair to be used when prompted by the server.
By default, 3xx responses from the server will be followed when they are on the same hostname. Set to false (0) to not follow redirects.
Microsoft server hack.
ARRAY ref of hostnames to be treated as identical to the original host being spidered. By default the spider will not follow links to different hosts.
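The host comparison described above can be sketched as follows. This is only an illustration under stated assumptions: hostnames are compared case-insensitively, a simple regex stands in for proper URI parsing, and the helper names host_of() and is_same_host() are hypothetical rather than methods of this module.

```perl
use strict;
use warnings;

# Extract the lowercased hostname from an absolute http(s) URL.
# Deliberately simple; it ignores userinfo and other URI edge cases.
sub host_of {
    my ($url) = @_;
    return $url =~ m{^https?://([^/:?#]+)}i ? lc $1 : undef;
}

# Hypothetical check: follow a link only if its host is the starting
# host or one of the declared aliases.
sub is_same_host {
    my ( $link, $base, $aliases ) = @_;
    my $host = host_of($link);
    return 0 unless defined $host;
    my %allowed = map { lc($_) => 1 } ( host_of($base), @{ $aliases || [] } );
    return $allowed{$host} ? 1 : 0;
}

my $base    = 'http://swish-e.org/';
my $aliases = ['www.swish-e.org'];
for my $link ( 'http://www.swish-e.org/docs/', 'http://example.com/' ) {
    my $action = is_same_host( $link, $base, $aliases ) ? 'follow' : 'skip';
    print "$link => $action\n";
}
```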
Initializes a new spider object. Called by new().
Returns true if uri is acceptable for including in an index. The 'ok-ness' of the uri is based on its base, robot rules, and the spider configuration.
Add uri to the queue.
Return next uri from queue.
Returns the next URI from the queue() as a SWISH::Prog::Doc object, or the error message if there was one.
Returns undef if the queue is empty or max_depth() has been reached.
Called internally when the server returns a 401 or 403 response. Will attempt to determine the correct credentials for uri based on the previous attempt in response and what you have configured in credentials, authn_callback or when manually prompted.
Called internally to perform naive heuristics on http_response to determine whether it looks like an XML feed of some kind, rather than an HTML page.
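The module's actual heuristics are not documented here. Purely as an illustration of what a naive feed check might look like, the sketch below inspects a Content-Type value and the start of the body. The function name probably_feed(), the media types tested, and the 512-character window are all assumptions.

```perl
use strict;
use warnings;

# Hypothetical heuristic: decide from a Content-Type value and the start
# of the body whether a response looks like an RSS or Atom feed.
sub probably_feed {
    my ( $content_type, $body ) = @_;
    $content_type = '' unless defined $content_type;
    $body         = '' unless defined $body;

    # An explicit feed media type is the strongest signal.
    return 1 if $content_type =~ m{application/(?:rss|atom)\+xml}i;

    # Otherwise look for a feed root element near the top of the body.
    my $head = substr( $body, 0, 512 );
    return 1 if $head =~ m{<(?:rss|feed|rdf:RDF)\b}i;

    return 0;
}

print probably_feed( 'text/xml',  '<?xml version="1.0"?><rss version="2.0">' )
    ? "feed\n" : "not a feed\n";
print probably_feed( 'text/html', '<html><body>hello</body></html>' )
    ? "feed\n" : "not a feed\n";
```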
Implements the required crawl() method. Recursively fetches uri and its child links to a depth set in max_depth().
Will quit after max_files() unless max_files==0.
Will quit after max_time() seconds unless max_time==0.
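The interaction of the depth and file limits can be sketched with a small breadth-first loop over an in-memory link graph. This is not the module's implementation: the %links graph replaces real HTTP fetches, crawl_sketch() is a hypothetical name, a depth of 0 is assumed to mean unlimited, and the time limit is omitted for brevity.

```perl
use strict;
use warnings;

# Hypothetical link graph standing in for real HTTP fetches.
my %links = (
    '/'  => [ '/a', '/b' ],
    '/a' => ['/c'],
    '/b' => [ '/c', '/' ],
    '/c' => ['/d'],
    '/d' => [],
);

# Breadth-first crawl; 0 means "no limit" for either cap.
sub crawl_sketch {
    my ( $start, $max_depth, $max_files ) = @_;
    my @queue = ( [ $start, 0 ] );
    my %seen  = ( $start => 1 );
    my @fetched;

    while ( my $item = shift @queue ) {
        my ( $uri, $depth ) = @$item;
        push @fetched, $uri;

        # Stop once the request cap has been reached.
        last if $max_files && @fetched >= $max_files;

        # Do not queue children beyond the depth limit.
        next if $max_depth && $depth >= $max_depth;
        for my $child ( @{ $links{$uri} || [] } ) {
            next if $seen{$child}++;
            push @queue, [ $child, $depth + 1 ];
        }
    }
    return @fetched;
}

print join( ' ', crawl_sketch( '/', 1, 0 ) ), "\n";
print join( ' ', crawl_sketch( '/', 0, 0 ) ), "\n";
```

Depth is counted in links from the starting URI, so a limit of 1 fetches the start page and the pages it links to directly.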
Passes args to SWISH::Prog::Utils::write_log().
Pass through to SWISH::Prog::Utils::write_log_line().
Peter Karman, <email@example.com>
Please report any bugs or feature requests to
bug-swish-prog at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=SWISH-Prog. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.
You can find documentation for this module with the perldoc command.
You can also look for information at:
Copyright 2008-2009 by Peter Karman
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.