dezibot - parallel web crawler
    # crawl 2 sites
    % dezibot http://dezi.org http://swish-e.org

    # crawl a list of sites
    % dezibot --urls file_with_urls

    # pass in stored config
    % dezibot --config botconfig.pl

    # crawl in parallel
    % dezibot --workers 5 --urls file_with_urls
dezibot is a command line tool wrapping the Dezi::Bot module.
The following options are supported.

--help
Print this message.

--debug
Spew lots of information to stderr. Overrides any setting in --config.

--verbose
Print some status information to stderr. Overrides any setting in --config.

--config file
Read config from file using Config::Any. The parsed config is passed directly to Dezi::Bot->new().
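
Config::Any chooses a parser based on the file extension. A .pl file is evaluated as Perl and should return a hash reference, which then becomes the arguments to Dezi::Bot->new(). Here is a minimal sketch of a botconfig.pl; the keys shown are hypothetical placeholders, so consult the Dezi::Bot documentation for the parameters it actually accepts:

    # botconfig.pl -- evaluated as Perl by Config::Any;
    # the file's return value (a hashref) is the parsed config
    {
        # hypothetical keys, for illustration only
        agent => 'dezibot-example/0.01',
        delay => 5,    # seconds to wait between requests
    };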

--urls file
Read URLs to crawl from file. Lines starting with whitespace or # are ignored.
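
For example, given the rules above, only the first two URLs in this file would be crawled; the comment line and the line beginning with a space are skipped:

    http://dezi.org/
    http://swish-e.org/
    # a comment, ignored
     http://example.com/ignored-because-indented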

--workers n
Spawn n workers to crawl in parallel. The default is to crawl serially. If n is less than the number of URLs, the list of URLs will be sliced and apportioned among the n workers according to --pool_size.

--pool_size n
The max number of URLs per worker. Default is to divide the number of URLs by the number of workers, but you may want to set n lower to minimize wait time between crawls.
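
To make the slicing concrete: with 10 URLs and --workers 3, dividing evenly yields a pool size of roughly 3 to 4 URLs per worker. The Perl below sketches one plausible apportioning scheme; it is an illustration only, not Dezi::Bot's actual implementation, and rounding the default pool size up is an assumption:

    use POSIX qw(ceil);

    my @urls    = map { "http://example.com/page$_" } 1 .. 10;
    my $workers = 3;

    # default pool size: URL count divided by worker count
    # (rounded up here, by assumption, so no URL is left over)
    my $pool_size = ceil( @urls / $workers );    # 4

    # slice the URL list into batches of at most $pool_size
    my @batches;
    push @batches, [ splice @urls, 0, $pool_size ] while @urls;

    # the 3 workers receive 4, 4 and 2 URLs respectively;
    # a smaller --pool_size means more, smaller batches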
Peter Karman, <karman at cpan.org>
Please report any bugs or feature requests to
bug-dezi-bot at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Dezi-Bot. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.
You can find documentation for this module with the perldoc command.

    perldoc Dezi::Bot

You can also look for information at the RT queue listed above.
Copyright 2013 Peter Karman.
This program is free software; you can redistribute it and/or modify it under the terms of either: the GNU General Public License as published by the Free Software Foundation; or the Artistic License.
See http://dev.perl.org/licenses/ for more information.