Combine - Focused Web crawler framework
combine --jobname <name> --logname <id>
jobname is used to find the appropriate configuration (mandatory)
logname is used as an identifier in the log (in the MySQL table log)
Does crawling, parsing, and an optional topic check, and stores the result in a MySQL database. Normally started with the combineCtrl command. Briefly, it gets a URL from the MySQL database, which acts as a common coordinator for a Combine job. The Web page is fetched, provided it passes the robot exclusion protocol. The HTML is cleaned using Tidy and parsed into metadata, headings, text, links and link anchors. Then it is stored in the MySQL database in a structured form (optionally only if it passes a topic check, which keeps the crawler focused).
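The per-URL cycle described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Combine's implementation (which is written in Perl): a plain list and dict stand in for the MySQL database, the Tidy cleaning step is left out, the topic check is a naive substring match, and the names PageParser, crawl_one, and "example-bot" are hypothetical. Only Python standard-library modules are used.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class PageParser(HTMLParser):
    """Collects title, headings, text, and (URL, anchor text) pairs."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.title, self.headings, self.text, self.links = "", [], [], []
        self._context = None  # "title", "heading", or None
        self._href = None     # absolute URL of the <a> currently open
        self._anchor = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._context = "title"
        elif tag in HEADINGS:
            self._context = "heading"
            self.headings.append("")
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = urljoin(self.base_url, href)
                self._anchor = []

    def handle_endtag(self, tag):
        if tag == "title" or tag in HEADINGS:
            self._context = None
        elif tag == "a" and self._href:
            self.links.append((self._href, " ".join(self._anchor)))
            self._href = None

    def handle_data(self, data):
        data = data.strip()
        if not data:
            return
        if self._context == "title":
            self.title += data
        elif self._context == "heading":
            self.headings[-1] += data
        else:
            self.text.append(data)
        if self._href:
            self._anchor.append(data)


def crawl_one(queue, store, fetch, robots, topic_terms=None, agent="example-bot"):
    """Process one URL from the queue; return True if a record was stored."""
    url = queue.pop(0)
    if not robots.can_fetch(agent, url):  # robot exclusion check
        return False
    parser = PageParser(url)
    parser.feed(fetch(url))
    parser.close()
    body = " ".join(parser.text).lower()
    # Naive stand-in for a topic check: keep pages mentioning any term.
    if topic_terms and not any(term in body for term in topic_terms):
        return False
    store[url] = {"title": parser.title, "headings": parser.headings,
                  "text": parser.text, "links": parser.links}
    return True


# Demo with canned pages instead of real HTTP requests.
robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /private/"])
pages = {
    "http://example.org/a.html":
        "<html><head><title>Crawlers</title></head>"
        "<body><h1>Focused crawling</h1><p>About web crawlers.</p>"
        '<a href="b.html">next page</a></body></html>',
}
queue = ["http://example.org/private/x.html", "http://example.org/a.html"]
store = {}
while queue:
    crawl_one(queue, store, pages.get, robots, topic_terms=["crawler"])
```

In this sketch, extracted links are kept with the stored record rather than being queued immediately, which loosely mirrors scheduling new links as a separate step.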
A simple workflow for a trivial crawl job might look like:
Initialize database and configuration

    combineINIT --jobname aatest

Enter some seed URLs from a file with a list of URLs

    combineCtrl load --jobname aatest < seedURLs.txt

Start 2 crawl processes

    combineCtrl start --jobname aatest --harvesters 2

For some time occasionally schedule new links for crawling

    combineCtrl recyclelinks --jobname aatest

or look at the size of the ready queue

    combineCtrl stat --jobname aatest

When satisfied kill the crawlers

    combineCtrl kill --jobname aatest

Export data records in a highly structured XML format

    combineExport --jobname aatest
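If the workflow above is to be driven from a script, one possible approach is a small Python wrapper that builds the same command lines. This is a sketch only: the function names workflow and load_seeds are hypothetical, only the commands and options shown in the workflow are used, and it is an assumption that the tools report failure with a non-zero exit status.

```python
import subprocess


def workflow(jobname, harvesters=2):
    """Return the argument lists for each step of a trivial crawl job."""
    job = ["--jobname", jobname]
    return {
        "init": ["combineINIT", *job],
        "load": ["combineCtrl", "load", *job],  # seed URLs are read from stdin
        "start": ["combineCtrl", "start", *job, "--harvesters", str(harvesters)],
        "recyclelinks": ["combineCtrl", "recyclelinks", *job],
        "stat": ["combineCtrl", "stat", *job],
        "kill": ["combineCtrl", "kill", *job],
        "export": ["combineExport", *job],
    }


def load_seeds(jobname, seed_file):
    """Equivalent of: combineCtrl load --jobname <jobname> < <seed_file>"""
    with open(seed_file) as seeds:
        # check=True assumes a non-zero exit status signals failure.
        subprocess.run(workflow(jobname)["load"], stdin=seeds, check=True)
```

For example, subprocess.run(workflow("aatest")["stat"], check=True) would run the same command as the "size of the ready queue" step.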
For more complex jobs you have to edit the job configuration file.
combineINIT, combineCtrl
Combine configuration documentation in /usr/share/doc/combine/.
Anders Ardö, <anders.ardo@it.lth.se>
Copyright (C) 2005 Anders Ardö
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.4 or, at your option, any later version of Perl 5 you may have available.
See the file LICENCE included in the distribution at http://combine.it.lth.se/