NAME

Combine - Focused Web crawler framework

SYNOPSIS

combine --jobname <name> --logname <id>

OPTIONS AND ARGUMENTS

jobname is used to find the appropriate configuration (mandatory)

logname is used as an identifier in the log (the MySQL table log)

DESCRIPTION

Crawls and parses Web pages, optionally applies a topic check, and stores the results in a MySQL database. Normally started with the combineCtrl command. Briefly, it gets a URL from the MySQL database, which acts as a common coordinator for a Combine job. The Web page is fetched, provided it passes the robot exclusion protocol. The HTML is cleaned using Tidy and parsed into metadata, headings, text, links and link anchors. The record is then stored in the MySQL database in a structured form; if a topic check is configured, the record is stored only when it passes the check, which keeps the crawler focused.

A simple workflow for a trivial crawl job might look like:

    Initialize database and configuration
  combineINIT --jobname aatest
    Enter some seed URLs from a file with a list of URLs
  combineCtrl load --jobname aatest < seedURLs.txt
    Start 2 crawl processes
  combineCtrl start --jobname aatest --harvesters 2

    For some time, occasionally schedule new links for crawling
  combineCtrl recyclelinks --jobname aatest
    or look at the size of the ready queue
  combineCtrl stat --jobname aatest

    When satisfied, kill the crawlers
  combineCtrl kill --jobname aatest
    Export data records in a highly structured XML format
  combineExport --jobname aatest
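
The workflow above could be wrapped in a small shell script. The following is a minimal sketch under stated assumptions, not part of the Combine distribution: the variables JOB, SEEDS, ROUNDS, PAUSE and DRY_RUN and the functions run and crawl_job are hypothetical names chosen for illustration, and only the commands and options shown above are used. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: drive the trivial crawl workflow from one script.
# All variable and function names here are hypothetical, chosen for this example.
set -eu

JOB="${JOB:-aatest}"            # job name passed to --jobname
SEEDS="${SEEDS:-seedURLs.txt}"  # file with one seed URL per line
ROUNDS="${ROUNDS:-3}"           # how many times to recycle links
PAUSE="${PAUSE:-600}"           # seconds to wait before each round
DRY_RUN="${DRY_RUN:-1}"         # 1 = only print the commands, 0 = run them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

crawl_job() {
    # Initialize database and configuration
    run combineINIT --jobname "$JOB"

    # Load seed URLs; the redirection is handled here because run()
    # cannot print a redirection it never sees.
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ combineCtrl load --jobname $JOB < $SEEDS"
    else
        combineCtrl load --jobname "$JOB" < "$SEEDS"
    fi

    # Start 2 crawl processes
    run combineCtrl start --jobname "$JOB" --harvesters 2

    # Periodically schedule new links and report the queue status
    i=0
    while [ "$i" -lt "$ROUNDS" ]; do
        run sleep "$PAUSE"
        run combineCtrl recyclelinks --jobname "$JOB"
        run combineCtrl stat --jobname "$JOB"
        i=$((i + 1))
    done

    # Stop the crawlers and export the collected records
    run combineCtrl kill --jobname "$JOB"
    run combineExport --jobname "$JOB"
}

crawl_job
```

Saved under a file name of your choice (crawl.sh is assumed here), the script prints the planned commands when run as is. Setting DRY_RUN=0 executes them:

  DRY_RUN=0 JOB=aatest ROUNDS=6 sh crawl.sh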

For more complex jobs you have to edit the job configuration file.

SEE ALSO

combineINIT, combineCtrl

Combine configuration documentation in /usr/share/doc/combine/.

AUTHOR

Anders Ardö, <anders.ardo@it.lth.se>

COPYRIGHT AND LICENSE

Copyright (C) 2005 Anders Ardö

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.4 or, at your option, any later version of Perl 5 you may have available.

See the file LICENCE included in the distribution at http://combine.it.lth.se/
