
NAME

LCWA - Last Changes Web Agent

SYNOPSIS

lcwa c[rawl] [-D file] [-c num] [-a type] [-d spec] [-r pattern] [-i pattern] [-e pattern] [-b file] [-s file] [-l file] [-v level] [url ...]

lcwa q[uery] [-d file] [-f format] [timespec ...]

lcwa [-V]

DESCRIPTION

Lcwa is a web agent which determines the last-change times of documents in an Intranet's web cluster by quickly crawling through its web areas via HTTP. It was written with speed in mind: it uses a variable number of pre-forked crawling clients which work in parallel, coordinated by a server which implements a shared URL stack and a common result pool.

Each client first pops a URL off the stack, retrieves the document, determines its last-change time, and then sends this information back to the result pool. Additionally, if the fetched document is of MIME type text/html, the client parses out all anchor, image and frameset hyperlinks it contains and pushes these back onto the shared URL stack, so they can be processed next, either by itself or by the other clients running in parallel.
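The per-client loop described above can be sketched as follows. This is a minimal single-process illustration, not lcwa's actual Perl implementation: `fetch` is a hypothetical helper standing in for the HTTP retrieval, and the stack, seen-set and result pool are plain in-memory structures rather than the shared server-side ones.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects anchor, image and frameset hyperlinks, as described above."""
    WANTED = {"a": "href", "img": "src", "frame": "src"}

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attr = self.WANTED.get(tag)
        if attr:
            for name, value in attrs:
                if name == attr and value:
                    self.links.append(value)

def crawl_step(url, fetch, stack, seen, results):
    """One client iteration: record the document's last-change time,
    then push any newly discovered links back onto the shared stack."""
    mtime, mime, body = fetch(url)        # `fetch` is a hypothetical helper
    results[url] = mtime
    if mime == "text/html":
        extractor = LinkExtractor()
        extractor.feed(body)
        for link in extractor.links:
            absolute = urljoin(url, link)
            if absolute not in seen:      # the "brainfile" role: skip seen URLs
                seen.add(absolute)
                stack.append(absolute)
```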

The HTTP crawling is done in an optimized way: first, the request method (GET or HEAD) is determined from the URL and the document is fetched. If the method was HEAD but the MIME type turns out to be text/html, the document is requested again with method GET. Finally, when the MTime cannot be determined but the server is an Apache one and option -p is given, a third request is made to retrieve the information via a possibly installed mod_peephole.
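The first two steps of that optimization can be sketched like this. The URL-suffix heuristic for choosing the initial method is an assumption (the text only says the method is "determined from the URL"), and `request(method, url)` is a hypothetical HTTP helper returning a dict with at least a "mime" key.

```python
def fetch_with_fallback(url, request):
    """Try a cheap HEAD for file-like URLs; fall back to GET when the
    response turns out to be HTML and must be parsed for hyperlinks."""
    # Assumed heuristic: directory- and HTML-looking URLs get GET directly.
    method = "GET" if url.endswith(("/", ".html", ".phtml")) else "HEAD"
    response = request(method, url)
    # A HEAD response that turns out to be text/html must be re-requested
    # with GET, because the body is needed for link extraction.
    if method == "HEAD" and response["mime"] == "text/html":
        response = request("GET", url)
    return response
```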

OPTIONS

-D file

Sets the database file where the crawled information is stored. Its format is one entry per line, consisting of a URL followed by whitespace followed by the corresponding timestamp (or an error code when the timestamp couldn't be determined).

-c num

Starts num crawling clients. Default is num=1, i.e. only one client which determines the last-change times via HTTP crawling. Specifying more clients makes the crawling faster, but because they are created via pre-forking it also consumes more memory. Expect a RAM requirement of about 3 MB per client.

-a type

This sets the crawling algorithm: type=d means depth-first crawling, while type=w (the default) means width-first (breadth-first) crawling.

-d spec

This sets the URL path depth restriction, i.e. when the path depth does not fall into spec, the URL is not crawled. By default there is no such restriction. Here are the accepted syntax variants for spec:

   spec  depth
         min max
   ----  --- ---
   N     N   N
   >N    N+1 oo
   <N    1   N-1
   N-M   N   M
   +N    X   X+N
   -N    X   X-N

To make clear what depth means, here is an example: the URL http://foo.bar.com/any/url/to/data?some&query is of depth 4, while http://foo.bar.com/any/url/to/data/?some&query is of depth 5. The rule is this: only the path part of a URL counts for the depth, and there each slash increases the depth, starting with depth=1 for the root path /.
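The depth rule above reduces to counting the slashes in the path part of the URL. A small sketch of that computation:

```python
from urllib.parse import urlparse

def url_depth(url):
    """Depth counts only the path part of the URL: one per slash,
    with the root path "/" being depth 1."""
    path = urlparse(url).path or "/"   # query and fragment are ignored
    return path.count("/")
```

For the two example URLs above this yields 4 and 5 respectively.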

-r pattern

This restricts the crawled URLs via regular expression patterns, i.e. only URLs which match ALL of these patterns are pushed back to the shared URL stack for further processing.

-i pattern

This restricts the crawled URLs via regular expression patterns, i.e. URLs which match AT LEAST ONE of these patterns are pushed back to the shared URL stack for further processing. If NONE matches, the URL is not accepted.

-e pattern

This restricts the crawled URLs via regular expression patterns, i.e. only URLs which match NONE of these patterns are pushed back to the shared URL stack for further processing. If ANY of them matches, the URL is not accepted.
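Taken together, the -r, -i and -e checks can be sketched as below. Treating the three option groups as independent conditions that must all pass is an assumption about their precedence; the manual does not spell out how they interact.

```python
import re

def url_accepted(url, restrict=(), include=(), exclude=()):
    """Combined -r/-i/-e acceptance check (assumed precedence)."""
    if restrict and not all(re.search(p, url) for p in restrict):
        return False   # -r: must match ALL restriction patterns
    if include and not any(re.search(p, url) for p in include):
        return False   # -i: must match AT LEAST ONE inclusion pattern
    if exclude and any(re.search(p, url) for p in exclude):
        return False   # -e: must match NONE of the exclusion patterns
    return True
```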

-b file

Sets the filename of the temporary URL brainfile which is needed while crawling to determine which URLs have already been seen.

-s file

Sets the filename of the temporary stack swapfile which is needed while crawling to swap out the in-core stack, avoiding too high memory consumption.

-l file

Sets the filename of the logfile which shows processing information of the crawling process. Only interesting for debugging.

-v level

This sets verbose mode to level, where some processing information is printed on the console.

-V

Displays the version string.

EXAMPLES

  $ lcwa -c40 \
         -r '^http://[a-zA-Z0-9._]+\.sdm\.de/' \
         -i '.*\.p?html$' -i '.*/$' \
         -d 4 \
       -D sww.times.db \
         http://sww.sdm.de/

AUTHOR

 Ralf S. Engelschall
 rse@engelschall.com
 www.engelschall.com