LCWA - Last Changes Web Agent
lcwa c[rawl] [-o file] [-c num] [-d spec] [-r pattern] [-i pattern] [-e pattern] [-b file] [-s file] [-l file] [-v level] [url ...]
lcwa q[uery] [-d file] [-f format] [timespec ...]
Lcwa is a web agent which determines the last-change time of documents in an intranet's web cluster by quickly crawling through its web areas via HTTP. It was written with speed in mind, so it uses a variable number of pre-forked crawling clients which work in parallel. They are coordinated via a server which implements a shared URL stack and a common result pool.
Each client first pops a URL off the stack, retrieves the document, determines its last-change time and then sends this information back to the result pool.
Additionally, if the currently fetched document is of MIME type text/html, the client parses out all anchor, image and frameset hyperlinks it contains and pushes them back onto the shared URL stack, so they can be processed next by this client or by the other clients running in parallel.
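The link-extraction step each client performs on text/html documents can be sketched as follows. This is a minimal Python illustration using the standard html.parser module; the class and function names are made up for this sketch, not lcwa's own:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects anchor, image and frameset hyperlinks, the three
    link kinds lcwa pushes back onto the shared URL stack."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])
        elif tag in ("img", "frame") and "src" in attrs:
            self.links.append(attrs["src"])

def extract_links(html_text):
    """Return all hyperlinks found in an HTML document."""
    parser = LinkExtractor()
    parser.feed(html_text)
    return parser.links
```

Each extracted link would then be resolved against the document's base URL and pushed onto the shared stack for the next free client.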
The HTTP crawling is done in an optimized way: first, the request method (GET or HEAD) is determined from the URL and the document is fetched. If the method was HEAD but the MIME type turns out to be text/html, the document is requested again with method GET. Finally, when the MTime still cannot be determined but the server is an Apache one, a third request is made to retrieve the information via a possibly installed mod_peephole, if option -p is given.
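The method-selection and re-request logic described above could be sketched like this. The URL heuristics in choose_method are illustrative assumptions (not lcwa's exact rules), and fetch is a hypothetical callback standing in for the HTTP transaction:

```python
def choose_method(url):
    """Guess the cheapest request method from the URL alone: URLs that
    look like HTML (and hence must be parsed for links) need GET right
    away, anything else can be probed with HEAD first.
    NOTE: these suffix heuristics are assumptions for this sketch."""
    path = url.split("?")[0]
    if path.endswith("/") or path.endswith(".html") or path.endswith(".phtml"):
        return "GET"
    return "HEAD"

def crawl_step(url, fetch):
    """One fetch cycle: if a HEAD response turns out to be text/html,
    the document is requested again with GET so its links can be
    parsed.  `fetch(method, url)` is a hypothetical callback returning
    (mime_type, body)."""
    method = choose_method(url)
    mime, body = fetch(method, url)
    if method == "HEAD" and mime == "text/html":
        method = "GET"
        mime, body = fetch(method, url)
    return method, mime, body
```

The payoff is that plain documents (images, archives, etc.) cost only a single cheap HEAD request, while HTML documents are fetched in full exactly once.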
Sets the database file where the crawled information is stored. Its format is one entry per line, consisting of a URL followed by whitespace followed by the corresponding timestamp (or an error code when the timestamp couldn't be determined).
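Reading such a database entry back is straightforward; a minimal sketch (the concrete URL and timestamp in the test comment are illustrative values, not taken from a real run):

```python
def parse_db_line(line):
    """Split one database entry into (url, value), where value is
    either a timestamp or an error code: the URL is everything up to
    the first whitespace, the rest is the value."""
    url, value = line.split(None, 1)
    return url, value.strip()
```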
Start num crawling clients. The default is num=2, i.e. only one client (besides the coordinating server process) which determines the last-change times via HTTP crawling. Specifying more clients makes the crawling faster, but since they are created via pre-forking it also consumes more memory. Expect a RAM requirement of about 3 MB per client.
This sets the crawling algorithm: type=d means depth-first crawling, while type=w means width-first (breadth-first) crawling, which is the default.
This sets the URL path-depth restriction, i.e. when the path depth does not fall within spec, the URL is not crawled. By default there is no such restriction. The accepted syntax variants for spec are:
spec   depth min   depth max
----   ---------   ---------
N      N           N
>N     N+1         oo
<N     1           N-1
N-M    N           M
+N     X           X+N
-N     X-N         X
To make clear what depth means, here is an example: the URL
http://foo.bar.com/any/url/to/data?some&query is of depth 4, while
http://foo.bar.com/any/url/to/data/?some&query is of depth 5. The rule is this: for the depth, only the path part of a URL counts, and each slash increases the depth, starting with depth=1 for the root path "/".
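The depth rule and the spec table above can be sketched as follows. Treating X as the depth of the start URL, and reading the -N form as the inclusive range X-N..X, are interpretations of the table, not confirmed behavior:

```python
from urllib.parse import urlsplit

INF = 10**9  # stands in for "oo" (unbounded) in the spec table

def url_depth(url):
    """Only the path part of a URL counts; each slash increases the
    depth, starting with depth=1 for the root path "/"."""
    path = urlsplit(url).path or "/"
    return path.count("/")

def depth_range(spec, current=1):
    """Translate a depth spec into an inclusive (min, max) pair;
    `current` is the depth X used by the relative +N/-N forms."""
    if spec.startswith(">"):
        return int(spec[1:]) + 1, INF
    if spec.startswith("<"):
        return 1, int(spec[1:]) - 1
    if spec.startswith("+"):
        return current, current + int(spec[1:])
    if spec.startswith("-"):
        return current - int(spec[1:]), current
    if "-" in spec:                      # N-M form (digit first)
        lo, hi = spec.split("-")
        return int(lo), int(hi)
    return int(spec), int(spec)          # plain N
```

A URL would then be crawled only when `lo <= url_depth(url) <= hi` for the (lo, hi) pair returned by depth_range.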
This restricts the crawled URLs by a regular expression pattern, i.e. only URLs which match ALL of these patterns are pushed back onto the shared URL stack for further processing.
This restricts the crawled URLs by a regular expression pattern, i.e. URLs which match AT LEAST ONE of these patterns are pushed back onto the shared URL stack for further processing. If NONE matches, the URL is not accepted.
This restricts the crawled URLs by a regular expression pattern, i.e. URLs which match ANY of these patterns are NOT pushed back onto the shared URL stack for further processing. If ANY pattern matches, the URL is not accepted.
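How the three pattern classes could combine can be sketched as follows. The precedence shown here (exclude first, then restrict, then include) is an assumption drawn from the descriptions above, not documented lcwa behavior:

```python
import re

def accept_url(url, restrict=(), include=(), exclude=()):
    """Decide whether a URL is pushed back onto the shared URL stack:
    -e: if ANY exclude pattern matches, the URL is rejected;
    -r: the URL must match ALL restrict patterns;
    -i: if include patterns are given, AT LEAST ONE must match."""
    if any(re.search(p, url) for p in exclude):
        return False
    if not all(re.search(p, url) for p in restrict):
        return False
    if include and not any(re.search(p, url) for p in include):
        return False
    return True
```

With the patterns from the example below, this would keep HTML documents and directory URLs on the sdm.de hosts and drop everything else.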
Sets the filename of the temporary URL brainfile, which is needed while crawling to determine which URLs have already been seen.
Sets the filename of the temporary stack swapfile, which is needed while crawling to swap out the in-core stack and avoid too large a memory consumption.
Sets the filename of the logfile which shows processing information of the crawling process. Only interesting for debugging.
This sets verbose mode to level, where some processing information is written to the console.
Displays the version string.
$ lcwa -c40 \
       -r '^http://[a-zA-Z0-9._]+\.sdm\.de/' \
       -i '.*\.p?html$' -i '.*/$' \
       -d 4 \
       -o sww.times.db \
       http://sww.sdm.de/
Ralf S. Engelschall
email@example.com
www.engelschall.com