
NAME

head-r – Recurse through Web pages and issue HEAD requests

ABSTRACT

Head-r is a free Perl program that recursively follows links found on (HTML) Web pages hosted on an HTTP server, and issues HEAD requests for the links of interest to the user.

The intended use for this program is to create URI lists for later selective mirroring of file-hosting sites.

SYNOPSIS

    head-r [-v|--verbose] [-j|--bzip2|-z|--gzip]
        [--include-re=RE] [--exclude-re=RE]
        [--depth=N] [--info-re=RE] [--descend-re=RE]
        [-i|--input=FILE]... [-o|--output=FILE]
        [-P|--no-proxy] [-U|--user-agent=USER-AGENT]
        [-w|--wait=DELAY]
        [--] [URI]...

BASIC USAGE

Arguably, the most important Head-r options are --info-re= and --descend-re=, which determine (by means of regular expressions) which URIs will be considered for mere HEAD requests, and which ones Head-r will try to get more URIs from.

Simplistic, no-recursion example

For the following example, we’ll use . – a regular expression that matches any non-empty string – to allow Head-r to make HEAD requests to both of the URIs given.

    $ head-r --info-re=. \
          -- http://example.org/ http://example.net/ 
    http://example.org/ 1381334900      1       1270    200
    http://example.net/ 1381334903      1       1270    200

The fields are delimited with ASCII HT (also known as TAB) codes, and are as follows:

  1. URI;

  2. timestamp (in seconds since the system-dependent epoch; see also Unix time);

  3. recursion depth used when considering this URI;

  4. the length of the response in octets (as per the Content-Length: HTTP reply header);

  5. HTTP status code of the reply.
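
For post-processing such output outside of Head-r, the field layout above could be parsed roughly as follows. This is only a sketch in Python, not part of Head-r itself: the Record type and the parse_record function are hypothetical names, and the sketch assumes that a line holds either all five TAB-delimited fields or the URI alone, with an empty field meaning the value was not reported.

```python
from typing import NamedTuple, Optional


class Record(NamedTuple):
    """One line of Head-r output (hypothetical helper type)."""
    uri: str
    timestamp: Optional[int] = None  # seconds since the epoch
    depth: Optional[int] = None      # recursion depth for this URI
    length: Optional[int] = None     # Content-Length, in octets
    status: Optional[int] = None     # HTTP status code


def _to_int(text: str) -> Optional[int]:
    """Convert a field to int; an empty field becomes None."""
    text = text.strip()
    return int(text) if text else None


def parse_record(line: str) -> Record:
    """Split a TAB-delimited line into a Record.

    Assumption: a line carries either the URI alone or the URI
    followed by four numeric fields; missing trailing fields are
    treated as not reported.
    """
    fields = line.rstrip("\n").split("\t")
    uri = fields[0].strip()
    if len(fields) == 1:
        return Record(uri)
    rest = (fields[1:5] + [""] * 4)[:4]  # pad to four fields
    return Record(uri, *(_to_int(f) for f in rest))
```

For instance, parse_record applied to the first output line above would yield a Record whose status is 200 and whose length is 1270.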

Recurse once example

For the following example, we’ll also enable actual recursion (still at a maximum depth of 1), by using the --descend-re=/\$ option.

    $ head-r --info-re=. --descend-re=/\$ \
          -- http://example.org/ http://example.net/ 
    http://example.org/ 1381337824      1       1270    200
    http://www.iana.org/domains/example 1381337829      0       200
    http://example.net/ 1381337830      1       1270    200

As can be seen, at http://example.org/ Head-r found another URI to consider: http://www.iana.org/domains/example, which it followed and issued a HEAD request for.

It’s easy to check that http://example.net/ also references the same URI. However, as Head-r remembers the URIs it processes (along with the recursion depth at that point), no other request was issued.

Limiting HEAD requests

Consider now that the resource we’re to recurse through references URIs that are of no interest to us. For the following example, we’ll use a more selective regular expression than the . we’ve used above.

    $ head-r --{info,descend}-re=wikipedia\\.org/wiki/ \
          -- http://en.wikipedia.org/wiki/Main_Page 
    http://en.wikipedia.org/wiki/Main_Page      1381339589      1       61499   200
    . . .
    http://en.wikipedia.org/w/api.php?action=rsd
    http://creativecommons.org/licenses/by-sa/3.0/
    . . .
    http://meta.wikimedia.org/
    http://en.wikipedia.org/wiki/Wikipedia      1381339589      0       609859  200
    http://en.wikipedia.org/wiki/Free_content   1381339589      0       124407  200
    . . .

(Please note that we’ve just used the Bash {,} expansion to pass the same regular expression to both --info-re= and --descend-re=. Be sure to adjust to the command line interpreter actually in use.)

In the output above, a number of URIs came without any of the usual information. These URIs were found by Head-r, but as they matched neither the “info” (--info-re=) nor the “descend” (--descend-re=) regular expression specified, no action was taken on them. The URIs are still output, however, in case we decide to adjust the regular expressions themselves.
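
The way the two regular expressions sort the URIs found can be illustrated with the following Python sketch. It is not Head-r’s actual code: planned_actions is a hypothetical name, the two expressions are assumed to be applied independently as unanchored matches, an option that is not given is assumed to match nothing, and the recursion depth limit is ignored altogether.

```python
import re
from typing import Optional, Tuple


def planned_actions(uri: str,
                    info_re: Optional[str] = None,
                    descend_re: Optional[str] = None) -> Tuple[bool, bool]:
    """Return (head, descend) for a URI.

    head    - the URI matches the "info" expression, so a HEAD
              request would be issued for it;
    descend - the URI matches the "descend" expression, so it would
              be retrieved and searched for more URIs.
    Assumption: an expression that is not given matches nothing.
    """
    head = info_re is not None and re.search(info_re, uri) is not None
    descend = (descend_re is not None
               and re.search(descend_re, uri) is not None)
    return head, descend
```

With wikipedia\.org/wiki/ passed for both expressions, http://meta.wikimedia.org/ yields (False, False) in this sketch, which corresponds to the URIs listed above without any further fields.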

Skipping unwanted URIs altogether

The --include-re= and --exclude-re= regular expressions are considered before all the other ones, and currently have the following semantics:

  1. the inclusion regular expression is applied first; the URI will be considered if it matches one;

  2. unless decided at the step above, the exclusion regular expression is then applied; the URI will not be considered if it matches one;

  3. unless decided by the rules above, the URI will be considered.

If none of these options are given, any URI will be considered by Head-r.
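
The three rules above can be sketched in Python as follows. The is_considered function is a hypothetical illustration rather than Head-r’s own code, and it assumes the regular expressions are applied as unanchored matches.

```python
import re
from typing import Optional


def is_considered(uri: str,
                  include_re: Optional[str] = None,
                  exclude_re: Optional[str] = None) -> bool:
    """Decide whether a URI is considered at all."""
    # Rule 1: a match against the inclusion expression settles it.
    if include_re is not None and re.search(include_re, uri):
        return True
    # Rule 2: otherwise, a match against the exclusion expression
    # rejects the URI.
    if exclude_re is not None and re.search(exclude_re, uri):
        return False
    # Rule 3: anything left undecided is considered.
    return True
```

Note that, under these rules, an exclusion expression of . rejects every URI that the inclusion expression has not already accepted.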

The following example exploits these options to further limit the output of Head-r for the case above.

    $ head-r --{include,descend}-re=wikipedia\\.org/wiki/ \
          --{info,exclude}-re=. \
          -- http://en.wikipedia.org/wiki/Main_Page 
    http://en.wikipedia.org/wiki/Main_Page      1381341336      1       61499   200
    http://en.wikipedia.org/wiki/Wikipedia      1381341337      0       609859  200
    http://en.wikipedia.org/wiki/Free_content   1381341337      0       124407  200
    http://en.wikipedia.org/wiki/Encyclopedia   1381341337      0       151164  200
    http://en.wikipedia.org/wiki/Wikipedia:Introduction 1381341337      0       50687   200
    . . .

SAVING STATE BETWEEN SESSIONS

Head-r is capable of reading its own output, so as to avoid issuing duplicate HEAD requests, and also to discover the URIs of the resources to recurse into.

Restoring what was saved

Let us revisit one of our previous examples, which we’ll now alter to only issue a HEAD request to a couple of pages:

    $ head-r --output=state.a \
          --info-re='/(Free_content|Wikipedia)$' \
          --descend-re=wikipedia\\.org/wiki/ \
          -- http://en.wikipedia.org/wiki/Main_Page 
    $ grep -E \\s < state.a 
    http://en.wikipedia.org/wiki/Main_Page      1381417546      1       61499   200
    http://en.wikipedia.org/wiki/Wikipedia      1381417546      0       609859  200
    http://en.wikipedia.org/wiki/Free_content   1381417546      0       124407  200
    $ 

Now, why not include a few more pages, such as all the pages with names starting with F?

    $ head-r \
          --input=state.a --output=state.b \
          --info-re=/wiki/F \
          --descend-re=wikipedia\\.org/wiki/ 
    $ grep -E \\s < state.b 
    http://en.wikipedia.org/wiki/File:Diary_of_a_Nobody_first.jpg       1381417906      0       34344   200
    http://en.wikipedia.org/wiki/File:Progradungula_otwayensis_cropped.png      1381417906      0       30604   200
    http://en.wikipedia.org/wiki/File:AW_TW_PS.jpg      1381417907      0       33297   200
    http://en.wikipedia.org/wiki/Fran%C3%A7ois_Englert  1381417907      0       87860   200
    http://en.wikipedia.org/wiki/File:Washington_Monument_Dusk_Jan_2006.jpg     1381417907      0       83137   200
    http://en.wikipedia.org/wiki/File:Walt_Disney_Concert_Hall,_LA,_CA,_jjron_22.03.2012.jpg    1381417907      0       67225   200
    http://en.wikipedia.org/wiki/Frank_Gehry    1381417907      0       152838  200
    $ 

Note that although our --info-re= obviously matches http://en.wikipedia.org/wiki/Free_content, no HEAD request was made for the page, as our --input=state.a file already had the relevant information.

Also, as all the URIs we wanted Head-r to consider were already listed in state.a, it was unnecessary to specify any URIs on the command line. When the URIs come from both command line arguments and --input= files, those coming from the command line are considered first.
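
The effect of a saved state on HEAD requests can be pictured with the Python sketch below. It only models the behavior described in this section and is not Head-r’s implementation: uris_with_info and needs_head are hypothetical names, and a URI is assumed to be “known” when its line in the state file carries any fields beyond the URI itself.

```python
import re
from typing import Iterable, Set


def uris_with_info(lines: Iterable[str]) -> Set[str]:
    """Collect the URIs whose state lines carry HEAD information."""
    known = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # A line holding the URI alone has no information recorded.
        if len(fields) > 1 and fields[0]:
            known.add(fields[0])
    return known


def needs_head(uri: str, info_re: str, known: Set[str]) -> bool:
    """True if the URI matches info_re and is not yet in the state."""
    return re.search(info_re, uri) is not None and uri not in known
```

In this model, http://en.wikipedia.org/wiki/Free_content is skipped on the second run because state.a already lists it with its fields, while http://en.wikipedia.org/wiki/Frank_Gehry still needs a request.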

Compression

As recursing through large Web sites may result in large output lists, Head-r provides support for compression of output data.

The --bzip2 (-j) and --gzip (-z) options select the compression method to use for the output (either the file specified with --output=, or standard output). Head-r will, however, exit with an error if compression is enabled and the output goes to a terminal device.

Head-r transparently decompresses the files given as inputs (--input=), thanks to the IO::Uncompress::AnyUncompress library.
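
Third-party tools that read such files may want similar transparency. The Python sketch below is one way to get it, using a swapped-in technique rather than IO::Uncompress::AnyUncompress: it inspects the leading bytes of the file for the gzip and bzip2 signatures. The open_state name and the UTF-8 encoding are assumptions made for the sake of illustration.

```python
import bz2
import gzip
from typing import IO


def open_state(path: str) -> IO[str]:
    """Open a possibly compressed state file for reading as text."""
    # Peek at the first bytes to recognize the compression format.
    with open(path, "rb") as probe:
        magic = probe.read(3)
    if magic[:2] == b"\x1f\x8b":  # gzip signature
        return gzip.open(path, "rt", encoding="utf-8")
    if magic == b"BZh":           # bzip2 signature
        return bz2.open(path, "rt", encoding="utf-8")
    # Anything else is assumed to be plain text.
    return open(path, "r", encoding="utf-8")
```

The returned object can then be iterated over line by line, regardless of whether the file was written with --gzip, --bzip2, or neither.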

ADJUSTING HTTP CLIENT BEHAVIOR

There are two options which influence the behavior of the HTTP client used by Head-r: --wait= (-w) and --user-agent= (-U).

The --wait= option specifies the amount of time, in seconds, to wait between two consecutive HTTP requests. The default is about 2.7 seconds.

The --user-agent= option specifies the value for the User-Agent: header to use in HTTP requests, and may come in handy should the target server block access based on this header’s data. The default is composed of the string HEAD-R-Bot/, Head-r’s own version, and the identity of the libwww-perl library used. For example: HEAD-R-Bot/0.1 libwww-perl/6.05.

BUGS

Please consider reporting any bugs in the Head-r software not listed below via the CPAN RT, https://rt.cpan.org/Public/Dist/Display.html?Name=head-r. The bugs in this documentation should be reported to the respective Wikibooks Talk page – or you may actually fix them yourself!

As with any other automatic retrieval tool, it is possible to abuse Head-r to cause excessive load on third-party servers. The user is advised to consider the network environment when using the tool, especially when lowering the --wait= setting or raising the maximum recursion --depth= beyond reasonable values.

There’s currently no way to disable the /robots.txt file processing.

The code only tries to retrieve URIs from content marked with text/html media type, even though it seems as if the support for application/xhtml+xml (and perhaps several other XML-based types, such as SVG) could be implemented rather easily.

The resource to retrieve URIs from is first loaded into memory in full, while it should be possible to process it on the fly.

The handling of recursion depths retrieved from --input= files may be somewhat unintuitive, and out of the user’s control. (Although it’s still possible to edit such files using third-party tools, such as AWK.)

The code implements a trivial work-around for the long-standing Net::HTTP bug #29468.

AVAILABILITY

The code can be downloaded from a Git repository, for example:

    $ git clone -- \
          http://am-1.org/~ivan/archives/git/head-r-2013.git/ head-r 

A Gitweb interface is available from http://am-1.org/~ivan/archives/git/gitweb.cgi?p=head-r-2013.git.

AUTHOR

Head-r is written by Ivan Shmakov.

Head-r is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This documentation is a free collaborative project going on at Wikibooks, and is available under the Creative Commons Attribution/Share-Alike License (CC BY-SA) version 3.0.