Andreas J. König > File-Rsync-Mirror-Recent > HOWTO.mirrorcpan

Download:
File-Rsync-Mirror-Recent-0.3.4.tar.bz2

HOW TO MIRROR CPAN

As of this writing, the single purpose of this distribution is to provide a backbone for CPAN.

The idea is to get a handful of backbone nodes that mirror directly from PAUSE and a second layer that mirrors from them, targeting a mirroring interval of 20 seconds.

The rsync daemon on PAUSE runs on port 8732 and you need a password to access it. There is a second rsync daemon running on the standard port 873, but its access policy is very restrictive in order to leave the bandwidth for the backbone.

If you have a username and password, you can mirror directly from PAUSE. If you don't and you maintain a public CPAN mirror, ask me for one. Otherwise pick your best CPAN mirror with an rsync server and try the same script below to get the quickest update cycle possible.

You can find a list of potential rsync servers at http://cpan.perl.org/SITES.html#RSYNC

The first thing you should prepare is the CPAN tree itself on your disk. Where you take it from does not matter much; take it from wherever you always have. The setup I suggest is to mirror authors/ and modules/ with this program.

The loop is something like this:

Short version:

    #!/bin/sh

    RSYNC_PASSWORD=secret
    export RSYNC_PASSWORD

    for t in modules authors ; do
      rrr-client --source rsync://andk@pause.perl.org:8732/PAUSE/$t/ --target /home/ftp/pub/PAUSE/$t/ --tmpdir /home/tmp &
    done

Or if the short version is not sufficient for some reason:

    $ENV{USER}="sandy"; # fill in your name
    $ENV{RSYNC_PASSWORD} = "secret"; # fill in your passwd

    use File::Rsync::Mirror::Recent;
    my @rrr = map {
        File::Rsync::Mirror::Recent->new
                (
                 localroot                  => "/home/ftp/pub/PAUSE/$_", # your local path
                 remote                     => "pause.perl.org::PAUSE/$_/RECENT.recent", # your upstream
                 max_files_per_connection   => 863,
                 tempdir                    => "/home/ftp/tmp", # optional tempdir to hide temporary files
                 ttl                        => 10,
                 rsync_options              =>
                 {
                  port              => 8732, # only for PAUSE
                  compress          => 1,
                  links             => 1,
                  times             => 1,
                  checksum          => 0,
                  'omit-dir-times'  => 1, # not available before rsync 3.0.3
                  timeout           => 300, # do not allow rsync to hang forever
                 },
                 verbose                    => 1,
                 verboselog                 => "/var/log/rmirror-pause.log",
                )} "authors", "modules";
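    # check the local paths themselves so the error message names the directory, not the object
    die "directory $_ doesn't exist, giving up" for grep { ! -d $_ } map { $_->localroot } @rrr;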
    while (){
        my $ttgo = time + 1200; # pick less if you have a password and/or consent from the upstream
        for my $rrr (@rrr){
            $rrr->rmirror ( "skip-deletes" => 1 );
        }
        my $sleep = $ttgo - time;
        if ($sleep >= 1) {
            print STDERR "sleeping $sleep ... ";
            sleep $sleep;
        }
    }

Don't forget to fill in your user name and password.

You see the 'skip-deletes' option and can guess that it means mirroring without doing deletes. You can mirror with deletes, or you can leave the deletion to the other, traditional rsync loop. Worry about that later.
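If you leave the deletes to a traditional rsync run, that pass could look roughly like the sketch below. The upstream URL (mirror.example.org and its CPAN module name) is a placeholder, and the choice of flags (-a, --delete, --timeout) is an assumption about a sensible default rather than a recommendation from this distribution. The script only prints the commands, so you can review them before running anything that deletes files.

```shell
#!/bin/sh
# Hypothetical upstream; substitute the rsync server you actually use.
UPSTREAM="rsync://mirror.example.org/CPAN"
# Local tree, same path as in the examples above.
TARGET="/home/ftp/pub/PAUSE"

# delete_pass_cmd SUBTREE
# Prints (does not run) an rsync command that syncs one subtree
# and removes local files that no longer exist upstream.
delete_pass_cmd() {
    echo "rsync -a --delete --timeout=300 $UPSTREAM/$1/ $TARGET/$1/"
}

for t in authors modules; do
    delete_pass_cmd "$t"
done
```

Once the printed commands look right for your setup, you can pipe the output to sh and run it at whatever slower cadence suits you.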

You see the option 'max_files_per_connection', which I filled with 863; you need to find your own favorite number. The larger the number, the faster the whole download goes. As you already have a mirror, you probably want to take 10 or 20 times that much. CPAN's authors directory consists at the moment (2010-10) of 180000 files. This option splits the download into chunks, and between chunks the process takes a peek at the most recent files to download them with higher preference. That way we get the newest files in sync immediately while we are mirroring the long tail of old files.
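To get a feeling for the numbers, here is a small back-of-the-envelope sketch. It assumes the worst case, where every one of the 180000 files has to be transferred, and simply divides by the chunk size, rounding up. The function name connections_needed is made up for this illustration and is not part of the distribution.

```shell
#!/bin/sh
# connections_needed TOTAL_FILES FILES_PER_CONNECTION
# Prints how many chunks are needed to cover TOTAL_FILES (ceiling division).
connections_needed() {
    total=$1
    per_conn=$2
    echo $(( (total + per_conn - 1) / per_conn ))
}

connections_needed 180000 863      # value used in the script above: 209
connections_needed 180000 8630     # 10 times that: 21
connections_needed 180000 17260    # 20 times that: 11
```

Fewer chunks means fewer chances to peek at the most recent files, so the number trades raw throughput against freshness.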

The localroot parameter contains your target directory. Because only the authors/ and modules/ directories are hosted by PAUSE, you need to loop over two directories. The other directories of CPAN are currently not available for downloading with File::Rsync::Mirror::Recent.

The 'port' option is only needed for users who have a password for PAUSE; other rsync servers listen on the standard rsync port, so the option can be omitted.

The timeslice in the while loop above needs to be large enough to let the rsync server survive. If you choose a random rsync server and are not an rsync server yourself, please be modest and choose 1200 seconds. Choose less if you're offering rsync yourself and have a fat pipe, and especially if you know your upstream can manage it. If you have a PAUSE password, choose 20 seconds. We will watch how well this works and will adjust this recommendation according to our findings.

Set the key 'verbose' to 0 if you have no debugging demands. In this case you will also want to remove the 'sleeping $sleep ...' noise further down the script.
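The timeslice is a target for the whole cycle, not a pause added after each pass: the loop only sleeps for whatever is left of the slice once mirroring has finished. The following sketch shows just that arithmetic with made-up timestamps; the function name remaining_sleep is an invention for this illustration.

```shell
#!/bin/sh
# remaining_sleep START INTERVAL NOW
# Prints how many seconds to sleep so that the next pass starts
# INTERVAL seconds after START, or 0 if the pass already overran.
remaining_sleep() {
    start=$1
    interval=$2
    now=$3
    left=$(( start + interval - now ))
    if [ "$left" -ge 1 ]; then
        echo "$left"
    else
        echo 0
    fi
}

remaining_sleep 1000 1200 1090   # pass took 90s of a 1200s slice: 1110
remaining_sleep 1000 20 1045     # pass took 45s of a 20s slice: 0
```

With a 20 second slice, a pass that takes longer than the slice is followed by the next pass right away, which is one more reason to use it only with a PAUSE password or consent from your upstream.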

Set the key 'verboselog' to your favorite progress watcher file. This option is underdeveloped and may later be replaced with something better. Note that the program still sends error messages to STDERR.

You can leave everything else as it stands. Start experimenting and let me know how it works out.

-- andreas koenig
