NAME

RDF::Scutter - Perl extension for harvesting distributed RDF resources

SYNOPSIS

  use RDF::Scutter;
  use RDF::Redland;
  my $scutter = RDF::Scutter->new(scutterplan => ['http://www.kjetil.kjernsmo.net/foaf.rdf',
                                                  'http://my.opera.com/kjetilk/xml/foaf/'],
                                  from => 'scutterer@example.invalid');

  my $storage=new RDF::Redland::Storage("hashes", "rdfscutter", "new='yes',hash-type='bdb',dir='/tmp/',contexts='yes'");
  my $model = $scutter->scutter($storage, 30);
  my $serializer=new RDF::Redland::Serializer("ntriples");
  print $serializer->serialize_model_to_string(undef,$model);

DESCRIPTION

As the name implies, this is an RDF Scutter. A scutter is a web robot that follows seeAlso-links, retrieves the content it finds at those URLs, and adds the RDF statements it finds there to its own store of RDF statements.

This module is an alpha release of such a Scutter. It builds an RDF::Redland::Model and can add statements to any RDF::Redland::Storage that supports contexts. Redland storages include file, memory, Berkeley DB, MySQL, and others.
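
For quick, throwaway experiments, a contexts-enabled in-memory storage works too. This is only a sketch, following the same options pattern as the Berkeley DB storage in the Synopsis:

  use RDF::Redland;

  # In-memory hash storage with contexts enabled; nothing is written to disk.
  my $storage = new RDF::Redland::Storage("hashes", "rdfscutter",
                                          "new='yes',hash-type='memory',contexts='yes'");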

This class inherits from LWP::RobotUA, which is itself an LWP::UserAgent, so all methods of those classes are available.

Inheriting from LWP::RobotUA means it is a robot that behaves nicely by default: it checks robots.txt and sleeps between connections to make sure it doesn't overload remote servers.
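
For example, the inherited methods can be used to tune the robot's politeness after construction. This is only a sketch; the values below are illustrative, not defaults of this module:

  # Methods inherited from LWP::RobotUA and LWP::UserAgent (illustrative values):
  $scutter->delay(1);                  # wait at least 1 minute between requests to a host
  $scutter->timeout(30);               # give up on a single request after 30 seconds
  $scutter->agent("MyScutter/0.01");   # identify the robot in the User-Agent header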

It implements most of the ScutterVocab at http://rdfweb.org/topic/ScutterVocab.

CAUTION

This is an alpha release, and I haven't tested very thoroughly what it can do if left unsupervised, so you might want to be careful about finding out... The example in the Synopsis is a complete scutter, but one that will retrieve only 30 URLs before returning. You can test it by entering your own URLs (optional) and a valid email address (mandatory). It will count and report what it is doing.

METHODS

new(scutterplan => ARRAYREF, from => EMAILADDRESS, [skipregexp => REGEXP, any LWP::RobotUA parameters])

This is the constructor of the Scutter. You must initialise it with a scutterplan argument, an ARRAYREF of URLs pointing to RDF resources; the Scutter will start its traversal of the web there. You must also set a valid email address in the from argument, so that if your scutter runs amok, your victims will know whom to blame.

You may supply a skipregexp argument, containing a regular expression. If the regular expression matches the URL of a resource, the resource will be skipped.

Finally, you may supply any arguments that LWP::RobotUA and LWP::UserAgent accept.
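
For example, a constructor call combining all three kinds of arguments might look like this; the skipregexp pattern and the timeout value are purely illustrative:

  my $scutter = RDF::Scutter->new(
      scutterplan => ['http://www.kjetil.kjernsmo.net/foaf.rdf'],
      from        => 'scutterer@example.invalid',
      skipregexp  => '\.(jpg|png|gif)$',  # skip URLs that look like images
      timeout     => 30,                  # passed through to LWP::UserAgent
  );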

scutter(RDF::Redland::Storage [, MAXURLS]);

This method launches the Scutter. As its first argument, it takes an RDF::Redland::Storage object, which allows you to store your model any way Redland supports; see the Redland documentation for details. Optionally, it takes an integer as second argument, giving the maximum number of URLs to retrieve successfully. This provides some protection against a runaway robot.

It returns an RDF::Redland::Model containing all statements retrieved from the visited resources.
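
For instance, once scutter() returns, the model can be walked with the standard Redland stream interface. This sketch assumes the $scutter and $storage from the Synopsis:

  my $model  = $scutter->scutter($storage, 30);
  my $stream = $model->as_stream;
  while (!$stream->end) {
      print $stream->current->as_string, "\n";  # print each retrieved statement
      $stream->next;
  }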

BUGS/TODO

There are no known real bugs at the time of this writing, keeping in mind that this is an alpha release. If you find any, please use the CPAN Request Tracker to report them.

However, I have tried to add code that allows the robot to temporarily skip a resource that couldn't be visited at the time of the initial request (per robot guidelines) and revisit it later. This code is in there, but it is undocumented as I couldn't get it to work.

I'm slightly in over my head when trying to add the ScutterVocab statements. Time will tell whether I have understood it correctly.

Although it uses LWP::Debug for debugging, the author finds it difficult to judge the right amount of output from the module. Subsequent releases are likely to be quieter than the present one, however.

For an initial release, heeding robots.txt is actually pretty groundbreaking. However, a good robot should also make use of HTTP caching; the relevant keywords are ETag, Last-Modified, and Expires. This will be a focus of upcoming development, and many of these things are already being stated about the context in the RDF.
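
For illustration only (this is plain LWP, not something RDF::Scutter does yet), a conditional request using values cached from an earlier response could look roughly like this:

  use LWP::UserAgent;

  # Placeholders standing in for the ETag and Last-Modified values saved
  # from a previous response.
  my $cached_etag          = '"abc123"';
  my $cached_last_modified = 'Sat, 01 Jan 2005 00:00:00 GMT';

  my $ua  = LWP::UserAgent->new;
  my $res = $ua->get('http://www.kjetil.kjernsmo.net/foaf.rdf',
                     'If-None-Match'     => $cached_etag,
                     'If-Modified-Since' => $cached_last_modified);
  if ($res->code == 304) {
      # Not modified; the statements already in the store are still current.
  }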

It is not clear how long it would keep running, or how it would perform, if set to retrieve as much as it could. Currently it is a serial robot, but Perl modules exist for writing parallel robots. If a serial robot turns out to be too limited, this will need attention.

One of these days, it seems like I will have to make a full HTTP headers vocabulary...

SEE ALSO

RDF::Redland, LWP.

SUBVERSION REPOSITORY

This code is maintained in a Subversion repository. You may check out the trunk using e.g.

  svn checkout http://svn.kjernsmo.net/RDF-Scutter/trunk/ RDF-Scutter

AUTHOR

Kjetil Kjernsmo, <kjetilk@cpan.org>

ACKNOWLEDGEMENTS

Many thanks to Dave Beckett for writing the Redland framework and for helping when the author was confused, and to Dan Brickley for interesting discussions. Also thanks to the LWP authors for their excellent library.

COPYRIGHT AND LICENSE

Copyright (C) 2005 by Kjetil Kjernsmo

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.4 or, at your option, any later version of Perl 5 you may have available.