




WWW::RobotRules::Extended - database of robots.txt-derived permissions. This is a fork of WWW::RobotRules.

You should use WWW::RobotRules::Extended if you want to act like Googlebot: Google accepts some improvements over the original standard, such as "Allow" directives and wildcards ("*") in rules.


Version 0.02


A quick summary of what the module does:

  use WWW::RobotRules::Extended;
  use LWP::Simple qw(get);

  my $rules = WWW::RobotRules::Extended->new('MOMspider/1.0');

  my $url = "http://some.place/robots.txt";
  my $robots_txt = get $url;
  $rules->parse($url, $robots_txt) if defined $robots_txt;

  $url = "http://some.other.place/robots.txt";
  $robots_txt = get $url;
  $rules->parse($url, $robots_txt) if defined $robots_txt;

  # Now we can check if a URL is valid for those servers
  # whose "robots.txt" files we've gotten and parsed:
  if ($rules->allowed($url)) {
      my $c = get $url;
  }


This module parses /robots.txt files as specified in "A Standard for Robot Exclusion", at <http://www.robotstxt.org/wc/norobots.html>.

It also parses rules that contain wildcards ('*') and Allow directives, as Google does.

Webmasters can use the /robots.txt file to forbid conforming robots from accessing parts of their web site.

The parsed files are kept in a WWW::RobotRules::Extended object, and this object provides methods to check if access to a given URL is prohibited. The same WWW::RobotRules::Extended object can be used for one or more parsed /robots.txt files on any number of hosts.
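
As a rough sketch of that multi-host reuse (the host names, paths and robots.txt contents below are invented, and the contents are supplied inline rather than fetched over HTTP):

  use strict;
  use warnings;
  use WWW::RobotRules::Extended;

  my $rules = WWW::RobotRules::Extended->new('MOMspider/1.0');

  # Rules for two hypothetical hosts, parsed into the same object.
  $rules->parse("http://site-one.example.com/robots.txt",
                "User-agent: *\nDisallow: /private/\n");
  $rules->parse("http://site-two.example.com/robots.txt",
                "User-agent: *\nDisallow: /\n");

  # Each URL is checked against the rules recorded for its own host.
  print $rules->allowed("http://site-one.example.com/index.html")
      ? "site-one: allowed\n" : "site-one: blocked\n";   # expected: allowed
  print $rules->allowed("http://site-two.example.com/index.html")
      ? "site-two: allowed\n" : "site-two: blocked\n";   # expected: blocked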


This module does not export any functions; its interface is purely object-oriented.

new

This is the constructor for WWW::RobotRules::Extended objects. The first argument given to new() is the name of the robot.

parse

The parse() method takes as arguments the URL that was used to retrieve the /robots.txt file, and the contents of the file.


allowed

Returns TRUE if this robot is allowed to retrieve this URL.



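A minimal sketch tying the three methods together (the robot name, host and paths are invented; the record that names the robot is expected to take precedence over the "*" record, as in WWW::RobotRules):

  use strict;
  use warnings;
  use WWW::RobotRules::Extended;

  # The name passed to new() is what User-agent lines are matched against.
  my $rules = WWW::RobotRules::Extended->new('ExampleBot/1.0');

  my $robots_txt = join "\n",
      "User-agent: examplebot",
      "Disallow: /private/",
      "",
      "User-agent: *",
      "Disallow: /",
      "";

  $rules->parse("http://www.example.com/robots.txt", $robots_txt);

  # The record for "examplebot" applies, not the catch-all "*" record.
  print $rules->allowed("http://www.example.com/public/page.html")
      ? "public page: allowed\n"  : "public page: blocked\n";  # expected: allowed
  print $rules->allowed("http://www.example.com/private/data.html")
      ? "private page: allowed\n" : "private page: blocked\n"; # expected: blocked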



Yannick Simon, <yannick.simon at gmail.com>


Please report any bugs or feature requests to bug-www-robotrules-extended at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=WWW-RobotRules-Extended. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.


You can find documentation for this module with the perldoc command.

    perldoc WWW::RobotRules::Extended

You can also look for information at CPAN's request tracker (RT), AnnoCPAN, CPAN Ratings, and Search CPAN.


The following example "/robots.txt" file specifies that no robots should visit any URL starting with "/cyberworld/map/" or "/tmp/":

  User-agent: *
  Disallow: /cyberworld/map/ # This is an infinite virtual URL space
  Disallow: /tmp/ # these will soon disappear

This example "/robots.txt" file specifies that no robots should visit any URL starting with "/cyberworld/map/", except the robot called "cybermapper":

  User-agent: *
  Disallow: /cyberworld/map/ # This is an infinite virtual URL space

  # Cybermapper knows where to go.
  User-agent: cybermapper
  Disallow:

This example indicates that no robots should visit this site further:

  # go away
  User-agent: *
  Disallow: /

This is an example of a malformed robots.txt file.

  # robots.txt for ancientcastle.example.com
  # I've locked myself away.
  User-agent: *
  Disallow: /
  # The castle is your home now, so you can go anywhere you like.
  User-agent: Belle
  Disallow: /west-wing/ # except the west wing!
  # It's good to be the Prince...
  User-agent: Beast
  Disallow:

This file is missing the required blank lines between records. However, the intention is clear.

This is an example of an extended robots.txt file. You can see a real example of this kind of rule at http://www.google.com/robots.txt.

  # Block every url that contains &p=
  User-agent: *
  Disallow: /*&p=
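
Checked against such a file, allowed() should reject any URL whose path and query string match the wildcard pattern. A minimal sketch, with an invented host and query strings:

  use strict;
  use warnings;
  use WWW::RobotRules::Extended;

  my $rules = WWW::RobotRules::Extended->new('MOMspider/1.0');
  $rules->parse("http://www.example.com/robots.txt",
                "User-agent: *\nDisallow: /*&p=\n");

  # URLs whose path and query match "/*&p=" should be blocked.
  print $rules->allowed("http://www.example.com/list?sort=asc&p=2")
      ? "paged list: allowed\n" : "paged list: blocked\n";  # expected: blocked
  print $rules->allowed("http://www.example.com/list?sort=asc")
      ? "plain list: allowed\n" : "plain list: blocked\n";  # expected: allowed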

This is an example of an extended robots.txt file.

  # Block every url but the ones that begin with /shared
  User-agent: *
  Disallow: /
  Allow: /shared/
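
Here only URLs below /shared/ should be reported as allowed. A minimal sketch along the same lines, again with an invented host name:

  use strict;
  use warnings;
  use WWW::RobotRules::Extended;

  my $rules = WWW::RobotRules::Extended->new('MOMspider/1.0');
  $rules->parse("http://www.example.com/robots.txt",
                "User-agent: *\nDisallow: /\nAllow: /shared/\n");

  # "Allow: /shared/" carves an exception out of "Disallow: /".
  print $rules->allowed("http://www.example.com/shared/doc.html")
      ? "shared doc: allowed\n" : "shared doc: blocked\n";  # expected: allowed
  print $rules->allowed("http://www.example.com/private.html")
      ? "other page: allowed\n" : "other page: blocked\n";  # expected: blocked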


LWP::RobotUA, WWW::RobotRules::AnyDBM_File, WWW::RobotRules



  Copyright 2011, Yannick Simon
  Copyright 1995-2009, Gisle Aas
  Copyright 1995, Martijn Koster

This program is free software; you can redistribute it and/or modify it under the terms of either: the GNU General Public License as published by the Free Software Foundation; or the Artistic License.

See http://dev.perl.org/licenses/ for more information.
