NAME

WWW::Wikevent::Bot

SYNOPSIS

  use WWW::Wikevent::Bot;
  use HTML::TreeBuilder;
  use utf8;

  my $bot = WWW::Wikevent::Bot->new();
  $bot->name( 'HideoutBot' );
  $bot->url( 'http://www.hideoutchicago.com/schedule.html' );
  $bot->sample( 'sample.html' );
  $bot->encoding( 'utf8' );

  $bot->parser( sub {
      my ( $bot, $html ) = @_;
      
      # Use HTML::TreeBuilder and HTML::Element or, if you prefer,
      # HTML::TokeParser to parse the HTML down to whatever elements
      # contain events, then ...
      foreach my $container ( @event_containers ) {
          my $event = $bot->add_event();

          # build up the event using the methods of WWW::Wikevent::Event
      }

      # Figure out the next page to scrape (not needed if you are parsing
      # by month) and set

      $bot->url( $next_page_to_scrape );
  });
  
  $bot->scrape();
  $bot->upload();
  

DESCRIPTION

WWW::Wikevent::Bot is a package which helps you write scraper scripts that gather events from venue and artist websites for inclusion in Wikevent, the free-content events compendium.

The module takes care of the tedium of interaction with the website, and leaves to you the fun work of writing the scraper subroutine for the venue or artist you are interested in.

CONSTANTS

$SEEN_FILE

CONSTRUCTORS

new

Creates a new bot object.

ACCESSORS

name

  $bot->name( $bot_name );

The name of your bot.

This setting will be used to control where your bot will submit information about itself and the list of events it scrapes on each run.

events

  my @events = $bot->events();

or

  my $event_ref = $bot->events();

The list of events which this bot has scraped (so far).
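The two call forms above suggest that the accessor returns a list in list context and an array reference in scalar context. A minimal sketch of that pattern, using a hypothetical stand-in class My::Bot (this is an assumption about the behaviour, not the module's actual source):

```perl
use strict;
use warnings;

package My::Bot;    # hypothetical stand-in, not WWW::Wikevent::Bot itself

sub new { return bless { events => [] }, shift }

# Return the events as a list in list context,
# or as an array reference in scalar context.
sub events {
    my ($self) = @_;
    return wantarray ? @{ $self->{events} } : $self->{events};
}

package main;

my $bot = My::Bot->new;

my $events_ref = $bot->events;          # scalar context: array reference
push @$events_ref, 'first show', 'second show';

my @events = $bot->events;              # list context: a copy of the list
print scalar(@events), " events so far\n";
```

Note that in this sketch the reference form gives direct access to the underlying list, while the list form returns a copy.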

sample

  $bot->sample( 'somepage.html' );

A local file containing a sample page to scrape while you are building and debugging your parser subroutine.

charset

  $bot->charset( 'utf8' );

The charset of the target site/page.

Sometimes the charset is detected incorrectly, or is even declared incorrectly in venue and artist webpages. This accessor lets you override it.

encoding

An alias for charset, if you prefer.

url

  $bot->url( 'http://venue.com/schedule.html' );

The next URL to scrape.

Initially you should set this to the first page which your scraper bot should look at. Afterwards, if there are more pages to scrape, you'll set it again in your parser subroutine.

If the site you're scraping has calendar pages with elements of the date in the URL, you can put Date::Format placeholders in your URL string, as in:

  $bot->url( 'http://venue.com/calendar.html?year=%Y&month=%L' );

... and your bot will scrape ahead from the current month, whatever that is, for the number of months set in months. You can of course override this behaviour by specifying a new URL in the parser subroutine, but then you'll have to do all of the date calculation yourself.
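To illustrate how a date template can expand into one URL per month, here is a rough sketch. It is not the module's implementation: month_urls is a hypothetical helper, and it uses the core POSIX::strftime with %m (zero-padded month) as a stand-in for Date::Format and its %L placeholder, so the output formatting may differ from what the bot produces.

```perl
use strict;
use warnings;
use POSIX qw(strftime);
use Time::Local qw(timelocal);

# Hypothetical helper: expand a strftime-style URL template once per month,
# for $months consecutive months starting with the month containing $start.
sub month_urls {
    my ( $template, $months, $start ) = @_;
    $start = time unless defined $start;
    my ( $mon, $year ) = ( localtime $start )[ 4, 5 ];   # 0-based month, year - 1900
    my @urls;
    for my $offset ( 0 .. $months - 1 ) {
        my $m = $mon + $offset;
        # Use noon on the 1st of each month; roll the year over after December.
        push @urls,
            strftime( $template, 0, 0, 12, 1, $m % 12, $year + int( $m / 12 ) );
    }
    return @urls;
}

# Example: three months starting from mid-November 2005.
my $start = timelocal( 0, 0, 12, 15, 10, 2005 );
print "$_\n"
    for month_urls( 'http://venue.example/calendar.html?year=%Y&month=%m', 3, $start );
```

The example deliberately crosses a year boundary, which is the case most easily got wrong when doing the date calculation by hand.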

months

  $bot->months( $int );
  my $int = $bot->months();

The number of months to scrape if url is a Date::Format specification.

Defaults to 3.

user_dir

  my $dir = $bot->user_dir( $dir );

The directory to which your events will be dumped.

Normally this is set as a side-effect of setting the name accessor; however, it can optionally be set to something else after setting name.

user_page

  my $page = $bot->user_page( $page );

The page on which information about your bot is to be found.

Normally this is set as a side-effect of setting the name accessor; however, it can optionally be set to something else after setting name.

shows_page

  my $page = $bot->shows_page( $page );

The page to which events scraped by your bot will be uploaded.

Normally this is set as a side-effect of setting the name accessor; however, it can optionally be set to something else after setting name.

METHODS

add_event

  my $e = $bot->add_event();

Create a new event and return it.

This is a convenience method which creates a new event, adds it to the events list (see above) and returns a reference to it, which you may manipulate as necessary.

parse

  my @events = $bot->parse( $html );

or

  my $events_ref = $bot->parse( $html );

Run the user-supplied parser subroutine against the argument HTML and return any events found. This is used internally by scrape.

check_allowed

   $bot->check_allowed();

Check the user page of this bot to see if it is currently allowed to run. This will be indicated by the text:

  run = true

at the top of the page. If that text is present the method returns true; otherwise it dies with an error. This method is called internally by upload, so you don't have to call it, but you do have to make sure that the above text appears on the bot's user page.
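The check described above could be sketched roughly as follows. This is only an illustration under assumptions: page_allows_run is a hypothetical function operating on already-fetched page text, and "at the top of the page" is interpreted here as "the first non-whitespace text", which may be stricter or looser than what the module actually accepts.

```perl
use strict;
use warnings;

# Hypothetical check: return true if the page text starts with "run = true"
# (ignoring leading whitespace), otherwise die with an error.
sub page_allows_run {
    my ($page_text) = @_;
    return 1 if $page_text =~ /\A\s*run\s*=\s*true\b/;
    die "bot is not allowed to run: 'run = true' not found at top of user page\n";
}

# An allowed page passes; a disabled page raises an error we can trap.
print "allowed\n" if page_allows_run("run = true\n\nHideoutBot scrapes the Hideout schedule.\n");

my $ok = eval { page_allows_run("run = false\n"); 1 };
print "blocked: $@" unless $ok;
```

Dying instead of returning false means a caller that forgets to test the result still cannot go on to upload.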

scrape_sample

  $bot->scrape_sample();

Runs the parser against the supplied sample HTML page.

scrape

  $bot->scrape();

Starts scraping at the supplied url and continues as long as url changes.
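The "continues as long as url changes" behaviour can be pictured as a simple loop. The sketch below is an assumption about the control flow rather than the module's code: scrape_loop, the state hash and the fake in-memory site are all hypothetical, and each page is a hash instead of real HTML so that the example runs without network access.

```perl
use strict;
use warnings;

# Hypothetical loop: fetch the current URL, hand the page to the parser,
# and stop once the parser leaves the URL unchanged.
sub scrape_loop {
    my ( $start_url, $fetch, $parser ) = @_;
    my %state = ( url => $start_url, events => [] );
    my $last  = '';
    while ( defined $state{url} && $state{url} ne $last ) {
        $last = $state{url};
        $parser->( \%state, $fetch->($last) );
    }
    return @{ $state{events} };
}

# A fake two-page site; the last page points at itself, which ends the loop.
my %site = (
    '/shows?page=1' => { shows => ['Band A'], next => '/shows?page=2' },
    '/shows?page=2' => { shows => ['Band B'], next => '/shows?page=2' },
);

my @events = scrape_loop(
    '/shows?page=1',
    sub { $site{ $_[0] } },                        # stand-in for an HTTP fetch
    sub {
        my ( $state, $page ) = @_;
        push @{ $state->{events} }, @{ $page->{shows} };
        $state->{url} = $page->{next};             # set the next page to scrape
    },
);

print join( ', ', @events ), "\n";
```

In this sketch a parser that never sets a new URL ends the run after a single page.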

scrape_page

  $bot->scrape_page( $url );

Scrapes a single page of HTML found at the given URL. This method is called internally by scrape.

dump

  $bot->dump();

Dumps the contents of events as text to standard out.

remember

  $bot->remember( $event );

Records an md5sum of the given event so that it is not repeated when running dump_to_file.

load_remembered_events

  $bot->load_remembered_events();

Loads in the md5sums of previously remembered events. This is called internally by new so it's unlikely that you will need to call it.

is_new

  my $bool = $bot->is_new( $event );

Checks to see if the md5sum of an event is in our list of remembered events.
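The remember/is_new pair amounts to keeping a set of digests. Here is a minimal sketch of the idea using the core Digest::MD5 module. The function names mirror the methods above, but everything else is an assumption: events are represented as plain strings, the set lives in memory only, and is_new is taken to return true when the digest has not been seen.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use Encode qw(encode_utf8);

my %seen;    # hex digest => 1, for every remembered event

# Digest of an event's text; encode first so wide characters are safe.
sub event_digest {
    my ($event_text) = @_;
    return md5_hex( encode_utf8($event_text) );
}

# Record the event's digest so it is skipped on later runs.
sub remember {
    my ($event_text) = @_;
    $seen{ event_digest($event_text) } = 1;
    return;
}

# True if this event's digest has not been recorded yet.
sub is_new {
    my ($event_text) = @_;
    return exists $seen{ event_digest($event_text) } ? 0 : 1;
}

my $event = "2005-11-15 | The Hideout | Band A";
print "new\n" if is_new($event);
remember($event);
print "already seen\n" unless is_new($event);
```

Because the digest covers the whole event text, any change to an event (a corrected time, say) makes it count as new again in this sketch.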

dump_to_file

  $bot->dump_to_file();

Prints out the events in their final form to the appropriate .wiki file for upload to the bot's event page. This is called internally by upload but is also useful for the last stages of writing and debugging your bot.

upload

  $bot->upload();

This is the method which interacts with the Wikevent server, first checking to see if the bot is allowed to proceed, then doing an update, printing out the bot's events and then proceeding to do the upload.

BUGS

Please submit bug reports to the CPAN bug tracker at http://rt.cpan.org/NoAuth/Bugs.html?Dist=www-wikevent-bot.

DISCUSSION

Discussion should take place on the Wiki, probably on the page "Wikevent:Perl library" (http://wikevent.org/en/Wikevent:Perl_library).

AUTHORS

Mark Jaroski <mark@geekhive.net>

Original author, maintainer

LICENSE

Copyright (c) 2004-2005 Mark Jaroski.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
