Module Version: 0.02 (latest release: XML-RSS-FromHTML-0.06)

NAME

XML::RSS::FromHTML - simple framework for making RSS out of HTML

SYNOPSIS

  ### create your own sub-class, with these four methods
  package MyModule;
  use base 'XML::RSS::FromHTML';
  
  sub init {
      my $self = shift;
      # set your configurations here
      $self->name('MyRSS');
      $self->url('http://foo.com/headlines.html');
  }
  
  sub defineRSS {
      my $self = shift;
      my $xmlrss  = shift;
      # define your RSS using XML::RSS->channel method
      $xmlrss->channel(
          title => 'foo.com headlines feed',
          description => 'generated from http://foo.com headlines'
      );
  }
  
  sub makeItemList {
      my $self = shift;
      my $html = shift;
      # parse HTML and make an item list
      my @list;
      while ($html =~ m|<li><a href="(.+?)">(.+?)</a></li>|g){
          push(@list,{
              link  => $1,
              title => $2
          });
      }
      return \@list;
  }
  
  sub addNewItem {
      my $self = shift;
      my ($xmlrss,$eachItem) = @_;
      # make your item using XML::RSS->add_item method
      $xmlrss->add_item(
          title => $eachItem->{title},
          link  => $eachItem->{link},
          description => 'this is '. $eachItem->{title},
      );
  }
  
  #### and from your main routine...
  package main;
  use MyModule;
  my $rss = MyModule->new;
  $rss->update;
  # an updated RSS file './MyRSS.xml' will be created.
  # run this script every day, and your RSS will always 
  # be up-to-date.

DESCRIPTION

This module is a simple framework for periodically creating RSS out of HTML. Plenty of web sites still don't supply RSS feeds, and we think it would be nice if they did. This module helps you create RSS feeds for those sites by hand and keep their contents up to date. The core features are as follows:

It is mostly focused on not being an annoyance to the target URL/web site (and, of course, on developer-friendliness). We don't want to be seen as spammers, but it would be nice if we could show those sites the value of RSS feeds.

USAGE

BASIC

This module is not intended to work by itself. You need to create a sub-class of it and define these four methods, customized for your target URL/web site.

FOUR METHODS

init()

  sub init {
      my $self = shift;
      # set your configurations here
      $self->name('Test');
      $self->url('http://foo.com/headlines.html');
      $self->cacheDir('./cache');
      $self->feedDir('./feed');
      return 1;
  }

Called within the constructor, this method should initialize the property values of your choice. See the PROPERTIES section for a description of the available properties.

defineRSS()

Define your RSS feed's description and other channel information here, using the XML::RSS->channel method.

  sub defineRSS {
      my $self = shift;
      my $xmlrss = shift;
      # define your RSS using XML::RSS->channel method
      $xmlrss->channel(
          title => 'foo.com headlines feed',
          description => 'generated from http://foo.com headlines'
      );
      # you can also define images with XML::RSS->image method
      $xmlrss->image(
          title  => 'foo.com headlines feed',
          url    => 'http://mysite/image/logo.gif',
          link   => 'http://foo.com/headlines.html'
      );
      return 1;
  }

makeItemList()

Given the whole HTML string (supplied as an argument), use whatever means you like (e.g. a regexp) to create a data structure of items. Later on, you'll use this information to create feed items.

  sub makeItemList {
      my $self = shift;
      my $html = shift;
      # parse HTML and make an item list
      my @list;
      while ($html =~ m| .. some mumbling regexp here .. |g){
          push(@list,{
              link     => $1,
              title    => $2,
              category => $3,
              id       => $4,
              # ... plus any other fields you need
          });
      }
      return \@list;
  }
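
As a concrete illustration of the data structure described above, here is a minimal standalone sketch that does not depend on the framework. The sample HTML, the make_item_list name and the idea of taking the item id from the numeric file name are all assumptions made for this example, not part of the module.

```perl
use strict;
use warnings;

# Hypothetical sample page; a real one would be fetched from $self->url.
my $sample_html = <<'HTML';
<ul>
<li><a href="http://foo.com/archives/1.html">First headline</a></li>
<li><a href="http://foo.com/archives/2.html">Second headline</a></li>
</ul>
HTML

# Standalone equivalent of makeItemList: returns an array reference of
# hash references, one per matched headline.
sub make_item_list {
    my ($html) = @_;
    my @list;
    while ($html =~ m|<li><a href="([^"]+)">([^<]+)</a></li>|g) {
        my ($link, $title) = ($1, $2);
        # Assumption: the numeric file name doubles as an item id.
        my ($id) = $link =~ m|/(\d+)\.html$|;
        push @list, { link => $link, title => $title, id => $id };
    }
    return \@list;
}

my $items = make_item_list($sample_html);
printf "%s: %s\n", $_->{id}, $_->{title} for @$items;
```

Copying $1 and $2 into named variables before running a second match matters here, because the inner match would otherwise overwrite the capture variables.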

addNewItem()

The framework checks the list created by the method above (makeItemList) for updates, and calls this method once for each new item. The argument $eachItem is therefore the current element of the @list created by $self->makeItemList. Use the XML::RSS->add_item method to add the new item to the RSS feed. You can also fetch additional information about the item, for example from its description page, and add that to the feed too.

  sub addNewItem {
      my $self = shift;
      my ($xmlrss,$eachItem) = @_;
      # fetch additional information if you want to
      # (require does not import get(), so call it by its full name)
      require LWP::Simple;
      my $html = LWP::Simple::get("http://foo.com/archives/$eachItem->{id}.html");
      # get() returns undef on failure, so fall back to an empty string
      $html = '' unless defined $html;
      my ($desc) = ($html =~ m|<p class="desc">(.+?)</p>|);
      # make your rss item using XML::RSS->add_item method
      $xmlrss->add_item(
          title => $eachItem->{title},
          link  => $eachItem->{link},
          category => $eachItem->{category},
          description => $desc,
      );
      return 1;
  }
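
This documentation does not say how the framework decides that an item is new. Purely to illustrate the idea of calling addNewItem only for unseen items, the sketch below keeps a hash of links that have already been seen. The find_new_items name and the choice of the link as the key are assumptions for this example and may differ from what the module actually does.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the items whose link has not been seen yet,
# and records them in %$seen so that the next call skips them.
sub find_new_items {
    my ($list, $seen) = @_;
    my @new;
    for my $item (@$list) {
        # Post-increment is false only the first time a link is met.
        next if $seen->{ $item->{link} }++;
        push @new, $item;
    }
    return \@new;
}

my %seen;
my $first  = find_new_items([ { link => 'a' }, { link => 'b' } ], \%seen);
my $second = find_new_items([ { link => 'b' }, { link => 'c' } ], \%seen);
print scalar(@$first), " new, then ", scalar(@$second), " new\n";
```

In a real run the set of seen links would have to survive between invocations, for instance in a file under the cache directory.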

HOW TO USE

Basically, all you need to do is load your sub-class module, create a new instance, and call the update method. The update method returns a boolean value: true when the feed was updated with new items, false otherwise.

The $self->updateStatus method then gives you a status message describing what happened.

  use MyModule;
  my $rss = MyModule->new;
  my $hasNewItem = $rss->update;
  if($hasNewItem){
    print "RSS updated with some new items\n";
  }else{
    # i.e. "still under check interval time period"
    print $rss->updateStatus, "\n";
  }

PROPERTIES

These are all the properties available for configuration within $self->init method.

OTHER USEFUL PROPERTIES

updateStatus

As described above (section HOW TO USE), this property contains some helpful message about the update sequence. Currently there are:

newItems

An array reference to all the items that were counted as new. Sometimes useful after calling the $self->update method.

  $rss->update;
  print "there were " . scalar(@{ $rss->newItems }) . " new items.\n";
  foreach (@{ $rss->newItems }){
      print "title: $_->{title}\n";
  }

OTHER USEFUL METHODS

as_string()

Will return RSS feed as XML string.

as_object()

Will return XML::RSS object of the current RSS feed.

getDateTime()

Will return the current date and time as an RFC 1123 style GMT ASCII string, like this:

  Sun, 06 Nov 1994 08:49:37 GMT

Useful for date/time related elements within an RSS feed (e.g. pubDate). If you pass a date-time string as an argument, it will try its best to parse that string and return it in the same GMT ASCII format.

  print $self->getDateTime('19940203T141529Z');
  # will print 'Thu, 03 Feb 1994 14:15:29 GMT'

It uses HTTP::Date internally, so see the documentation of HTTP::Date's parse_date() function for the formats it can parse.
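
To make the output format concrete, here is a rough sketch of the same kind of conversion using only core Perl modules. It is not how the module does it (the module relies on HTTP::Date); the rfc1123_date and parse_compact_iso names are hypothetical, and the parser handles only the single compact format shown above.

```perl
use strict;
use warnings;
use Time::Local qw(timegm);

my @DAYS   = qw(Sun Mon Tue Wed Thu Fri Sat);
my @MONTHS = qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec);

# Format an epoch time as an RFC 1123 date in GMT.
sub rfc1123_date {
    my ($epoch) = @_;
    my ($sec, $min, $hour, $mday, $mon, $year, $wday) = gmtime($epoch);
    return sprintf '%s, %02d %s %04d %02d:%02d:%02d GMT',
        $DAYS[$wday], $mday, $MONTHS[$mon], $year + 1900, $hour, $min, $sec;
}

# Parse a compact ISO 8601 UTC timestamp such as '19940203T141529Z'.
# Returns undef for anything else.
sub parse_compact_iso {
    my ($str) = @_;
    my ($y, $mo, $d, $h, $mi, $s) =
        $str =~ /^(\d{4})(\d\d)(\d\d)T(\d\d)(\d\d)(\d\d)Z$/
        or return;
    # timegm expects a zero-based month.
    return timegm($s, $mi, $h, $d, $mo - 1, $y);
}

print rfc1123_date(parse_compact_iso('19940203T141529Z')), "\n";
```

The day and month names are spelled out in arrays rather than taken from strftime so that the result does not depend on the current locale.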

TIPS

RETRIEVING HTML FROM SESSION REQUIRED WEB SITE

Some web sites require a valid session id, in a browser cookie or in the query string, before they will serve their contents. The session id is usually given to you the first time you visit their top page or, of course, when you go through the login process.

If you need to retrieve HTML from pages that require such a session id, override the $self->getHTML method with your own version. For example, assuming a web site that hands out session ids when you access its top.cgi page, the getHTML method would look like this:

  sub getHTML {
      my $self = shift;
      my $url = shift;
      require LWP::UserAgent;
      my $ua = LWP::UserAgent->new;
      # keep cookies in a per-feed file inside the cache directory
      $ua->cookie_jar({ file => $self->cacheDir.'/'.$self->name.'.cookie' });
      $ua->get('http://foo.com/top.cgi'); # set session-id in cookie
      my $res = $ua->get($url); # send with session-id cookie
      return $res->content;
  }

BUGS

Nothing that I'm aware of, yet.

AUTHOR

  Toshimasa Ishibashi
  CPAN ID: BASHI
  iandeth99@ybb.ne.jp
  http://iandeth.dyndns.org/mt/ian/

COPYRIGHT

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

The full text of the license can be found in the LICENSE file included with this module.

SEE ALSO

perl(1), XML::RSS
