The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.
NAME
    WWW::Blog::Metadata - Extract common metadata from weblogs

SYNOPSIS
        use WWW::Blog::Metadata;
        use Data::Dumper;
        my $uri;
        my $meta = WWW::Blog::Metadata->extract_from_uri($uri)
            or die WWW::Blog::Metadata->errstr;
        print Dumper $meta;

DESCRIPTION
    *WWW::Blog::Metadata* extracts common metadata from weblogs: syndication
    feed URIs, FOAF URIs, locative information, etc. Some benefits of using
    *WWW::Blog::Metadata*:

    *   The extraction makes only one parsing pass over the HTML, rather
        than one for each type of metadata. It also attempts to be
        intelligent about only parsing as much of the HTML document as is
        required to give you the metadata that you need.

    *   Many of the types of metadata that *WWW::Blog::Metadata* extracts
        can be found in multiple places in an HTML document. This module
        does the work for you, and abstracts it all behind an API.

USAGE
  WWW::Blog::Metadata->extract_from_uri($uri)
    Given a URI *$uri* pointing to a weblog, fetches the page contents, and
    attempts to extract common metadata from that weblog.

    On error, returns "undef", and the error message can be obtained by
    calling *WWW::Blog::Metadata->errstr*.

    On success, returns a *WWW::Blog::Metadata* object.

  WWW::Blog::Metadata->extract_from_html($html [, $base_uri ])
    Uses the same extraction mechanism as *extract_from_uri*, but assumes
    that you've already fetched the HTML document and will provide it in
    *$html*, which should be a reference to a scalar containing the HTML.

    If you know the base URI of the document, you should provide it in
    *$base_uri*. *WWW::Blog::Metadata* will attempt to find the base URI of
    the document if it's specified in the HTML itself, but you can give it a
    head start by passing in *$base_uri*.

    This method has the same return value as *extract_from_uri*.

  $meta->feeds
    A reference to a hash of syndication feed URIs.

    (Note: these are currently extracted using *Feed::Find*, which requires
    a separate parsing step, and sort of renders the above benefit #1
    somewhat of a lie. This is done for maximum correctness, but it's
    possible it could change at some point.)

  $meta->foaf_uri
    The URI for a FOAF file, specified in the standard manner used for FOAF
    auto-discovery.

  $meta->lat
  $meta->lon
    The latitude and longitude specified for the weblog, from either *icbm*
    or *geo.position* *<meta />* tags.

  $meta->generator
    The tool that generated the weblog, from a *generator* *<meta />* tag.

PLUGINS
    There are endless amounts of metadata that you might want to extract
    from a weblog, and the methods above are only what are provided by
    default. If you'd like to extract more information, you can use
    *WWW::Blog::Metadata*'s plugin architecture to build access to the
    metadata that you want, while while making only one parsing pass over
    the HTML document.

    The plugin architecture uses *Module::Pluggable::Ordered*, and it
    provides 2 pluggable events:

    *   on_got_html

        This event is fired before the HTML document is parsed, so you
        should use it for extracting metadata after the page has been
        fetched (if you're using *extract_from_uri*), but before it's been
        parsed.

        Your method will receive 4 parameters: the class name; the
        *WWW::Blog::Metadata* object; a reference to a string containing the
        HTML document; and the base URI of the document.

        You could use this event to run heuristics on either the HTML or the
        URI, or both. The following example uses *WWW::Blog::Identify* to
        attempt to identify the true generator of the weblog:

            package WWW::Blog::Metadata::Flavor;
            use strict;

            use WWW::Blog::Identify qw( identify );
            use WWW::Blog::Metadata;
            WWW::Blog::Metadata->mk_accessors(qw( flavor ));

            sub on_got_html {
                my $class = shift;
                my($meta, $html, $base_uri) = @_;
                $meta->flavor( identify($base_uri, $$html) );
            }
            sub on_got_html_order { 99 }

            1;

        This automatically adds a new accessor to the *$meta* object that is
        returned from the *extract_from_** methods, so you can call

            my $meta = WWW::Blog::Metadata->extract_from_uri($uri);
            print $meta->flavor;

        to retrieve the name of the identified weblogging tool.

    *   on_got_tag

        This event is fired for each HTML tag found in the document during
        the parsing phase.

        Your method will receive 5 parameters: the class name; the
        *WWW::Blog::Metadata* object; the tag name; a reference to a hash
        containing the tag attributes; and the base URI.

        The following example looks for the specific tag identifying the URI
        for the RSD (Really Simple Discoverability) file identifying the
        editing APIs that the weblog supports.

            package WWW::Blog::Metadata::RSD;
            use strict;

            use WWW::Blog::Metadata;
            WWW::Blog::Metadata->mk_accessors(qw( rsd_uri ));

            sub on_got_tag {
                my $class = shift;
                my($meta, $tag, $attr, $base_uri) = @_;
                if ($tag eq 'link' && $attr->{rel} =~ /\bEditURI\b/i &&
                    $attr->{type} eq 'application/rsd+xml') {
                    $meta->rsd_uri(URI->new_abs($attr->{href}, $base_uri)->as_string);
                }
            }
            sub on_got_tag_order { 99 }

            1;

        This automatically adds a new accessor to the *$meta* object that is
        returned from the *extract_from_** methods, so you can call

            my $meta = WWW::Blog::Metadata->extract_from_uri($uri);
            print $meta->rsd_uri;

        to retrieve the URI for the RSD file.

    *   on_finished

        This event is fired at the end of the extraction process.

        Your method will receive 2 parameters: the class name, and the
        *WWW::Blog::Metadata* object.

LICENSE
    *WWW::Blog::Metadata* is free software; you may redistribute it and/or
    modify it under the same terms as Perl itself.

AUTHOR & COPYRIGHT
    Except where otherwise noted, *WWW::Blog::Metadata* is Copyright 2005
    Benjamin Trott, ben+cpan@stupidfool.org. All rights reserved.