Module Version: 1.11

NAME

YAPE::HTML - Yet Another Parser/Extractor for HTML

SYNOPSIS

  use YAPE::HTML;
  use strict;
  
  my $content = "<html>...</html>";
  my $parser = YAPE::HTML->new($content);
  my ($extor,@fonts,@urls,@headings,@comments);
  
  # here is the tokenizing part
  while (my $chunk = $parser->next) {
    if ($chunk->type eq 'tag' and $chunk->tag eq 'font') {
      if (my $face = $chunk->get_attr('face')) {
        push @fonts, $face;
      }
    }
  }
  
  # here we catch any errors
  unless ($parser->done) {
    die sprintf "bad HTML: %s (%s)",
      $parser->error, $parser->chunk;
  }
  
  # here is the extracting part
  
  # <A> tags with HREF attributes
  # <IMG> tags with SRC attributes
  $extor = $parser->extract(a => ['href'], img => ['src']);
  while (my $chunk = $extor->()) {
    push @urls, $chunk->get_attr(
      $chunk->tag eq 'a' ? 'href' : 'src'
    );
  }
  
  # <H1>, <H2>, ..., <H6> tags
  $extor = $parser->extract(qr/^h[1-6]$/ => []);
  while (my $chunk = $extor->()) {
    push @headings, $chunk;
  }
  
  # all comments
  $extor = $parser->extract(-COMMENT => []);
  while (my $chunk = $extor->()) {
    push @comments, $chunk;
  }

YAPE MODULES

The YAPE hierarchy of modules is an attempt at a unified means of parsing and extracting content. It maintains a generic interface to promote simplicity and reusability. The modules tokenize their input (a step that can be intercepted) and build trees, so that specific nodes can be extracted.

DESCRIPTION

This module is yet another parser and tree-builder for HTML documents. It is designed to make extraction and modification of HTML documents simple. The API allows for easy custom additions to the document being parsed, and allows very specific tag, text, and comment extraction.

USAGE

In addition to the base class, YAPE::HTML, there is the auxiliary class YAPE::HTML::Element (common to all YAPE base classes) that holds the individual nodes' classes. The node classes are described in that module's documentation.

HTML element names and attribute names are stored internally as lowercase strings; attribute values keep their original case. For example, the tag <A HREF="FooBar.html"> is stored as

  {
    TAG => 'a',
    ATTR => {
      href => 'FooBar.html',
    }
  }

This means that tags are output in lowercase. A future version is planned to add an option for uppercase output.
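To make the storage convention concrete, here is a minimal sketch in core Perl that builds a hash of the same shape from a single opening tag. The normalize_tag helper is hypothetical and is not part of YAPE::HTML; it only handles double-quoted attributes and is not how the module itself parses.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT part of YAPE::HTML: turn one opening tag with
# double-quoted attributes into a hash shaped like the one shown above.
sub normalize_tag {
    my ($html) = @_;

    # Capture the element name and everything up to the closing '>'.
    my ($name, $rest) = $html =~ /^<\s*([A-Za-z][A-Za-z0-9]*)(.*?)>$/s
        or return;

    my %attr;
    while ($rest =~ /([A-Za-z_:][-\w:.]*)\s*=\s*"([^"]*)"/g) {
        $attr{ lc $1 } = $2;    # lowercase the name, keep the value as-is
    }

    return { TAG => lc $name, ATTR => \%attr };
}

my $node = normalize_tag('<A HREF="FooBar.html">');
printf "%s %s\n", $node->{TAG}, $node->{ATTR}{href};
```

Only the element and attribute names are lowercased; the value 'FooBar.html' keeps its case, matching the structure shown above.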

Functions

Methods for YAPE::HTML

Extracting Sections

YAPE::HTML allows comprehensive extraction of tags, text, comments, DTDs, processing instructions (PIs), and server-side includes (SSIs), using a simple, yet rich, syntax:

  my $extor = $parser->extract(
    TYPE => [ REQS ],
    ...
  );

TYPE can be the name of a tag ("table"), a regular expression that matches tag names (qr/^t[drh]$/), or one of the special strings that match all tags (-TAG), all text (-TEXT), all comments (-COMMENT), all DTDs (-DTD), all PIs (-PI), or all SSIs (-SSI).

REQS varies from element to element:

Here are some example uses:
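As a minimal sketch that reuses only the calls shown in the SYNOPSIS, the following exercises the three kinds of TYPE: a tag name with a required attribute, a regular expression, and a special string. The sample document and the variable names are made up for illustration, and the sketch assumes YAPE::HTML is installed and accepts this markup.

```perl
use strict;
use warnings;
use YAPE::HTML;    # assumes the module is installed from CPAN

# Made-up sample document, for illustration only.
my $content = join '',
    '<html><body>',
    '<h1>Title</h1>',
    '<a href="FooBar.html">a link</a>',
    '<a name="top">an anchor without href</a>',
    '<!-- a comment -->',
    '</body></html>';

my $parser = YAPE::HTML->new($content);

# Tokenize first and check for errors, following the SYNOPSIS.
1 while $parser->next;
die sprintf "bad HTML: %s (%s)", $parser->error, $parser->chunk
    unless $parser->done;

my (@urls, @headings, @comments);

# Tag name as TYPE, with a required attribute in REQS.
my $extor = $parser->extract(a => ['href']);
while (my $chunk = $extor->()) {
    push @urls, $chunk->get_attr('href');
}

# Regular expression as TYPE, with an empty REQS list.
$extor = $parser->extract(qr/^h[1-6]$/ => []);
while (my $chunk = $extor->()) {
    push @headings, $chunk;
}

# Special string as TYPE.
$extor = $parser->extract(-COMMENT => []);
while (my $chunk = $extor->()) {
    push @comments, $chunk;
}

printf "%d link(s), %d heading(s), %d comment(s)\n",
    scalar @urls, scalar @headings, scalar @comments;
```

Each call to extract returns a code reference that yields one matching node per call and a false value when none remain, so the same while loop works for every TYPE.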

FEATURES

This is a list of special features of YAPE::HTML.

TO DO

This is a listing of things to add to future versions of this module.

API

Internals

BUGS

Following is a list of known or reported bugs.

Fixed

Pending

SUPPORT

Visit YAPE's web site at http://www.pobox.com/~japhy/YAPE/.

SEE ALSO

The YAPE::HTML::Element documentation, for information on the node classes.

AUTHOR

  Jeff "japhy" Pinyan
  CPAN ID: PINYAN
  japhy@pobox.com
  http://www.pobox.com/~japhy/