Module Version: 0.12

NAME

HTML::SimpleParse - a bare-bones HTML parser

SYNOPSIS

 use HTML::SimpleParse;

 # Parse the text into a simple tree
 my $p = HTML::SimpleParse->new( $html_text );
 $p->output;                 # Output the HTML verbatim
 
 $p->text( $new_text );      # Give it some new HTML to chew on
 $p->parse;                  # Parse the new HTML
 $p->output;

 my %attrs = HTML::SimpleParse->parse_args('A="xx" B=3');
 # %attrs is now ('A' => 'xx', 'B' => '3')

DESCRIPTION

This module is a simple HTML parser. It is similar in concept to HTML::Parser, but it differs from HTML::TreeBuilder in a couple of important ways.

First, HTML::TreeBuilder knows which tags can contain other tags, which start tags have corresponding end tags, which tags can exist only in the <HEAD> portion of the document, and so forth. HTML::SimpleParse does not know any of these things. It just finds tags and text in the HTML you give it, and it does not care about the specific content of these tags (though it does distinguish between different types of tags, such as comments, starting tags like <b>, ending tags like </b>, and so on).

Second, HTML::SimpleParse does not create a hierarchical tree of HTML content, but rather a simple linear list. It does not pay any attention to balancing start tags with corresponding end tags, or which pairs of tags are inside other pairs of tags.
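To make the "linear list" idea concrete, here is a minimal standalone sketch in plain Perl. It is not the module's implementation: the flat_tokens helper, its regular expression, and the type names it emits are assumptions made for illustration. It shows how HTML can be split into a flat sequence of typed pieces without checking whether tags are balanced or nested correctly.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT part of HTML::SimpleParse: split HTML into a
# flat list of [type, content] pairs without tracking any nesting.
sub flat_tokens {
    my ($html) = @_;
    my @tokens;
    # Try each alternative at the current position: comment, end tag,
    # start tag, then a run of plain text.
    while ($html =~ /\G(?:<!--(.*?)-->|<\/([^>]*)>|<([^>]*)>|([^<]+))/gcs) {
        if    (defined $1) { push @tokens, [ comment  => $1 ] }
        elsif (defined $2) { push @tokens, [ endtag   => $2 ] }
        elsif (defined $3) { push @tokens, [ starttag => $3 ] }
        else               { push @tokens, [ text     => $4 ] }
    }
    return @tokens;
}

# The <b> and <i> tags overlap incorrectly, but a flat tokenizer does
# not notice or care; it simply lists the pieces in document order.
my @tokens = flat_tokens('<b>bold <i>both</b> italic</i><!-- note -->');
printf "%-8s %s\n", @$_ for @tokens;
```

The sample input contains improperly overlapping <b> and <i> tags. A tree builder would have to repair or reject that, whereas a flat list simply records eight pieces in the order they appear.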

Because of these characteristics, you can make a very effective HTML filter by sub-classing HTML::SimpleParse. For example, to remove all comments from HTML:

 package NoComment;
 use HTML::SimpleParse;
 our @ISA = qw(HTML::SimpleParse);
 sub output_comment {}
 
 package main;
 NoComment->new($some_html)->output;
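Why an empty output_comment is enough to strip comments can be illustrated with a small standalone sketch. TinyFilter, TinyNoComment, and render are hypothetical names invented here, and the dispatch shown is only an assumption about how such a design might work, not a description of HTML::SimpleParse's internals. The idea is that each piece of the flat list is handed to a method named after its type, so overriding one method changes how that one kind of piece is emitted.

```perl
use strict;
use warnings;

# TinyFilter is a hypothetical stand-in, NOT HTML::SimpleParse itself.
package TinyFilter;

sub new {
    my ($class, @tokens) = @_;
    return bless { tokens => [@tokens] }, $class;
}

# Default handlers reproduce each token verbatim.
sub output_text    { $_[1] }
sub output_comment { "<!--$_[1]-->" }

# Build the result by calling output_<type> for every token in order.
sub render {
    my ($self) = @_;
    my $out = '';
    for my $tok (@{ $self->{tokens} }) {
        my ($type, $content) = @$tok;
        my $method = "output_$type";
        my $piece  = $self->$method($content);
        $out .= $piece if defined $piece;
    }
    return $out;
}

package TinyNoComment;
our @ISA = ('TinyFilter');
sub output_comment { }   # returns nothing, so comments vanish

package main;

my @tokens = ( [ text => 'Hello' ], [ comment => ' hidden ' ], [ text => ' world' ] );
print TinyFilter->new(@tokens)->render,    "\n";   # Hello<!-- hidden --> world
print TinyNoComment->new(@tokens)->render, "\n";   # Hello world
```

Because the subclass replaces only one handler, everything else passes through unchanged, which is what makes this style of filter short to write.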

Historically, I started the HTML::SimpleParse project in part because of a misunderstanding about HTML::Parser's functionality. Many aspects of these two modules actually overlap. I continue to maintain the HTML::SimpleParse module because people seem to be depending on it, and because beginners sometimes find HTML::SimpleParse to be simpler than HTML::Parser's more powerful interface. People also seem to get a fair amount of usage out of the parse_args() method directly.

Methods

The following methods do the actual outputting of the various parts of the HTML. Override some of them if you want to change the way the HTML is output. For instance, to strip comments from the HTML, override the output_comment method like so:

 # In subclass:
 sub output_comment { }  # Does nothing

CAVEATS

Please do not assume that the interface here is stable. This is a first pass, and I'm still trying to incorporate suggestions from the community. If you employ this module somewhere, make doubly sure before upgrading that none of your code breaks when you use the newer version.

BUGS

TO DO

AUTHOR

Ken Williams <ken@forum.swarthmore.edu>

COPYRIGHT

Copyright 1998 Swarthmore College. All rights reserved.

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
