HTML::Parser::Simple - Parse nice HTML files without needing a compiler
    #!/usr/bin/env perl

    use strict;
    use warnings;

    use HTML::Parser::Simple;

    # -------------------------

    # Method 1:

    my($p) = HTML::Parser::Simple -> new
    (
        input_file  => 'data/s.1.html',
        output_file => 'data/s.2.html',
    );

    $p -> parse_file;

    # Method 2:

    my($p) = HTML::Parser::Simple -> new;

    $p -> parse_file('data/s.1.html', 'data/s.2.html');

    # Method 3:

    my($p) = HTML::Parser::Simple -> new;

    print $p -> parse('<html>...</html>') -> traverse($p -> root) -> result;
Of course, these can be abbreviated by using method chaining. E.g. Method 2 could be:
HTML::Parser::Simple -> new -> parse_file('data/s.1.html', 'data/s.2.html');
See scripts/parse.html.pl and scripts/parse.xhtml.pl.
HTML::Parser::Simple is a pure Perl module.
It parses HTML V 4 files, and generates a tree of nodes, with 1 node per HTML tag.
The data associated with each node is documented in the "FAQ".
See also HTML::Parser::Simple::Attributes and HTML::Parser::Simple::Reporter.
This module is available as a Unix-style distro (*.tgz).
See http://savage.net.au/Perl-modules.html for details.
See http://savage.net.au/Perl-modules/html/installing-a-module.html for help on unpacking and installing.
new(...) returns an object of type HTML::Parser::Simple.
This is the class constructor.
Usage: HTML::Parser::Simple -> new.
This method takes a hash of options.
Call new() as new(option_1 => value_1, option_2 => value_2, ...).
Available options (each one of which is also a method):
input_file: The file name, including the path, of the input file.
Default: '' (the empty string).
output_file: The file name, including the path, of the output file.
verbose: Takes either 0 or 1. With 1, progress messages are written; with 0, they are suppressed.
Default: 0.
xhtml: 0 means do not accept an XML declaration, such as <?xml version="1.0" encoding="UTF-8"?>, at the start of the input file, nor some other XHTML features, explained next.
1 means accept XHTML input.
The only XHTML changes to this code, so far, are:
E.g.: <?xml version="1.0" standalone='yes'?>.
E.g.: <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">.
Returns a hashref where the keys are the names of block-level HTML tags.
The corresponding values in the hashref are just 1.
Typical keys: address, form, p, table, tr.
Note: Some keys, e.g. tr, are also returned by "self_close()".
Returns the Tree::Simple object which the parser calls the current node.
Returns the nesting depth of the current tag.
The method is just here in case you need it.
Returns a hashref where the keys are the names of HTML tags of type empty.
Typical keys: area, base, input, wbr.
Returns a hashref where the keys are the names of HTML tags of type inline.
Typical keys: a, em, img, textarea.
Gets or sets the input file name used by "parse_file($input_file_name, $output_file_name)".
Note: The parameters passed in to "parse_file($input_file_name, $output_file_name)" take precedence over the input_file and output_file parameters passed in to new(), and over the internal values set with input_file($in_file_name) and output_file($out_file_name).
'input_file' is a parameter to "new()". See "Constructor and Initialization" for details.
Print $msg to STDERR if new() was called as new(verbose => 1), or if $p -> verbose(1) was called.
Otherwise, print nothing.
This is the constructor. See "Constructor and Initialization" for details.
Returns the type of the most recently created node, global, head, or body.
See the first question in the "FAQ" for details.
Gets or sets the output file name used by "parse_file($input_file_name, $output_file_name)".
'output_file' is a parameter to "new()". See "Constructor and Initialization" for details.
Returns the invocant. Thus $p -> parse returns $p. This allows for method chaining. See the "Synopsis".
Parses the string of HTML in $html, and builds a tree of nodes.
After calling $p -> parse($html), you must call $p -> traverse($p -> root) before calling $p -> result.
Alternately, use $p -> parse_file, which calls all these methods for you.
Note: parse() may be called directly or via parse_file().
Returns the invocant. Thus $p -> parse_file returns $p. This allows for method chaining. See the "Synopsis".
Parses the HTML in the input file, and writes the result to the output file.
parse_file() calls "parse($html)" and "traverse($node)", using $p -> root for $node.
Note: The parameters passed in to parse_file($input_file_name, $output_file_name) take precedence over the input_file and output_file parameters passed in to new(), and over the internal values set with input_file($in_file_name) and output_file($out_file_name).
Lastly, the parameters passed in to parse_file($input_file_name, $output_file_name) are used to update the internal values set with the input_file and output_file parameters passed in to new(), or set with calls to input_file($in_file_name) and output_file($out_file_name).
Returns the string which is the result of the parse.
See scripts/parse.html.pl.
Returns the Tree::Simple object which the parser calls the root of the tree of nodes.
Returns a hashref where the keys are the names of HTML tags of type self close.
Typical keys: dd, dt, p, tr.
Note: Some keys, e.g. tr, are also returned by "block()".
Returns a string to be used as a regexp, to capture tags and their optional attributes.
It does not return qr/$s/; it just returns $s.
This regexp takes one of two forms, depending on the state of the xhtml option. See "xhtml($Boolean)".
The regexp has four (4) sets of capturing parentheses:
E.g.: <(....)>
E.g.: <(img)...>
E.g.: <img (src="/graph.svg" alt="A graph")>
E.g.: <img ... (/)>
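The four captures can be illustrated with a stand-alone pattern. This is a simplified stand-in written for this example, not the string the module actually returns; the variable names are invented, and the real regexp may group things differently.

```perl
use strict;
use warnings;

# Hypothetical, simplified pattern held as a plain string (not qr//).
# Captures: 1 = everything inside <...>, 2 = tag name,
#           3 = attribute string,        4 = trailing '/', if any.
my $s = q{<((\w+)((?:\s+[-:\w]+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^>\s]+))?)*)\s*(/?))>};

my $tag = q{<img src="/graph.svg" alt="A graph" />};

my ($whole, $name, $attributes, $slash) = $tag =~ /\A$s\z/
    or die "Tag did not match\n";

print "whole:      [$whole]\n";
print "name:       [$name]\n";
print "attributes: [$attributes]\n";
print "slash:      [$slash]\n";
```

Because the pattern is kept as a string, the caller decides how to anchor and compile it, which is the same reason given above for returning $s rather than qr/$s/.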
Returns the invocant. Thus $p -> traverse returns $p. This allows for method chaining. See the "Synopsis".
Traverses the tree of nodes, starting at $node.
You normally call this as $p -> traverse($p -> root), to ensure all nodes are visited.
See the "Synopsis" for sample code.
Or, see scripts/traverse.file.pl, which uses HTML::Parser::Simple::Reporter, and calls traverse($node) via "traverse_file($input_file_name)" in HTML::Parser::Simple::Reporter.
Gets or sets the verbose parameter.
'verbose' is a parameter to "new()". See "Constructor and Initialization" for details.
Gets or sets the xhtml parameter.
If you call this after object creation, the trigger feature of Moos is used to call "tagged_attribute()" so as to correctly set the regexp which recognises xhtml.
'xhtml' is a parameter to "new()". See "Constructor and Initialization" for details.
The data of each node is a hash ref. The keys/values of this hash ref are:
This is the string of HTML attributes associated with the HTML tag.
Attributes are stored in lower-case.
So, <table align = 'center' summary = 'Body'> will have an attributes string of " align = 'center' summary = 'body'".
Note the leading space.
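If you need the individual attributes rather than the raw string, something like the following sketch can split it. It uses only a core regexp and invented variable names, and it is an illustration rather than the module's own attribute handling.

```perl
use strict;
use warnings;

# The attribute string from the example above, leading space included.
my $attributes = q{ align = 'center' summary = 'body'};

# Collect name/value pairs. Values may be double-quoted, single-quoted or bare.
my %attr;

while ($attributes =~ /([-:\w]+)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))/g)
{
    $attr{$1} = $2 // $3 // $4;
}

print "$_ => $attr{$_}\n" for sort keys %attr;
```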
This is an arrayref of bits and pieces of content.
Consider this fragment of HTML:
<p>I did <i>not</i> say I <i>liked</i> debugging.</p>
When parsing 'I did ', the number of child nodes (of <p>) is 0, since <i> has not yet been detected.
So, 'I did ' is stored in the 0th element of the arrayref belonging to <p>.
Likewise, 'not' is stored in the 0th element of the arrayref belonging to the node <i>.
Next, ' say I ' is stored in the 1st element of the arrayref belonging to <p>, because it follows the 1st child node (<i>).
Likewise, ' debugging' is stored in the 2nd element of the arrayref belonging to <p>.
This way, the input string can be reproduced by successively outputting the elements of the arrayref of content interspersed with the contents of the child nodes (processed recursively).
Note: If you are processing this tree, never forget that there can be content after the last child node has been closed, but before the current node is closed.
Note: The DOCTYPE declaration is stored as the 0th element of the content of the root node.
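The interleaving of content and child nodes described above can be sketched as follows. To keep the example self-contained, it uses plain nested hashes instead of Tree::Simple nodes, and the key names (name, content, children) and the render() function are assumptions made for this illustration.

```perl
use strict;
use warnings;

# Stand-in for the parsed tree of:
# <p>I did <i>not</i> say I <i>liked</i> debugging.</p>
my $tree =
{
    name     => 'p',
    content  => ['I did ', ' say I ', ' debugging.'],
    children =>
    [
        {name => 'i', content => ['not'],   children => []},
        {name => 'i', content => ['liked'], children => []},
    ],
};

sub render
{
    my($node)    = @_;
    my @content  = @{$node->{content}};
    my @children = @{$node->{children}};
    my $html     = "<$node->{name}>";

    # Element $i of the content precedes child $i.
    for my $i (0 .. $#children)
    {
        $html .= $content[$i] // '';
        $html .= render($children[$i]);
    }

    # Any remaining content follows the last child node.
    $html .= join('', @content[scalar(@children) .. $#content]);
    $html .= "</$node->{name}>";

    return $html;
}

print render($tree), "\n";
```

The final join is the step that is easy to forget: it emits the content that comes after the last child node has been closed but before the current node is closed.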
The nesting depth of the tag within the document.
The root is at depth 0, '<html>' is at depth 1, '<head>' and '<body>' are at depth 2, and so on.
It's just there in case you need it.
So, the tag '<html>' will mean the name is 'html'.
Tag names are stored in lower-case.
The root of the tree is called 'root', and holds the DOCTYPE, if any, as content.
The root has the node 'html' as the only child, of course.
This holds 'global' before '<head>' and between '</head>' and '<body>', and after '</body>'.
It holds 'head' for all nodes from '<head>' to '</head>', and holds 'body' from '<body>' to '</body>'.
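One way to picture this rule is as a small state machine driven by the tags as they arrive. The tag list and variable names below are invented for the illustration, and treating '<head>' and '<body>' themselves as 'head' and 'body' is an assumption; it is not the module's code.

```perl
use strict;
use warnings;

# A made-up stream of tags; a leading '/' marks a closing tag.
my @tags = qw(html head title /title /head body p /p /body /html);

my $node_type = 'global';
my @seen;

for my $tag (@tags)
{
    # Entering <head> or <body> switches the type.
    $node_type = 'head' if $tag eq 'head';
    $node_type = 'body' if $tag eq 'body';

    # Record the type in force when each opening tag is met.
    push @seen, "$tag=$node_type" unless $tag =~ m{^/};

    # Leaving </head> or </body> switches back to global.
    $node_type = 'global' if $tag eq '/head' || $tag eq '/body';
}

print "$_\n" for @seen;
```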
Tags are stored in lower-case, in a tree managed by Tree::Simple.
Attributes are stored in the same case as in the original HTML.
The root of the tree is returned by "root()".
They are treated as content. This includes the prefix '<!--' and the suffix '-->'.
It is treated as content belonging to the root of the tree.
No, never.
Up to V 4.
Make yourself a nice cup of tea, and then fix your page.
No.
For example, if you feed in an HTML page without the title tag, this module does not care.
There are various ways.
Sample output:
http://savage.net.au/Perl-modules/html/CreateTable.html.
Preferably, see the previous question, or...
Suggested steps:
Note: There are quite a few files involved. Proceed with caution.
Call this input.html.
Reveal.pl ships with HTML::Revelation.
Call the output file output.1.html.
parse.html.pl ships with HTML::Parser::Simple.
Call the output file parsed.html.
Call the output file output.2.html.
If they match, or even if they don't match, you're finished.
Help with quirks: http://www.quirksmode.org/sitemap.html.
Yes. If your HTML file is not nice, the interpretation of tag nesting will not match your preconceptions.
In such cases, do not seek to fix the code. Instead, fix your (faulty) preconceptions, and fix your HTML file.
The 'a' tag, for example, is defined to be an inline tag, but the 'div' tag is a block-level tag.
I don't define 'a' to be inline, others do, e.g. http://www.w3.org/TR/html401/ and hence HTML::Tagset.
Inline means:
<a href = "#NAME"><div class = 'global_toc_text'>NAME</div></a>
will not be parsed as an 'a' containing a 'div'.
The 'a' tag will be closed before the 'div' is opened. So, the result will look like:
<a href = "#NAME"></a><div class = 'global_toc_text'>NAME</div>
To achieve what was presumably intended, use 'span':
<a href = "#NAME"><span class = 'global_toc_text'>NAME</span></a>
Some people (*cough* *cough*) have had to redo their entire websites due to this very problem.
Of course, this is just one of a vast set of possible problems.
You have been warned.
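The closing behaviour described above can be sketched with a stack of open tags. The tag sets here are tiny and hypothetical, and normalise() is an invented helper that works on a pre-tokenised list of events; it only illustrates the rule "close open inline tags before opening a block-level tag", and is not the module's implementation.

```perl
use strict;
use warnings;

# Tiny illustrative tag sets; the real sets are much larger.
my %block  = map { $_ => 1 } qw(div p table);
my %inline = map { $_ => 1 } qw(a em span);

# Each event is ['open', $tag], ['close', $tag] or ['text', $string].
sub normalise
{
    my @events = @_;
    my @stack;
    my $out = '';

    for my $event (@events)
    {
        my($type, $value) = @$event;

        if ($type eq 'open')
        {
            if ($block{$value})
            {
                # A block-level tag closes any inline tags still open.
                $out .= '</' . pop(@stack) . '>' while @stack && $inline{$stack[-1]};
            }

            push @stack, $value;
            $out .= "<$value>";
        }
        elsif ($type eq 'close')
        {
            # Ignore a closing tag whose element has already been closed.
            next unless grep { $_ eq $value } @stack;

            # Close everything up to and including the matching tag.
            while (@stack)
            {
                my $top = pop @stack;
                $out .= "</$top>";
                last if $top eq $value;
            }
        }
        else
        {
            $out .= $value;
        }
    }

    return $out;
}

print normalise(['open', 'a'], ['open', 'div'], ['text', 'NAME'], ['close', 'div'], ['close', 'a']), "\n";
print normalise(['open', 'a'], ['open', 'span'], ['text', 'NAME'], ['close', 'span'], ['close', 'a']), "\n";
```

In the first call the 'a' is closed as soon as the 'div' opens, so the later closing 'a' tag is ignored; in the second, 'span' is inline and nests inside the 'a' as intended.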
During testing, Tree::Fast crashed, so I replaced it with Tree and everything worked. Spooky.
Late news: Tree does not cope with an arrayref stored in the metadata, so I've switched to Tree::DAG_Node.
Stop press: As an experiment I switched to Tree::Simple. Since it also works I'll just keep using it.
That name sounds like a pure Perl version of the same API as used by HTML::Parser.
But the APIs are not, and are not meant to be, compatible.
Some people might falsely assume HTML::Parser can automatically fall back to HTML::Parser::PurePerl in the absence of a compiler.
As always with OO code, sub-class! In this case, you write a new version of the traverse() method.
See HTML::Parser::Simple::Reporter, for example. It overrides "traverse($node)".
Alternately, implement another method in your sub-class, e.g. process(), which recurses like traverse(). Then call parse() and process().
Yes. See: git://github.com/ronsavage/html--parser--simple.git
I edit with UltraEdit. That means, in general, leading 4-space tabs.
All vertical alignment within lines is done manually with spaces.
Perl::Critic is off the agenda.
For this year's (2012) Google Code-in, I had a quick look at 122 class-building classes, and decided Moos was suitable, given it is pure-Perl and has the trigger feature I needed.
See http://savage.net.au/Module-reviews/html/gci.2012.class.builder.modules.html.
This Perl HTML parser has been converted from a JavaScript one written by John Resig.
http://ejohn.org/files/htmlparser.js.
Well done John!
Note also the comments published here:
http://groups.google.com/group/envjs/browse_thread/thread/edd9033b9273fa58.
HTML::Parser::Simple was written by Ron Savage <ron@savage.net.au> in 2009.
Home page: http://savage.net.au/index.html.
Australian copyright (c) 2009 Ron Savage.
All Programs of mine are 'OSI Certified Open Source Software'; you can redistribute them and/or modify them under the terms of The Artistic License, a copy of which is available at: http://www.opensource.org/licenses/index.html