NAME

HTML::Tidy::libXML - Tidy HTML via XML::LibXML

VERSION

$Id: libXML.pm,v 0.2 2009/02/21 11:47:58 dankogai Exp dankogai $

SYNOPSIS

  use HTML::Tidy::libXML;
  my $tidy = HTML::Tidy::libXML->new();
  my $xml   = $tidy->clean($html, $encoding);    # clean enough as xml
  my $xhtml = $tidy->clean($html, $encoding, 1); # clean enough for browsers

EXPORT

none.

FUNCTIONS

new

Creates an object.

  my $tidy = HTML::Tidy::libXML->new();

html2dom

  my $dom = $tidy->html2dom($string, $encoding);

This is analogous to

  my $lx = XML::LibXML->new;
  $lx->recover_silently(1);
  my $dom = $lx->parse_html_string($string);

There is one major difference, though. HTML::Tidy::libXML does not trust <meta http-equiv="content-type" content="text/html; charset=foo">, while XML::LibXML tries to use it. Consider this:

  my $dom = $lx->parse_html_file('http://example.com');

This more or less works, since XML::LibXML is capable of fetching the document directly. But XML::LibXML does not honor the HTTP headers. Here is the better practice:

  require LWP::UserAgent;
  require HTTP::Response::Encoding;
  my $uri = shift || die;
  my $res = LWP::UserAgent->new->get($uri);
  die $res->status_line unless $res->is_success;
  my $dom = $tidy->html2dom($res->content, $res->encoding);
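
The same idea can be tried without a network round trip. The following is a minimal sketch, not taken from the module's distribution: the page content is made up, and its <meta> tag deliberately names a different charset than the one the bytes are actually encoded in. The encoding passed to html2dom plays the role of the one reported by the HTTP response. No expected output is shown, since the exact result depends on the module's internals.

```perl
use strict;
use warnings;
use Encode qw(encode);
use HTML::Tidy::libXML;

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical page: the bytes are UTF-8, but the <meta> tag claims ISO-8859-1.
my $text = "caf\x{e9}";
my $html = encode(
    'UTF-8',
    '<html><head>'
      . '<meta http-equiv="content-type" content="text/html; charset=iso-8859-1">'
      . "<title>$text</title></head><body><p>$text</p></body></html>"
);

my $tidy = HTML::Tidy::libXML->new();

# Pass the encoding we know to be right (here: what a server would have
# sent in its Content-Type header) instead of relying on the <meta> tag.
my $dom = $tidy->html2dom( $html, 'utf-8' );

print ref($dom), "\n";
print $dom->findvalue('//title'), "\n";
```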

dom2xml

  my $xml = $tidy->dom2xml($dom, $level);

Tidies $dom, which is an XML::LibXML::Document object, and returns an XML string. If the level is omitted, the resulting XML is good enough as XML: valid, but not very browser-compliant (it may contain constructs like <br clear=""> or <a name="here" />). Set the level to 1 or above for tidier, browser-compliant XHTML.
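
To see what the level changes for a given document, both variants can be printed side by side. This is a minimal sketch using only the methods documented here; the HTML fragment is a made-up example, and the document is parsed once per call on the assumption that dom2xml may modify the DOM it is given.

```perl
use strict;
use warnings;
use HTML::Tidy::libXML;

# Hypothetical untidy input: unclosed <p>, bare <br>, empty named anchor.
my $html =
  '<html><body><p>Hello<br clear=all>world<a name="here"></a></body></html>';
my $encoding = 'utf-8';

my $tidy = HTML::Tidy::libXML->new();

# Level omitted: good enough as XML.
my $as_xml = $tidy->dom2xml( $tidy->html2dom( $html, $encoding ) );

# Level 1: tidier output intended for browsers.
my $as_xhtml = $tidy->dom2xml( $tidy->html2dom( $html, $encoding ), 1 );

print "level omitted:\n$as_xml\n\n";
print "level 1:\n$as_xhtml\n";
```

Comparing the two printouts is the quickest way to decide which level a particular consumer needs.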

html2xml

  my $xml = $tidy->html2xml($html, $encoding, $level);

This is shorthand for:

  my $dom = $tidy->html2dom($html, $encoding);
  my $xml = $tidy->dom2xml($dom, $level);

clean

An alias for html2xml.

BENCHMARK

This is what happened when I tried to tidy http://www.perl.com/ on my PowerBook Pro. See t/bench.pl for details.

                    Rate            H::T H::T::LibXML(1) H::T::LibXML(0)
  H::T            96.2/s              --            -25%            -49%
  H::T::LibXML(1)  128/s             33%              --            -31%
  H::T::LibXML(0)  187/s             95%             46%              --
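
A comparison chart in this shape can be produced with the core Benchmark module. The sketch below is not the t/bench.pl script from the distribution: it compares only the two levels of HTML::Tidy::libXML, uses a made-up in-memory document instead of a fetched page, and runs a fixed, arbitrarily chosen number of iterations, so its numbers are not comparable to the table above.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use HTML::Tidy::libXML;

# Hypothetical input; a real benchmark would use a real page instead.
my $html =
  '<html><body>' . ( '<p>item<br clear=all>text' x 200 ) . '</body></html>';
my $encoding = 'utf-8';

my $tidy = HTML::Tidy::libXML->new();

# Run each variant 200 times; cmpthese prints the chart and also
# returns its rows as an array reference.
my $rows = cmpthese(
    200,
    {
        'H::T::libXML(0)' => sub { $tidy->clean( $html, $encoding ) },
        'H::T::libXML(1)' => sub { $tidy->clean( $html, $encoding, 1 ) },
    }
);
```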

AUTHOR

Dan Kogai, <dankogai at dan.co.jp>

BUGS

Please report any bugs or feature requests to bug-html-tidy-libxml at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=HTML-Tidy-libXML. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

SUPPORT

You can find documentation for this module with the perldoc command.

    perldoc HTML::Tidy::libXML

ACKNOWLEDGEMENTS

HTML::Tidy, XML::LibXML

COPYRIGHT & LICENSE

Copyright 2009 Dan Kogai, all rights reserved.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.