HTML::Tidy::libXML - Tidy HTML via XML::LibXML
$Id: libXML.pm,v 0.2 2009/02/21 11:47:58 dankogai Exp dankogai $
use HTML::Tidy::libXML;
my $tidy = HTML::Tidy::libXML->new();
my $xml = $tidy->clean($html, $encoding); # clean enough as xml
my $xhtml = $tidy->clean($html, $encoding, 1); # clean enough for browsers
Nothing is exported.
Creates an object.
my $tidy = HTML::Tidy::libXML->new();
my $dom = $tidy->html2dom($string, $encoding);
This is analogous to
my $lx = XML::LibXML->new;
$lx->recover_silently(1);
my $dom = $lx->parse_html_string($string);
There is one major difference: HTML::Tidy::libXML does not trust a <meta http-equiv="content-type" content="text/html; charset=foo"> declaration, while XML::LibXML tries to use it. Consider this:
my $dom = $lx->parse_html_file('http://example.com');
This more or less works because XML::LibXML can fetch a document directly from a URL. But XML::LibXML does not honor the HTTP header. Here is the better practice:
require LWP::UserAgent;
require HTTP::Response::Encoding;
my $uri = shift || die;
my $res = LWP::UserAgent->new->get($uri);
die $res->status_line unless $res->is_success;
my $dom = $tidy->html2dom($res->content, $res->encoding);
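Why the encoding argument matters can be seen without any HTML parser. The following is a minimal sketch using the core Encode module; the byte string and the two charsets are assumptions chosen for illustration. The same bytes decode to different text depending on which charset is believed.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Illustrative input: the word "cafe" with an acute accent on the e,
# encoded as UTF-8 (what a hypothetical server actually sent).
my $bytes = "caf\xC3\xA9";

# Charset taken from the HTTP header (correct in this example).
my $from_header = decode('UTF-8', $bytes);

# Charset claimed by a misleading <meta> tag (wrong in this example).
my $from_meta = decode('ISO-8859-1', $bytes);

printf "header charset: %d characters\n", length $from_header;
printf "meta charset:   %d characters\n", length $from_meta;
```

With the header charset the text has 4 characters; with the wrong charset the two-byte sequence is read as two separate characters, giving 5.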
my $xml = $tidy->dom2xml($dom, $level);
Tidies $dom, which is an XML::LibXML::Document object, and returns an XML string. If the level is omitted, the resulting XML is good enough as XML: valid, but not very browser compliant (like <br clear="">, <a name="here" />). Set the level to 1 or above for tidier, browser-compliant XHTML.
my $xml = $tidy->html2xml($html, $encoding, $level);
This is shorthand for:
my $dom = $tidy->html2dom($html, $encoding);
my $xml = $tidy->dom2xml($dom, $level);
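To compare the two levels side by side, a small script along these lines may help. It is only a sketch: the sample HTML and the 'utf-8' encoding name are assumptions made for the example, and the module is loaded at run time so the script still runs where HTML::Tidy::libXML is not installed. No output is shown because the exact serialization depends on the installed libxml2.

```perl
use strict;
use warnings;

# Hypothetical sample input: an unclosed <p>, a bare <br>, an empty named anchor.
my $html = '<html><body><p>Hello<br clear=""><a name="here"></a>world</body></html>';

# Load the module at run time so the script degrades gracefully
# when HTML::Tidy::libXML is not available.
my $have_tidy = eval { require HTML::Tidy::libXML; 1 };

my ($xml, $xhtml);
if ($have_tidy) {
    my $tidy = HTML::Tidy::libXML->new();
    $xml   = $tidy->html2xml($html, 'utf-8');     # level omitted: clean enough as XML
    $xhtml = $tidy->html2xml($html, 'utf-8', 1);  # level 1: tidier, for browsers
    print "level omitted:\n$xml\n\nlevel 1:\n$xhtml\n";
}
else {
    print "HTML::Tidy::libXML is not installed; nothing to compare.\n";
}
```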
clean is an alias for html2xml.
This is what happened when I tried to tidy http://www.perl.com/ on my PowerBook Pro. See t/bench.pl for details.
                  Rate H::T H::T::LibXML(1) H::T::LibXML(0)
           H::T 96.2/s   --            -25%            -49%
H::T::LibXML(1)  128/s  33%              --            -31%
H::T::LibXML(0)  187/s  95%             46%              --
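The percentage cells read as row versus column. Assuming t/bench.pl uses cmpthese from the core Benchmark module (the table layout suggests it, but the script is not shown here), each cell is the row rate relative to the column rate. A minimal sketch of that arithmetic, using a hypothetical helper named relative_pct:

```perl
use strict;
use warnings;

# Hypothetical helper: how much faster (positive) or slower (negative)
# the row entry is compared with the column entry, in percent.
sub relative_pct {
    my ($row_rate, $col_rate) = @_;
    return 100 * ($row_rate - $col_rate) / $col_rate;
}

# Rates (runs per second) copied from the benchmark table.
my %rate = (
    'H::T'            => 96.2,
    'H::T::LibXML(1)' => 128,
    'H::T::LibXML(0)' => 187,
);

# H::T::LibXML(1) versus H::T
printf "%.0f%%\n", relative_pct($rate{'H::T::LibXML(1)'}, $rate{'H::T'});

# H::T::LibXML(0) versus H::T::LibXML(1)
printf "%.0f%%\n", relative_pct($rate{'H::T::LibXML(0)'}, $rate{'H::T::LibXML(1)'});
```

The two lines print 33% and 46%, matching the table. Because the rates in the table are themselves rounded, some cells recomputed this way can differ from the printed ones by a point.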
Dan Kogai, <dankogai at dan.co.jp>
Please report any bugs or feature requests to bug-html-tidy-libxml at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=HTML-Tidy-libXML. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.
You can find documentation for this module with the perldoc command.
perldoc HTML::Tidy::libXML
You can also look for information at:
RT: CPAN's request tracker
http://rt.cpan.org/NoAuth/Bugs.html?Dist=HTML-Tidy-libXML
AnnoCPAN: Annotated CPAN documentation
http://annocpan.org/dist/HTML-Tidy-libXML
CPAN Ratings
http://cpanratings.perl.org/d/HTML-Tidy-libXML
Search CPAN
http://search.cpan.org/dist/HTML-Tidy-libXML/
HTML::Tidy, XML::LibXML
Copyright 2009 Dan Kogai, all rights reserved.
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.