Search::Tools::XML - methods for playing nice with XML and HTML
    use Search::Tools::XML;

    my $class = 'Search::Tools::XML';
    my $text  = 'the "quick brown" fox';

    my $xml = $class->start_tag('foo');
    $xml .= $class->utf8_safe( $text );
    $xml .= $class->end_tag('foo');
    # $xml: <foo>the &#34;quick brown&#34; fox</foo>

    $xml = $class->escape( $xml );
    # $xml: &lt;foo&gt;the &amp;#34;quick brown&amp;#34; fox&lt;/foo&gt;

    $xml = $class->unescape( $xml );
    # $xml: <foo>the "quick brown" fox</foo>

    my $plain = $class->no_html( $xml );
    # $plain eq $text
IMPORTANT: The API for escape() and unescape() changed in version 0.16. Neither method modifies its argument in place any longer, because that behavior proved unintuitive; each returns the modified text instead.
Search::Tools::XML provides utility methods for dealing with XML and HTML. There isn't really anything new here that CPAN doesn't provide via HTML::Entities or similar modules. The difference is convenience: the most common methods you need for search apps are in one place with no extra dependencies.
NOTE: To get full UTF-8 character set from chr() you must be using Perl >= 5.8. This affects things like the unescape* methods.
Basic HTML/XML characters that must be escaped:
    '>' => '&gt;',
    '<' => '&lt;',
    '&' => '&amp;',
    '"' => '&quot;',
    "'" => '&apos;'
Complete map of all named HTML entities to their decimal values.
The following methods may be accessed either as object or class methods.
Create a Search::Tools::XML object.
start_tag() and end_tag() return string as a start tag or an end tag, respectively. Any characters in string that are not valid in a tag name are escaped using tag_safe().
tag_safe() creates a valid XML tag name, escaping or omitting invalid characters.
Example:
    my $tag = Search::Tools::XML->tag_safe( '1 * ! tag foo' );
    # $tag eq '______tag_foo'
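To make the idea of a "safe" tag name concrete, here is a minimal sketch using only core Perl. The helper name sketch_tag_safe() and its two substitution rules are assumptions made for illustration; this is not the module's own code, and tag_safe() may treat some characters differently.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration; NOT the module's tag_safe().
sub sketch_tag_safe {
    my ($name) = @_;

    # Replace every character outside a conservative ASCII name alphabet.
    $name =~ s/[^A-Za-z0-9_.:-]/_/g;

    # An XML name may not begin with a digit, '.' or '-'.
    $name =~ s/^[^A-Za-z_:]/_/;

    return $name;
}

print sketch_tag_safe('1 * ! tag foo'), "\n";
```

The sketch replaces rather than omits invalid characters, so the result always has the same length as the input.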
utf8_safe() returns string with the special XML chars and all non-ASCII chars converted to numeric entities.
This is escape() on steroids. Do not use them both on the same text unless you know what you're doing. See the SYNOPSIS for an example.
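The conversion to numeric entities can be sketched in a few lines of core Perl. The helper name sketch_numeric_entities() is hypothetical, and the regular expression is an assumption about what "special XML chars and all non-ASCII chars" covers; the module's own implementation may differ.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration; NOT the module's utf8_safe().
sub sketch_numeric_entities {
    my ($text) = @_;

    # Turn the five special XML characters and every character above
    # code point 127 into a decimal entity such as &#233;
    $text =~ s/([<>&"']|[^\x00-\x7F])/'&#' . ord($1) . ';'/ge;

    return $text;
}

# \x{e9} is LATIN SMALL LETTER E WITH ACUTE (decimal 233).
print sketch_numeric_entities(qq{caf\x{e9} "ok"}), "\n";
```

Because the output contains '&' characters, passing it through an escaping routine afterwards would turn each entity into something like &amp;#233;, which is why the two steps should not normally be combined.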
no_html() is a brute-force method for removing all tags and entities from text. It uses a simple regular expression, so constructs such as nested comments will probably break it. If you need to reliably filter the tags and entities out of HTML text, use HTML::Parser or similar.
text is returned with no markup in it.
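The following sketch shows what a regular-expression approach of this kind might look like, and why it is fragile. The helper name sketch_no_html() and both patterns are assumptions for illustration only; they are not taken from the module.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration; NOT the module's no_html().
sub sketch_no_html {
    my ($text) = @_;

    $text =~ s/<[^>]*>//g;     # drop anything that looks like a tag
    $text =~ s/&#?\w+;//g;     # drop anything that looks like an entity

    return $text;
}

# Simple markup is handled as expected.
print sketch_no_html('<b>bold</b> &amp; <i>it</i>'), "\n";

# A comment containing '>' defeats the tag pattern: the match stops at
# the first '>', so part of the comment leaks into the output.
print sketch_no_html('<!-- a > b -->text'), "\n";
```

The second call illustrates the kind of input for which a real parser such as HTML::Parser is the safer choice.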
Similar to escape() functions in more famous CPAN modules, but without the added dependency. escape() will convert the special XML chars (><'"&) to their entity equivalents. See %Ents.
The escaped text is returned.
IMPORTANT: The API for this method has changed as of version 0.16. text is no longer modified in-place.
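A minimal sketch of map-based escaping with return-a-copy semantics follows. The %ents hash is a stand-in for the module's entity map, and sketch_escape() is a hypothetical name; neither is the module's actual code.

```perl
use strict;
use warnings;

# Stand-in for the module's entity map (an assumption for illustration).
my %ents = (
    '>' => '&gt;',
    '<' => '&lt;',
    '&' => '&amp;',
    '"' => '&quot;',
    "'" => '&apos;',
);

# Hypothetical helper; NOT the module's escape().
sub sketch_escape {
    my ($text) = @_;    # $text is a copy, so the caller's string is untouched

    # One pass over the string, so the '&' in a replacement is never re-escaped.
    $text =~ s/([<>&"'])/$ents{$1}/g;

    return $text;
}

my $in  = q{<a href="x">Tom & Jerry</a>};
my $out = sketch_escape($in);

print "$out\n";    # escaped copy
print "$in\n";     # original, unchanged
```

Returning a copy means the result can be assigned back to the same variable, as in the SYNOPSIS, or kept alongside the original.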
Similar to unescape() functions in more famous CPAN modules, but without the added dependency. unescape() will convert all entities to their chr() equivalents.
NOTE: unescape() does more than reverse the effects of escape(). It attempts to resolve all entities, not just the special XML entities (><'"&).
Replace all named HTML entities with their chr() equivalents.
text is modified in place.
Replace all decimal entities with their chr() equivalents.
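Resolving named and decimal entities can be sketched as two small substitutions. Everything here is an assumption for illustration: the helper names are hypothetical, %named holds only a handful of entries rather than the complete HTML entity map, and both helpers return a copy instead of modifying their argument.

```perl
use strict;
use warnings;

# Tiny stand-in for a full name-to-decimal entity map.
my %named = (
    lt => 60, gt => 62, amp => 38, quot => 34, apos => 39, eacute => 233,
);

# Hypothetical helper: replace known named entities, leave unknown ones alone.
sub sketch_unescape_named {
    my ($text) = @_;
    $text =~ s/&(\w+);/exists $named{$1} ? chr( $named{$1} ) : "&$1;"/ge;
    return $text;
}

# Hypothetical helper: replace decimal entities such as &#34;
sub sketch_unescape_decimal {
    my ($text) = @_;
    $text =~ s/&#(\d+);/chr($1)/ge;
    return $text;
}

my $escaped = '&lt;p&gt;&amp;#34;hi&amp;#34;&lt;/p&gt;';
my $step1   = sketch_unescape_named($escaped);     # <p>&#34;hi&#34;</p>
my $step2   = sketch_unescape_decimal($step1);     # <p>"hi"</p>

print "$step1\n$step2\n";
```

Running the named pass before the decimal pass is what lets a doubly encoded quote such as &amp;#34; come back as a plain double quote, which is one way resolving all entities goes beyond reversing a single escaping step.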
Peter Karman perl@peknet.com
Based on the CrayDoc regular expression building code, originally by the same author, copyright 2004 by Cray Inc.
Copyright 2006 by Peter Karman. This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
Search::Tools