NAME

Search::Tools::XML - methods for playing nice with XML and HTML

SYNOPSIS

 use Search::Tools::XML;
 
 my $class = 'Search::Tools::XML';
 
 my $text = 'the "quick brown" fox';
 
 my $xml = $class->start_tag('foo');
 
 $xml .= $class->utf8_safe( $text );
 
 $xml .= $class->end_tag('foo');
 
 # $xml: <foo>the &#34;quick brown&#34; fox</foo>
 
 $xml = $class->escape( $xml );
 
 # $xml: &lt;foo&gt;the &amp;#34;quick brown&amp;#34; fox&lt;/foo&gt;
 
 $xml = $class->unescape( $xml );
 
 # $xml: <foo>the "quick brown" fox</foo>
 
 my $plain = $class->no_html( $xml );
 
 # $plain eq $text
 
 

DESCRIPTION

IMPORTANT: The API for escape() and unescape() changed in version 0.16. Both methods now return the modified text instead of changing it in place, because in-place modification was less intuitive.

Search::Tools::XML provides utility methods for dealing with XML and HTML. There isn't really anything new here that CPAN doesn't provide via HTML::Entities or similar modules. The difference is convenience: the most common methods you need for search apps are in one place with no extra dependencies.

NOTE: To get the full UTF-8 character set from chr(), you must use Perl 5.8 or later. This affects the unescape* methods in particular.

VARIABLES

%Ents

Basic HTML/XML characters that must be escaped:

 '>' => '&gt;',
 '<' => '&lt;',
 '&' => '&amp;',
 '"' => '&quot;',
 "'" => '&apos;'
 

%HTML_ents

Complete map of all named HTML entities to their decimal values.

METHODS

The following methods may be accessed either as object or class methods.

new

Create a Search::Tools::XML object.

start_tag( string )

end_tag( string )

Returns string as a start or end tag. Any characters in string that are not valid in a tag name are escaped using tag_safe().

tag_safe( string )

Create a valid XML tag name, escaping/omitting invalid characters.

Example:

 my $tag = Search::Tools::XML->tag_safe( '1 * ! tag foo' );
 # $tag eq '______tag_foo'

utf8_safe( string )

Return string with special XML chars and all non-ASCII chars converted to numeric entities.

This is escape() on steroids. Do not use them both on the same text unless you know what you're doing. See the SYNOPSIS for an example.
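
To make the idea of numeric-entity conversion concrete, here is a minimal standalone sketch. It is not the module's implementation: the function name utf8_safe_sketch and the exact regular expression are assumptions chosen for illustration, based only on the behaviour described above and shown in the SYNOPSIS.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT part of Search::Tools::XML.
# Replaces the five XML-special characters and every non-ASCII
# character with a decimal numeric entity, returning a new string.
sub utf8_safe_sketch {
    my ($text) = @_;
    $text =~ s/([<>&"']|[^\x00-\x7F])/'&#' . ord($1) . ';'/ge;
    return $text;
}

my $in = "caf\x{e9} \"quoted\" \x{263A}";
print utf8_safe_sketch($in), "\n";
# prints: caf&#233; &#34;quoted&#34; &#9786;
```

Because the output already contains ampersands, running an escape step over it afterwards would turn each &#NNN; into &amp;#NNN;, which is the double-escaping the warning above refers to.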

no_html( text )

no_html() is a brute-force method for removing all tags and entities from text. It uses a simple regular expression, so constructs such as nested comments will probably break it. If you need to reliably filter the tags and entities out of an HTML document, use HTML::Parser or a similar module.

text is returned with no markup in it.
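
The limitation described above can be illustrated with a short standalone sketch. The function strip_tags_sketch and its pattern are assumptions made for this example and are not the module's actual code; the sketch only strips tags and leaves entities alone.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT part of Search::Tools::XML.
# Removes anything that looks like <...> and returns the rest.
sub strip_tags_sketch {
    my ($text) = @_;
    $text =~ s/<[^>]*>//g;
    return $text;
}

# Well-behaved markup is handled as expected.
print strip_tags_sketch('<p>Hello <b>world</b></p>'), "\n";
# prints: Hello world

# A '>' inside a comment ends the match early, so debris is left behind.
print strip_tags_sketch('<!-- a > b -->visible'), "\n";
# prints:  b -->visible
```

The second call shows why a real parser is the safer choice when the input is not under your control.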

escape( text )

Similar to escape() functions in more famous CPAN modules, but without the added dependency. escape() will convert the special XML chars (><'"&) to their entity equivalents. See %Ents.

The escaped text is returned.

IMPORTANT: The API for this method has changed as of version 0.16. text is no longer modified in-place.
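
As a rough illustration of map-driven escaping that returns a copy, consider the standalone sketch below. The hash %ents mirrors the five pairs listed under %Ents, but the hash name, the function escape_sketch, and the single-pass substitution are assumptions for this example, not the module's code.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the %Ents map documented above.
my %ents = (
    '>' => '&gt;',
    '<' => '&lt;',
    '&' => '&amp;',
    '"' => '&quot;',
    "'" => '&apos;',
);

# Hypothetical helper, NOT part of Search::Tools::XML.
# Works on a copy of the argument, so the caller's string is untouched.
sub escape_sketch {
    my ($text) = @_;
    # One pass over the string, so an inserted '&' is never escaped twice.
    $text =~ s/([<>&"'])/$ents{$1}/g;
    return $text;
}

my $orig    = q{<a href="x">Tom & Jerry's</a>};
my $escaped = escape_sketch($orig);

print "$escaped\n";
# prints: &lt;a href=&quot;x&quot;&gt;Tom &amp; Jerry&apos;s&lt;/a&gt;
print "$orig\n";
# prints: <a href="x">Tom & Jerry's</a>
```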

unescape( text )

Similar to unescape() functions in more famous CPAN modules, but without the added dependency. unescape() will convert all entities to their chr() equivalents.

NOTE: unescape() does more than reverse the effects of escape(). It attempts to resolve all entities, not just the special XML entities (><'"&).

IMPORTANT: The API for this method has changed as of version 0.16. text is no longer modified in-place.

unescape_named( text )

Replace all named HTML entities with their chr() equivalents.

text is modified in place.
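
A name-to-codepoint lookup of this kind might look like the following standalone sketch. The small hash %html_ents is a hypothetical six-entry excerpt in the spirit of %HTML_ents, and unescape_named_sketch is an invented name; unlike the method documented above, the sketch returns a copy rather than modifying its argument in place.

```perl
use strict;
use warnings;

# Hypothetical excerpt: entity name => decimal code point.
my %html_ents = (
    amp    => 38,
    lt     => 60,
    gt     => 62,
    quot   => 34,
    copy   => 169,
    eacute => 233,
);

# Hypothetical helper, NOT part of Search::Tools::XML.
# Known names become characters; unknown names are left as they are.
sub unescape_named_sketch {
    my ($text) = @_;
    $text =~ s/&(\w+);/exists $html_ents{$1} ? chr($html_ents{$1}) : "&$1;"/ge;
    return $text;
}

print unescape_named_sketch('&lt;b&gt; &amp; &bogus;'), "\n";
# prints: <b> & &bogus;
```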

unescape_decimal( text )

Replace all decimal entities with their chr() equivalents.

text is modified in place.
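
Decimal entity replacement with chr() can be sketched in a few lines. The function name unescape_decimal_sketch is an assumption for illustration, and the sketch returns a copy instead of modifying its argument in place as the method above is documented to do.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT part of Search::Tools::XML.
# Turns every &#NNN; into the character with that code point.
sub unescape_decimal_sketch {
    my ($text) = @_;
    $text =~ s/&#(\d+);/chr($1)/ge;
    return $text;
}

print unescape_decimal_sketch('the &#34;quick brown&#34; fox'), "\n";
# prints: the "quick brown" fox
```

Code points above 255 yield wide characters, which is where the Perl 5.8 requirement mentioned in the DESCRIPTION comes into play.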

AUTHOR

Peter Karman perl@peknet.com

Based on the CrayDoc regular expression building code, originally by the same author, copyright 2004 by Cray Inc.

COPYRIGHT

Copyright 2006 by Peter Karman. This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

SEE ALSO

Search::Tools