NAME

Daizu::HTML - functions for handling HTML and XHTML content

FUNCTIONS

The following functions are available for export from this module. None of them are exported by default.

dom_body_to_html4($doc, [$start_node], [$end_node])

Given an XML::LibXML::Document object for an XHTML document fragment, whose root element should be body, returns a string representation of the content in HTML 4 format.

$start_node and $end_node are both independently optional. If either is present then only part of the document will be presented in the HTML output. Both must be either undef or a node from the root (body) element of the document. $start_node should be the first node to be shown in the output, or undef to start from the beginning. $end_node should be the node after the last node to be output, or undef to end at the end of the document.
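The range semantics above (start node included, end node excluded) can be illustrated with XML::LibXML alone. This is only a sketch of the selection rule, not the module's implementation: the helper name nodes_in_range and the sample document are hypothetical, and it assumes both nodes are direct children of the body element.

```perl
use strict;
use warnings;
use XML::LibXML;

# Made-up sample body with four paragraphs and no whitespace between them.
my $doc = XML::LibXML->load_xml(string =>
    '<body xmlns="http://www.w3.org/1999/xhtml">'
  . '<p>one</p><p>two</p><p>three</p><p>four</p></body>');
my $body     = $doc->documentElement;
my @children = $body->childNodes;

# Hypothetical helper: collect the children of $body from $start (inclusive)
# up to $end (exclusive).  An undef $start means "from the first child" and
# an undef $end means "to the last child".
sub nodes_in_range {
    my ($body, $start, $end) = @_;
    my @selected;
    my $node = defined $start ? $start : $body->firstChild;
    while (defined $node) {
        last if defined $end && $node->isSameNode($end);
        push @selected, $node;
        $node = $node->nextSibling;
    }
    return @selected;
}

# Start at the second paragraph, stop before the fourth.
my @range      = nodes_in_range($body, $children[1], $children[3]);
my $range_text = join ',', map { $_->textContent } @range;
print "$range_text\n";
```

Going by the signature documented above, the corresponding call to the module would be dom_body_to_html4($doc, $children[1], $children[3]), which should output only the second and third paragraphs.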

dom_node_to_html4($node)

Used by the dom_body_to_html4() function above to process individual nodes. The argument should be an XML::LibXML::Node object of some kind. Returns a string of HTML 4 markup in which, for example, text content is properly escaped.

dom_body_to_text($doc)

Given an XHTML body (as an XML::LibXML::Document object in the usual format), return a plain text version of the content, with some markup translated into text formatting in a limited way to make it reasonably readable.

dom_filtered_for_feeds($doc)

Return a new version of the article content in $doc, with markup removed that isn't relevant or might be unwelcome in feed content, such as script elements and style attributes. span elements are also removed, because they're not needed when there's no custom styling and Bloglines currently turns them into invalid HTML. class attributes are removed too, in case they cause unexpected styling to be applied.

In addition, any elements in the Daizu HTML extension namespace are removed. Elements in other non-XHTML namespaces will cause this function to fail. They shouldn't be there by the time the content is being output anyway.

Both $doc and the return value are XML::LibXML::Document objects of the kind returned by the article_doc() method in Daizu::File. The original DOM in $doc is not altered. The return value is a completely independent copy.

absolutify_links($doc, $base_url)

Given an XHTML document (as an XML::LibXML::Document object), find all the attributes in the markup which are relative URLs and turn them into absolute URLs relative to $base_url. This can be used to prepare content from an article to be published in a different place with a different URL, such as in an RSS feed or on an index page, while ensuring that any links or embedded files continue to work.

The document's elements must be in the XHTML namespace, or they will be ignored.

TODO - some of this could be refactored with the link replacing stuff in Daizu::Preview to be more thorough. For now though it just works on 'a href' and 'img src', since that will catch almost all cases.
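How a relative URL resolves against $base_url can be sketched with the URI module from CPAN. This shows only the resolution step for individual attribute values. The use of URI is an assumption (this documentation doesn't say which library Daizu uses), and the helper name sketch_absolutify and the example URLs are made up.

```perl
use strict;
use warnings;
use URI;

# Made-up base URL standing in for the article's published location.
my $base_url = 'http://example.com/blog/2006/article.html';

# Hypothetical helper: resolve one attribute value against the base URL.
# Values which are already absolute URLs come back unchanged.
sub sketch_absolutify {
    my ($url, $base) = @_;
    return URI->new_abs($url, $base)->as_string;
}

my @resolved = map { sketch_absolutify($_, $base_url) } (
    'notes.html',                   # relative to the article's directory
    '../img/pic.png',               # one directory up
    '/about/',                      # relative to the site root
    'http://other.example/page',    # already absolute
);
print "$_\n" for @resolved;
```

In this sketch the first three values become URLs on example.com, and the last is left as it was, which is what keeps links and embedded images working when the content is republished elsewhere.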

html_escape_text($text)

Escape $text in a way which makes it safe to include in the content of HTML or XML elements. The characters <, >, and & are escaped. Returns the new value.

The output may not be suitable for including as the value of an HTML or XML attribute.

The return value is always formatted as bytes encoded in UTF-8.
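The behaviour described for html_escape_text() can be approximated in a few lines of plain Perl. The following is a minimal sketch, not the module's actual code: the name sketch_escape_text is hypothetical, and using the core Encode module for the UTF-8 step assumes the input is a character string.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Hypothetical stand-in for html_escape_text(); not Daizu's own code.
sub sketch_escape_text {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;    # '&' first, so the entities added below are not re-escaped
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return encode('UTF-8', $text);    # assumption: characters in, UTF-8 bytes out
}

my $text_out = sketch_escape_text('1 < 2 & 3 > 2');
print "$text_out\n";    # 1 &lt; 2 &amp; 3 &gt; 2
```

Double quotes pass through this sketch untouched, which is why output of this kind is not safe to use inside an attribute value.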

html_escape_attr($text)

Escape $text in a way which makes it safe to include in the content of HTML or XML elements, or the values of HTML or XML attributes in double quotes. The characters <, >, &, and " are escaped. Returns the new value.

The return value is always formatted as bytes encoded in UTF-8.
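A sketch of the attribute-safe variant differs from the text-only escaping by one extra substitution, for the double quote. As before this is an illustration and not the module's code: sketch_escape_attr is a hypothetical name, and the Encode step assumes character-string input.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Hypothetical stand-in for html_escape_attr(); not Daizu's own code.
sub sketch_escape_attr {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;     # '&' first, so the entities added below are not re-escaped
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;    # the extra step needed inside double-quoted attributes
    return encode('UTF-8', $text);
}

my $title = 'The "best" <b> & more';
my $tag   = '<a href="/x" title="' . sketch_escape_attr($title) . '">link</a>';
print "$tag\n";
```

Single quotes are not escaped here, so the value is only safe when the attribute is delimited with double quotes, matching the restriction stated above.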

COPYRIGHT

This software is copyright 2006 Geoff Richards <geoff@laxan.com>. For licensing information see this page:

http://www.daizucms.org/license/