NAME

HTML::ListScraper::Interactive - formatting data from HTML::ListScraper

FUNCTIONS

format_tags

Formats a tag sequence to emphasize its tree-like structure. Takes 2 or 3 parameters: an HTML::ListScraper object, a reference to an array of HTML::ListScraper::Tag objects, and an optional hash of formatting options. format_tags returns a list of formatted tag names and text (or an array reference when called in scalar context).

The formatting options are

attr

Include the href attribute in the output.

text

Include the plain text in the output.

index

Include tag positions in the output.

The returned values are essentially lines of XHTML: opening tags, text with quoted entities, and closing tags. Tags are enclosed in angle brackets. The returned values do not necessarily form a valid XML fragment, though, for example because the input tags need not form a tree.

When index is set, each tag value starts with the tag's index, followed by a tab. Spaces then show the indentation. An opening tag that is not identified as missing its closing tag increases the indentation by 2 spaces; the closing tag decreases it again. An opening tag with a missing closing tag is output with '/' appended to its name. For the rules of associating opening and closing tags, see HTML::ListScraper::shapeless.
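The indentation rules above can be sketched in plain Perl. This is an illustrative re-implementation, not the module's code: sketch_format is a hypothetical name, and the hash records with name, index and unclosed keys are stand-ins for HTML::ListScraper::Tag objects. It also assumes that a closing tag is printed at the same level as its opening tag, which the description leaves open.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the module's tag objects: each entry has a name
# ('li' opens, '/li' closes), an index, and an optional flag saying that an
# opening tag was identified as missing its closing tag.
sub sketch_format {
    my ($tags, %opt) = @_;
    my $depth = 0;
    my @out;
    for my $tag (@$tags) {
        my $name    = $tag->{name};
        my $closing = $name =~ m{^/};
        # Assumption: a closing tag steps back before it is printed.
        $depth -= 2 if $closing && $depth >= 2;
        my $line = '';
        $line .= $tag->{index} . "\t" if $opt{index};
        $line .= ' ' x $depth;
        $line .= '<' . $name . ($tag->{unclosed} ? '/' : '') . ">\n";
        push @out, $line;
        # Only an opening tag that has a closing tag indents what follows.
        $depth += 2 if !$closing && !$tag->{unclosed};
    }
    return @out;
}

my @tags = (
    { name => 'ul',  index => 0 },
    { name => 'li',  index => 1 },
    { name => 'br',  index => 2, unclosed => 1 },
    { name => '/li', index => 3 },
    { name => '/ul', index => 4 },
);
print sketch_format(\@tags, index => 1);
```

Under these assumptions the br line comes out as the index 2, a tab, four spaces and `<br/>`, while the following `</li>` returns to two spaces of indentation.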

When attr is set, links are formatted without whitespace and enclosed in double quotes. Double quotes in links are escaped, but no other characters are, which can also make the result invalid HTML. When text is set, the output text has normalized whitespace, and nodes containing only whitespace are dropped. A gap between adjacent tag positions is shown as an empty line. All values end with a newline.
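As a rough illustration of the link and text rules, the two helpers below show one way such cleanup could look. Both names are hypothetical and this is not the module's code. Two points are assumptions, because the description does not settle them: that "without whitespace" means whitespace inside the link is removed, and that double quotes are escaped with a backslash.

```perl
use strict;
use warnings;

# Hypothetical helper: remove whitespace from a link, escape double quotes
# (backslash form is an assumption), and wrap the result in double quotes.
sub sketch_quote_link {
    my ($href) = @_;
    $href =~ s/\s+//g;
    $href =~ s/"/\\"/g;
    return qq{"$href"};
}

# Hypothetical helper: collapse runs of whitespace to one space and trim.
# Returns undef for whitespace-only text so the caller can drop the node.
sub sketch_normalize_text {
    my ($text) = @_;
    $text =~ s/\s+/ /g;
    $text =~ s/^ | $//g;
    return length($text) ? $text : undef;
}

# Whitespace-only nodes are filtered out; each kept value ends with a newline.
my @lines = map  { "$_\n" }
            grep { defined }
            map  { scalar sketch_normalize_text($_) } ("  First\n item ", " \n\t ");
print @lines;
print sketch_quote_link('http://example.com/a b'), "\n";
```

Note that only double quotes are touched, so characters such as `<` or `&` in a link pass through unchanged, which matches the caveat about invalid HTML.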

canonicalize_tags

Undoes the formatting done by format_tags. Takes a list of lines, such as those output by format_tags when called without any formatting options, and converts them to a list of tag names. Note that canonicalize_tags does not handle attributes, text lines, or index numbers.
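The reverse direction can be sketched the same way. sketch_canonicalize is a hypothetical helper, not the module's implementation. It assumes that closing tags keep their leading '/', that the trailing '/' marking a missing closing tag is dropped, and that lines which are not a bare tag (such as the empty gap lines) are skipped.

```perl
use strict;
use warnings;

# Hypothetical helper: strip the layout that an option-free format_tags call
# adds (indentation, angle brackets, trailing newline) and keep the tag name.
sub sketch_canonicalize {
    my @names;
    for my $line (@_) {
        # Optional leading '/', the name itself, optional trailing '/' marker.
        next unless $line =~ m{^\s*<(/?[^<>\s/]+)/?>\s*$};
        push @names, $1;
    }
    return @names;
}

my @names = sketch_canonicalize(
    "<ul>\n", "  <li>\n", "    <br/>\n", "  </li>\n", "\n", "</ul>\n",
);
print join(' ', @names), "\n";   # prints: ul li br /li /ul
```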