NAME

untemplate - analyze several HTML documents based on the same template

VERSION

version 0.017

SYNOPSIS

    untemplate [options] HTML1 HTML2 [HTML3] [...]

DESCRIPTION

Takes multiple HTML documents generated from the same template and attempts to extract only the data that was inserted into the original template.

Accepts URLs as input if AnyEvent::Net::Curl::Queued is installed.
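
The general idea can be illustrated with a minimal Python sketch. This is not how untemplate is implemented; it only assumes that two well-formed documents share a structure and that the "inserted data" is the text that differs at the same position. The helper names (text_by_path, varying_text) and the two sample pages are hypothetical.

```python
import xml.etree.ElementTree as ET


def text_by_path(root):
    """Map a simple positional path to the stripped text of each element."""
    result = {}

    def walk(node, path):
        text = (node.text or "").strip()
        if text:
            result[path] = text
        counts = {}  # per-tag sibling counter, used for the [n] index
        for child in node:
            counts[child.tag] = counts.get(child.tag, 0) + 1
            walk(child, f"{path}/{child.tag}[{counts[child.tag]}]")

    walk(root, "/" + root.tag)
    return result


def varying_text(doc_a, doc_b):
    """Return the paths found in both documents whose text differs."""
    a = text_by_path(ET.fromstring(doc_a))
    b = text_by_path(ET.fromstring(doc_b))
    return {p: (a[p], b[p]) for p in a.keys() & b.keys() if a[p] != b[p]}


# Hypothetical pages built from the same template.
PAGE_1 = "<html><body><h1>Quotes</h1><p>first quote</p></body></html>"
PAGE_2 = "<html><body><h1>Quotes</h1><p>second quote</p></body></html>"

print(varying_text(PAGE_1, PAGE_2))
```

The shared heading is treated as part of the template and dropped, while the paragraph text, which differs between the two pages, is reported as data.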

OPTIONS

--help

Display this help text.

--encoding=name

Specify the HTML document encoding (latin1, utf8). UTF-8 is assumed by default.

--[no]color

Enable syntax highlighting for XPath expressions. By default, it is enabled automatically on interactive terminals.

--16

Use the 16 system colors. By default, untemplate tries to use the 256-color ANSI palette.

--[no]html

Highlight using HTML/CSS instead of ANSI colors. This disables the --color option.

--[no]partial

Enable the display of "partial" templates, that is, nodes present in only some of the documents. By default, only the nodes present in all documents are displayed.
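
The distinction between "present in all documents" and "present in some documents" can be pictured as set operations over node paths. The Python sketch below is only an illustration under that assumption; the function name split_paths and the sample paths are hypothetical and do not reflect untemplate's internals.

```python
def split_paths(docs):
    """docs: a non-empty list of sets of node paths, one set per document."""
    common = set.intersection(*docs)         # nodes found in every document
    partial = set.union(*docs) - common      # nodes found in only some documents
    return common, partial


# Hypothetical node paths collected from three documents.
docs = [
    {"//h1", "//p[@class='quote']", "//div[@class='ad']"},
    {"//h1", "//p[@class='quote']"},
    {"//h1", "//p[@class='quote']", "//div[@class='ad']"},
]

common, partial = split_paths(docs)
print(sorted(common))
print(sorted(partial))
```

In this picture, the default output corresponds to the first set, and --partial additionally shows the second.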

--[no]shrink

Shrink the XPath to the minimal unique identifier. For example:

    /html/body[@id='cpansearch']/form[@class='searchbox']/input[@name='query']

This could be shortened to:

    //input[@name='query']

The shrinking is enabled by default.
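
To see why the shortened form is safe when the attribute is unique, the following Python sketch checks that both expressions select the same element. It uses the standard library's ElementTree, which supports only a subset of XPath and evaluates paths relative to the element being searched, so the leading /html step is dropped. The sample page is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical page matching the structure of the paths above.
PAGE = """
<html>
  <body id="cpansearch">
    <form class="searchbox">
      <input name="query"/>
      <input name="mode"/>
    </form>
  </body>
</html>
""".strip()

root = ET.fromstring(PAGE)  # root is the <html> element

# Full path, written relative to <html> because ElementTree
# does not accept absolute paths on an element.
full = root.findall("./body[@id='cpansearch']/form[@class='searchbox']/input[@name='query']")

# Shortened path: any <input> descendant with name="query".
short = root.findall(".//input[@name='query']")

print(len(full), len(short), full[0] is short[0])
```

If a second input with name="query" existed elsewhere in the page, the short form would match both elements and would no longer be a unique identifier.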

--[no]strict

Strict mode disables grouping by id, class or name attributes. The grouping is enabled by default.

--unmangle=regex

Specify regex(es) to unmangle id/class attributes. Some CMSes (e.g. WordPress) insert unique identifiers into HTML elements, like:

    <body class="post-id-12345">

This tends to break HTML tree analysis. To fix the above case, use --unmangle 'post-id-\d+'. Multiple unmanglers are accepted (--unmangle a --unmangle b).
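
The effect of such a pattern can be sketched in Python. This is an assumption about the intent, not a description of what untemplate does internally: the sketch simply deletes whatever the pattern matches, so that attribute values differing only in the document identifier compare equal. The function name unmangle and the extra "single" class are hypothetical.

```python
import re


def unmangle(attr_value, patterns):
    """Remove every match of each pattern, then normalize whitespace."""
    for pattern in patterns:
        attr_value = re.sub(pattern, "", attr_value)
    return " ".join(attr_value.split())


patterns = [r"post-id-\d+"]

# Two class attributes that differ only in the per-document identifier.
a = unmangle("single post-id-12345", patterns)
b = unmangle("single post-id-67890", patterns)

print(repr(a), repr(b), a == b)
```

Several patterns can be passed in the list, which mirrors repeating --unmangle on the command line.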

EXAMPLES

    untemplate --color http://bash.org/?1839 http://bash.org/?2486 | less -R

CAVEATS

If the HTML documents are not based on the same template, the results will be empty.

Unfortunately, using any kind of document identifier as part of an element's class/id (a common practice in WordPress themes) is enough for the documents to count as "not the same template".

See the --unmangle option for a work-around.

AUTHOR

Stanislaw Pusep <stas@sysd.org>

COPYRIGHT AND LICENSE

This software is copyright (c) 2013 by Stanislaw Pusep.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.
