
untemplate - analyze several HTML documents based on the same template

version 0.016

untemplate [options] HTML1 HTML2 [HTML3] [...]

Takes multiple HTML documents generated from the same template and attempts to extract only the data inserted into the original template.
Accepts URLs if AnyEvent::Net::Curl::Queued is installed.
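The idea of keeping only the data that varies between documents can be illustrated with a deliberately naive Python sketch. It is not the algorithm untemplate uses. It assumes well-formed XHTML input, and the helper names (text_by_path, varying_text) and sample pages are hypothetical.

```python
import xml.etree.ElementTree as ET


def text_by_path(root):
    """Map a positional path for every element to its non-empty text."""
    result = {}

    def walk(node, path):
        text = (node.text or "").strip()
        if text:
            result[path] = text
        counts = {}  # per-tag sibling counter, used to build positional steps
        for child in node:
            counts[child.tag] = counts.get(child.tag, 0) + 1
            walk(child, f"{path}/{child.tag}[{counts[child.tag]}]")

    walk(root, "/" + root.tag)
    return result


def varying_text(doc_a, doc_b):
    """Return paths present in both documents whose text differs."""
    a = text_by_path(ET.fromstring(doc_a))
    b = text_by_path(ET.fromstring(doc_b))
    return {p: (a[p], b[p]) for p in a.keys() & b.keys() if a[p] != b[p]}


# Two hypothetical pages built from the same template.
PAGE_1 = "<html><body><h1>Quotes</h1><p>first quote</p></body></html>"
PAGE_2 = "<html><body><h1>Quotes</h1><p>second quote</p></body></html>"

diff = varying_text(PAGE_1, PAGE_2)
print(diff)
```

The shared heading is treated as template text and dropped, while the paragraph that differs between the two pages is reported as data.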

Show this help text.
Specify the HTML document encoding (latin1, utf8). UTF-8 is assumed by default.
Enable syntax highlighting for XPath. By default, enabled automatically on interactive terminals.
Disable the --color option and highlight using HTML/CSS instead.
Enable the display of "partial" templates, that is, nodes present in only some of the documents. By default, only the nodes present in all documents are displayed.
Shrink the XPath to the minimal unique identifier. For example:
/html/body[@id='cpansearch']/form[@class='searchbox']/input[@name='query']
could be shortened to:
//input[@name='query']
The shrinking is enabled by default.
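As a rough check of why the shortened form is safe, the following Python sketch evaluates both expressions against a small hypothetical page and confirms that they select the same single node. It uses the limited XPath subset of the standard library's xml.etree.ElementTree, where paths are relative to the root element, so the leading /html becomes "." here. The markup is an assumption made for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page matching the XPath from the example above.
PAGE = """<html>
  <body id="cpansearch">
    <form class="searchbox">
      <input name="query"/>
      <input name="mode"/>
    </form>
  </body>
</html>"""

root = ET.fromstring(PAGE)  # root is the <html> element

# Full path, written relative to <html> as ElementTree requires.
full = "./body[@id='cpansearch']/form[@class='searchbox']/input[@name='query']"
# Shrunk path: any <input name="query"> descendant.
short = ".//input[@name='query']"

full_hits = root.findall(full)
short_hits = root.findall(short)

# Both expressions should resolve to exactly one node, and the same one.
print(len(full_hits), len(short_hits), full_hits[0] is short_hits[0])
```

The short form is only equivalent as long as the name attribute stays unique within the document, which is what "minimal unique identifier" refers to.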
Strict mode disables grouping by id, class or name attributes. The grouping is enabled by default.
Specify regex(es) to unmangle id/class attributes. Some CMSes (such as WordPress) insert unique identifiers into HTML elements, like:
<body class="post-id-12345">
This tends to break HTML tree analysis. To fix the above case, use --unmangle 'post-id-\d+'. Multiple unmanglers are accepted (--unmangle a --unmangle b).
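The effect of an unmangling regex can be sketched in Python as follows. This assumes that the matched part of the attribute value is simply dropped before documents are compared, which may differ from what untemplate does internally. The unmangle helper and the sample class values are hypothetical.

```python
import re


def unmangle(attr_value, patterns):
    """Strip every match of the given regexes from an attribute value."""
    for pattern in patterns:
        attr_value = re.sub(pattern, "", attr_value)
    # Collapse the whitespace left behind by removed class names.
    return " ".join(attr_value.split())


patterns = [r"post-id-\d+"]

# Class attributes of <body> taken from two hypothetical posts.
first = unmangle("single post-id-12345", patterns)
second = unmangle("single post-id-67890", patterns)

print(first, second, first == second)
```

Once the per-document identifier is stripped, both elements carry the same class value and can be treated as the same template node.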

untemplate --color http://bash.org/?1839 http://bash.org/?2486 | less -R

If the HTML documents are not based on the same template, the results will be empty.
Unfortunately, using any kind of document identifier as part of an element's class/id (a common practice in WordPress themes) is enough for the documents to count as "not the same template".
See the --unmangle option for a work-around.

Stanislaw Pusep <stas@sysd.org>

This software is copyright (c) 2013 by Stanislaw Pusep.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.