Download:
Marpa-HTML-0.112000.tar.gz

NAME

html_fmt - Reformat HTML, indented according to structure

SYNOPSIS

    html_fmt [uri|file]

EXAMPLE

    html_fmt http://perl.org

DESCRIPTION

Given a URI or the name of a file, html_fmt writes the document to STDOUT, reformatted and indented according to its HTML structure. Missing start and end tags are supplied, and comments are added to mark each one. Text inside <pre> elements is not altered.

html_fmt tries to parse everything that is actually out there on the Web. In fact, html_fmt will assume any file fed to it was intended as HTML, and will produce its best guess of the author's intent.

html_fmt supplies missing start and end tags, and its parser is extremely liberal in what it accepts. When even this liberal reading of the standards is not enough to make a document valid HTML, html_fmt picks characters to treat as noise, or "cruft". The parser ignores cruft when determining the structure of the document.

When html_fmt adds a missing start tag, it precedes the new start tag with a comment. When html_fmt adds a missing end tag, it follows the new end tag with a comment. When html_fmt classifies characters as "cruft", it adds a comment to that effect before the "cruft".
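The three comment-placement rules above can be illustrated with a minimal Python sketch. It is not html_fmt's own (Perl) code: the `annotate` function and the item kinds `present`, `added_start`, `added_end` and `cruft` are hypothetical names chosen for illustration, and indentation is ignored. The comment texts are the ones that appear in the sample output in this document.

```python
from typing import List, Tuple

# Comment texts as they appear in html_fmt's sample output.
FOLLOWING = "<!-- Following start tag is replacement for a missing one -->"
PRECEDING = "<!-- Preceding end tag is replacement for a missing one -->"
CRUFT = "<!-- Next line is cruft -->"


def annotate(items: List[Tuple[str, str]]) -> List[str]:
    """Place comments around (kind, text) items.

    kind is a hypothetical label: 'present', 'added_start',
    'added_end' or 'cruft'. This is an assumption of the sketch,
    not part of html_fmt.
    """
    lines: List[str] = []
    for kind, text in items:
        if kind == "added_start":
            # Comment goes before a supplied start tag.
            lines += [FOLLOWING, text]
        elif kind == "added_end":
            # Comment goes after a supplied end tag.
            lines += [text, PRECEDING]
        elif kind == "cruft":
            # Comment goes before the cruft.
            lines += [CRUFT, text]
        else:
            lines.append(text)
    return lines


if __name__ == "__main__":
    demo = [
        ("added_start", "<html>"),
        ("present", "<title>"),
        ("cruft", '<head attr="I am cruft">'),
        ("added_end", "</html>"),
    ]
    print("\n".join(annotate(demo)))
```

The sketch only models where each comment lands relative to the tag or cruft it describes. It does not parse HTML or decide which tags are missing.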

pre elements receive special treatment. The contents of pre elements are not reformatted. When missing tags or cruft occur inside a pre element, the comments to that effect are placed before the <pre> start tag.

The argument to html_fmt can be either a URI or a file name. If it starts with alphanumerics followed by a colon, it is treated as a URI. Otherwise it is treated as a file name.
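The URI-versus-file rule can be written as a single pattern match. The Python sketch below is only an illustration of the rule as worded above, not the program's actual Perl code. The name `looks_like_uri` is hypothetical, and reading "alphanumerics" as ASCII letters and digits is an assumption.

```python
import re

# One or more ASCII alphanumerics followed by a colon, anchored at the start.
# Treating "alphanumerics" as [A-Za-z0-9] is an assumption of this sketch.
URI_PREFIX = re.compile(r"[A-Za-z0-9]+:")


def looks_like_uri(argument: str) -> bool:
    """Return True if the argument would be treated as a URI under the stated rule."""
    return URI_PREFIX.match(argument) is not None


if __name__ == "__main__":
    for arg in ("http://perl.org", "index.html", "./notes:draft.html"):
        kind = "URI" if looks_like_uri(arg) else "file"
        print(f"{arg} -> {kind}")
```

Under the rule as stated, any argument with a leading alphanumeric run and a colon counts as a URI. A file whose name has that shape could be passed with a leading `./`, as in the last example, so that it is read as a file.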

SAMPLE OUTPUT

Given this input:

    <title>Test page<tr>x<head attr="I am cruft"><p>Final graf

html_fmt returns:

    <!-- Following start tag is replacement for a missing one -->
    <html>
      <!-- Following start tag is replacement for a missing one -->
      <head>
        <title>
          Test page
        </title>
        <!-- Preceding end tag is replacement for a missing one -->
      </head>
      <!-- Preceding end tag is replacement for a missing one -->
      <!-- Following start tag is replacement for a missing one -->
      <body>
        <!-- Following start tag is replacement for a missing one -->
        <table>
          <!-- Following start tag is replacement for a missing one -->
          <tbody>
            <tr>
              <!-- Following start tag is replacement for a missing one -->
              <td>
                x
                <!-- Next line is cruft -->
                <head attr="I am cruft">
                <p>
                  Final graf
                </p>
                <!-- Preceding end tag is replacement for a missing one -->
              </td>
              <!-- Preceding end tag is replacement for a missing one -->
            </tr>
            <!-- Preceding end tag is replacement for a missing one -->
          </tbody>
          <!-- Preceding end tag is replacement for a missing one -->
        </table>
        <!-- Preceding end tag is replacement for a missing one -->
      </body>
      <!-- Preceding end tag is replacement for a missing one -->
    </html>
    <!-- Preceding end tag is replacement for a missing one -->
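The layout of the sample, with each nesting level indented by two more spaces, can be mimicked by a short recursive walk over a tree. This Python sketch is an illustration only: the `render` function and the `(tag, children)` tuple representation are hypothetical, and html_fmt itself builds its structure with Marpa::HTML rather than from a ready-made tree.

```python
from typing import List, Tuple, Union

# A node is either a text string or a (tag_name, children) tuple.
# This representation is an assumption of the sketch.
Node = Union[str, Tuple[str, list]]


def render(node: Node, depth: int = 0, indent: str = "  ") -> List[str]:
    """Return the lines for a node, indented by nesting depth."""
    pad = indent * depth
    if isinstance(node, str):
        return [pad + node]
    tag, children = node
    lines = [f"{pad}<{tag}>"]
    for child in children:
        # Children sit one level deeper than their parent element.
        lines.extend(render(child, depth + 1, indent))
    lines.append(f"{pad}</{tag}>")
    return lines


if __name__ == "__main__":
    tree = ("head", [("title", ["Test page"])])
    print("\n".join(render(tree)))
```

For the small tree in the example, the output has the same shape as the head element in the sample above, minus the comments about supplied tags.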

PURPOSE

This program is a demo of a demo. Its purpose is to show how easy it is to write applications that look at the structure of web pages using Marpa::HTML. The purpose of Marpa::HTML, in turn, is to demonstrate the power of its parse engine, Marpa. Marpa::HTML was written in a few days, and its logic is a straightforward, natural expression of the structure of HTML.

ACKNOWLEDGMENTS

The starting template for this code was HTML::TokeParser, by Gisle Aas. See also the acknowledgments for Marpa as a whole.

LICENSE AND COPYRIGHT

Copyright 2007-2010 Jeffrey Kegler, all rights reserved. Marpa is free software under the Perl license. For details see the LICENSE file in the Marpa distribution.
