NAME

HTML::Sanitizer - HTML Sanitizer

SYNOPSIS

  use HTML::Sanitizer;

  my $safe = HTML::Sanitizer->new;

  $safe->permit_only(
        qw/ strong em /,
        a => {
                href => qr/^(?:http|ftp):/,
                title => 1,
        },
        img => {
                src => qr/^(?:http|ftp):/,
                alt => 1,
        },
        b => HTML::Element->new('strong'),
  );

  my $sanitized = $safe->filter_html_fragment($evil_html);

  # or

  my $tree = HTML::TreeBuilder->new->parse_file($filename);
  $safe->sanitize_tree($tree);

ABSTRACT

This module acts as a filter for HTML. It is not a validator, though it might be possible to write a validator-like tool with it. It's intended to strip out unwanted HTML elements and attributes and leave you with non-dangerous HTML code that you should be able to trust.

DESCRIPTION

First, though this module attempts to strip out unwanted HTML, I make no guarantee that it will be unbeatable. Tread lightly when using untrusted data. Also take note of the low version number.

RULE SETUP

See the "RULE SETS" section below for details on what a rule set actually is. This section documents the methods you'd use to set one up.

new(...)

Creates a new HTML::Sanitizer object, using the given ruleset. Alternatively, a ruleset can be built piecemeal using the permit/deny methods described below.

See the section on "RULE SETS" below to see how to construct a filter rule set. An example might be:

  $safe = HTML::Sanitizer->new(
     strong => 1,                       # allow <strong>, <em> and <p>
     em => 1,
     p => 1,
     a => { href => qr/^http:/ },       # allow HTTP links
     b => HTML::Element->new('strong'), # convert <b> to <strong>
     '*' => 0,                          # disallow everything else
  );

permit(...)

Accepts a list of rules and assumes each rule will have a true value. This allows you to be a little less verbose, since your rule sets can look like this instead of a large data structure:

  $safe->permit( qw/ strong em i b br / );

Though you're still free to include attributes and more complex validation requirements, if you still need them:

  $safe->permit( img => [ qw/ src alt / ], ... );

  $safe->permit( a => { href => qr/^http:/ }, 
                 blockquote => [ qw/ cite id / ], 
                 b => HTML::Element->new('strong'),
                 qw/ strong em /);

The value to each element should be an array, hash or code reference, or an HTML::Element object, since the '=> 1' is always implied otherwise.

permit_only(...)

Like permit, but also assumes a default 'deny' policy. This is equivalent to including this in your ruleset as passed to new():

  '*' => undef

This will destroy any existing rule set in favor of the one you pass it.

If you would rather use a default 'ignore' policy, you could do something like this:

  $safe->permit_only(...);
  $safe->ignore('*');

deny(...)

Like permit, but assumes each case will have a 'false' value by assuming a '=> undef' for each element that isn't followed by an array reference. This will cause any elements matching these rules to be stripped from the document tree (along with any child elements). You cannot pass a hash reference of attributes, a code reference or an HTML::Element object as a value to an element, as in permit. If you need more complex validation requirements, follow up with a permit() call or define them in your call to new().

  $safe->deny( a => ['href'], qw/ img object embed script style /);

deny_only(...)

Like deny, but assumes a default 'permit' policy. This is equivalent to including this in your ruleset:

  '*' => { '*' => 1 }   # allow all elements and all attributes

This will destroy any existing rule set in favor of the one you pass it.

ignore(...)

Very similar to deny, this will cause a rule with an implied '=> 0' to be created for the elements passed. Matching elements will be replaced with their child elements, with the element itself being removed from the document tree.

ignore_only(...)

Like ignore, but assumes a default 'permit' policy. See 'deny_only'.
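
As a minimal sketch of how the permit, ignore and deny families might be combined, assuming HTML::Sanitizer and HTML::TreeBuilder are installed. The input string is made up for illustration, and the exact serialization of the result is not shown here because it depends on the underlying HTML::Element version.

```perl
use strict;
use warnings;
use HTML::Sanitizer;

my $safe = HTML::Sanitizer->new;

# Start from a default 'deny' policy and allow a few harmless elements.
$safe->permit_only(qw/ p em strong /);

# 'ignore' removes the tag but keeps its children (rule value 0).
$safe->ignore(qw/ font span /);

# 'deny' removes the tag together with its children (rule value undef).
# These two are already covered by the default policy; they are listed
# here only to make the intent explicit.
$safe->deny(qw/ script style /);

# Illustrative input only.
my $dirty = '<p><font color="red">Hello</font> <em>world</em>'
          . '<script>alert(1)</script></p>';

my $clean = $safe->filter_html_fragment($dirty);
print "$clean\n";
```

Going by the rule descriptions in this document, the text inside the font tag should survive while the script element and its contents should not.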

FILTER METHODS

sanitize_tree($tree)

This runs the filter on a parse tree, as generated by HTML::TreeBuilder. This WILL modify $tree. This function is used by the filter functions below, so you don't have to deal with HTML::TreeBuilder unless you want to.

filter_html($html)

Filters an HTML document using the configured rule set.

filter_html_fragment($html)

Filters an HTML fragment. Use this if you're filtering a chunk of HTML that you're going to end up using within an existing document. (In other words, it operates on $html as if it were a complete document, but only ends up working on children of the <body> tag.)

filter_xml($xml)
filter_xml_fragment($xml)

Like above, but operates on the data as though it were well-formed XML. Use this if you intend on providing XHTML, for example.

When the above functions encounter an attribute they're meant to filter, the attribute will be deleted from the element, but the element will survive. If you need to delete the entire element when an attribute doesn't pass validation, set up a coderef for the element in your rule set and use HTML::Element methods to manipulate the element (e.g. by calling $element->delete or $element->replace_with_content if $element->attr('href') doesn't pass muster).
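
The HTML::Element calls mentioned above can be tried on their own, outside of any rule set. The following is a rough sketch that works directly on an HTML::TreeBuilder tree; the markup and the http-only test are illustrative assumptions. It does not show how HTML::Sanitizer treats the return value of a coderef that has already modified the tree, since that is not specified here.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Illustrative markup: one link with a bad href, one with an acceptable href.
my $tree = HTML::TreeBuilder->new_from_content(
    '<p>Visit <a href="javascript:alert(1)">my <em>site</em></a> or '
  . '<a href="http://example.com/">this one</a>.</p>'
);

for my $link ($tree->look_down(_tag => 'a')) {
    my $href = $link->attr('href');
    next if defined $href && $href =~ /^http:/;

    # Keep the link text but drop the <a> itself. replace_with_content
    # returns the now-empty element, which is then destroyed.
    # Calling $link->delete instead would discard the text as well.
    $link->replace_with_content->delete;
}

my $out = $tree->as_HTML;
print "$out\n";
```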

RULE SETS

A rule set is simply a list of elements and/or attributes and values indicating whether those elements/attributes will be allowed, ignored, or stripped from the parse tree. Generally rule sets should be passed to new() at object creation time, though they can also be built piecemeal through calls to permit, deny and/or ignore as described above.

Each element in the list should be followed by one of the following:

a 'true' value

This indicates the element should be permitted as-is with no filtering or modification (aside from any other filtering done to child elements).

0

If a zero (or some other defined, false value) is given, the element itself is deleted but child elements are brought up to replace it. Use this when you wish to filter a bad formatting tag while preserving the text it was formatting, for example.

undef

If an undef is given, the element and all of its children will be deleted. This would remove a scripting tag and all of its contents from the document tree, for example.

an HTML::Element object

A copy of this object will replace the element matching the rule. The attributes in the replacement object will overlay the attributes of the original object (after attribute filtering has been done through the _ rule). If this element contains any child elements, they will replace the children of the element fitting the rule. If you wish to delete the content without necessarily providing any replacement, create a child that's simply an empty text node.

a code reference

This would permit the element if, and only if, the coderef returned a true value. The HTML::Element object in question is passed as the first and only argument.

a hash reference

This implies the element itself is OK, but that some additional checking of its attribute list is needed. This hash reference should contain keys of attributes and values that in turn should be one of:

a 'true' value

This would preserve the attribute.

a 'false' value

This would delete the attribute.

a regular expression

This would preserve the attribute if the regular expression matched.

a code reference

This would permit the attribute if and only if the coderef returned a true value. The HTML::Element object, the attribute name and attribute value are passed as arguments. $_ is also set to the attribute value (which can be modified).
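
To get a feel for the four kinds of attribute rule, they can be exercised in isolation. The helper below, apply_attr_rule, is hypothetical and is not part of HTML::Sanitizer; it only mimics the semantics described above (including $_ being set to the attribute value) so that candidate rules can be tested against sample values. No element object is needed for these particular rules, so undef is passed in its place.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT the module's implementation.
# Returns the (possibly modified) value to keep, or undef to delete.
sub apply_attr_rule {
    my ($rule, $element, $name, $value) = @_;

    if (ref $rule eq 'Regexp') {
        return $value =~ $rule ? $value : undef;
    }
    if (ref $rule eq 'CODE') {
        local $_ = $value;    # coderefs see (and may modify) the value in $_
        return $rule->($element, $name, $value) ? $_ : undef;
    }
    return $rule ? $value : undef;    # plain true or false value
}

my %rules = (
    title => 1,                       # always keep
    style => 0,                       # always delete
    href  => qr/^(?:http|ftp):/,      # keep only if the pattern matches
    id    => sub { $_ = "x-$_" },     # keep, with a prefix added
);

my %attrs = (
    title => 'Home',
    style => 'color:red',
    href  => 'javascript:alert(1)',
    id    => 'main',
);

for my $name (sort keys %attrs) {
    my $kept = apply_attr_rule($rules{$name}, undef, $name, $attrs{$name});
    printf "%-5s %s\n", $name, defined $kept ? "kept as '$kept'" : 'deleted';
}
```

With these sample values, href and style are deleted, title is kept unchanged, and id is kept as x-main.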

EXAMPLES

Here is a sample rule set, which might do a fair job at stripping out potentially dangerous tags, though I put this together without too much thought, so I wouldn't rely on it:

  'script'          => undef,
  'style'           => undef,
  '*'               => {
        onclick     => 0,
        ondblclick  => 0,
        onselect    => 0,
        onmousedown => 0,
        onmouseup   => 0,
        onmouseover => 0,
        onmousemove => 0,
        onmouseout  => 0,
        onfocus     => 0,
        onblur      => 0,
        onkeypress  => 0,
        onkeydown   => 0,
        onkeyup     => 0,
        onload      => 0,
        onunload    => 0,
        onerror     => 0,
        onsubmit    => 0,
        onreset     => 0,
        onchange    => 0,
        style       => 0,
        href        => qr/^(?!(?:java)?script)/i,
        src         => qr/^(?!(?:java)?script)/i,
        cite        => sub { !/^(?:java)?script/i },  # same thing, mostly
        '*'         => 1,
  },
  'link'            => {
        rel         => sub { not_member("stylesheet", @_) },
  },
  'object'          => 0,       # strip but let children show through
  'embed'           => undef,
  'iframe'          => undef,
  'frameset'        => undef,
  'frame'           => undef,
  'applet'          => undef,
  'noframes'        => 0,
  'noscript'        => 0,

  # use a function like this to do some additional validation:

  sub not_member { !/\b\Q$_[0]\E\b/i; } # maybe substitute it out instead

A web site incorporating user posts might want something a little more strict:

  em           => 1,
  strong       => 1,
  p            => 1,
  ol           => 1,
  ul           => 1,
  li           => 1,
  tt           => 1,
  a            => 1,
  img          => 1,
  span         => 1,
  blockquote   => { cite => 1 },
  _            => {      # for all tags above, these attribute rules apply:
      href     => qr/^(?:http|ftp|mailto|sip):/i,
      src      => qr/^(?:http|ftp|data):/i,
      title    => 1,
                  # Maybe add an x- prefix to all ID's to avoid collisions
      id       => sub { $_ = "x-$_" },
      'xml:lang' => 1,
      lang     => 1,
      '*'      => 0,
  },
  '*'          => 0,     # everything else is 'ignored'
  script       => undef, # except these, which are stripped along with children
  style        => undef,

Note the use of the _ element here, which is magic in that it allows you to set up some global attributes while still leaving the * element free to express a default 'deny' policy. The attributes specified here will be applied to all of the explicitly defined elements (em, strong, etc.), but they will not be applied to elements not present in the ruleset.

Attribute rules are consulted in order of precedence: first the tag-specific rule, then the special "_" tag, and finally the special "*" tag.
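
The lookup order can be pictured with a small sketch. The function find_attr_rule is hypothetical and is not how the module is necessarily implemented; it only walks the three levels in the order described above, and applies the "_" level only to elements that appear in the rule set. How a '*' attribute wildcard inside a level interacts with this order is not stated here, so the sketch ignores wildcards.

```perl
use strict;
use warnings;

# Hypothetical lookup, for illustration only.
sub find_attr_rule {
    my ($rules, $tag, $attr) = @_;

    # "_" applies only to elements explicitly present in the rule set.
    my @levels = exists $rules->{$tag} ? ($tag, '_', '*') : ('*');

    for my $level (@levels) {
        my $attrs = $rules->{$level};
        next unless ref $attrs eq 'HASH';    # e.g. a => 1 has no attribute rules
        return $attrs->{$attr} if exists $attrs->{$attr};
    }
    return;    # no rule found
}

my %rules = (
    a          => 1,
    blockquote => { cite => 1 },
    _          => { href => qr/^(?:http|ftp):/i, title => 1 },
    '*'        => 0,
);

for my $case ([ blockquote => 'cite' ], [ a => 'title' ], [ font => 'title' ]) {
    my ($tag, $attr) = @$case;
    my $rule = find_attr_rule(\%rules, $tag, $attr);
    printf "%s/%s: %s\n", $tag, $attr, defined $rule ? "rule $rule" : 'no rule';
}
```

In this sketch the cite rule comes from the blockquote entry, the title rule for a falls through to the "_" entry, and font finds no rule at all because it is not listed in the rule set.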

The following might be a simple way to force a 'b' tag to become a 'strong' tag, with the text within it surviving:

  b => HTML::Element->new('strong');

Here's how you might strip out a 'script' tag while letting the user know something is up:

  script => HTML::Element
        ->new('p', class => 'script_warning')
        ->push_content("Warning: A <script> tag was removed!");

OTHER CONSIDERATIONS

This module just deals with HTML tags. There are other ways of injecting potentially harmful code into documents, including CSS, faking out an img or object tag, etc. Without extending this module to include a CSS parser, for example, addressing these cases will be difficult. It's recommended that tags and attributes like this simply be stripped.

If you're using this to sanitize code provided by a 3rd party, also check to ensure that you're either matching character sets, or converting as necessary.
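
One way to convert between character sets is the core Encode module. This is only a sketch: it assumes the third party sends ISO-8859-1 bytes and that your own pages are UTF-8, neither of which comes from this module. Where exactly the conversion belongs relative to the filter call is left open.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumption: incoming data is ISO-8859-1; \xE9 is a Latin-1 e-acute.
my $latin1_bytes = "<p>caf\xE9</p>";

my $chars      = decode('ISO-8859-1', $latin1_bytes);  # bytes -> Perl characters
my $utf8_bytes = encode('UTF-8', $chars);              # characters -> UTF-8 bytes

printf "%d bytes in, %d bytes out\n",
    length($latin1_bytes), length($utf8_bytes);
```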

BUGS

This release has no known bugs, but prior releases may have contained bugs that were fixed with this release. See http://rt.cpan.org/ for details.

SEE ALSO

HTML::TreeBuilder, HTML::Element, HTML::Parser, Safe

AUTHOR

Copyright (c) 2003 David Nesting. All Rights Reserved.

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

This program is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose.
