HTML::Detoxifier - practical module to strip harmful HTML
    use HTML::Detoxifier qw<detoxify>;

    my $clean_html = detoxify $html;

    my $cleaner_html = detoxify($html, disallow => [qw(dynamic images document)]);

    my $stripped_html = detoxify($html, disallow => [qw(everything)]);
HTML::Detoxifier is a practical module to remove harmful tags from HTML input. It's intended to be used for web sites that accept user input in the form of HTML and then present that information in some form.
In addition to this main purpose, HTML::Detoxifier cleans up some common mistakes in HTML: all tags are closed, empty tags are converted to valid XML (that is, with a trailing /), and images lacking the ALT text that HTML 4.0 requires are given a plain ALT attribute. The module does its best to emit valid XHTML 1.0; it even adds XML declarations and DOCTYPE declarations where needed.
The following groups can be disallowed or allowed as you choose. Some tags belong to more than one group. In these cases, every group that contains the tag must be allowed, or the tag will be removed.
Markup that defines the basic structure of a document (e.g. html, head, body).
Markup that alters the appearance of text (e.g. strong, strike, b, i, em).
Markup that can alter the size of text (e.g. big, small).
Most block-level markup as defined in the HTML4 specification.
Markup used to create fill-in forms.
Markup that creates tables or otherwise controls page layout.
Markup that creates images.
Markup that creates "annoying" effects that most web users find undesirable (e.g. marquee, blink).
Seldom-used, typically harmless HTML tags that specify special types of inline text (e.g. abbr, dd, span).
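The rule for tags that belong to more than one group can be sketched in plain Perl. This is only an illustration of the rule as described above, not the module's code: the `%groups_for` table, the `tag_survives` helper, and the made-up `widget` tag are all hypothetical, and dropping unknown tags is an assumption of the sketch.

```perl
use strict;
use warnings;

# Hypothetical tag-to-group table, for illustration only. The module's real
# table is internal and may differ; "widget" is a made-up tag that sits in
# two groups at once.
my %groups_for = (
    img    => [qw(images)],
    body   => [qw(document)],
    widget => [qw(images dynamic)],
);

# A tag survives only if none of the groups it belongs to is disallowed.
sub tag_survives {
    my ($tag, @disallowed) = @_;
    my %blocked = map { $_ => 1 } @disallowed;

    # Assumption of this sketch: tags missing from the table are dropped.
    my $groups = $groups_for{$tag} or return 0;

    my $hits = grep { $blocked{$_} } @$groups;   # number of blocked groups
    return $hits == 0 ? 1 : 0;
}

for my $tag (qw(img body widget)) {
    printf "%-6s %s\n", $tag,
        tag_survives($tag, 'dynamic') ? 'kept' : 'removed';
}
```

With only `dynamic` disallowed, `img` and `body` are kept, while `widget` is removed because one of its two groups is blocked.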
Call detoxify to detoxify HTML with the given options. The most common key in the options hash is disallow, which disallows certain features of HTML. Its value is a reference to an array of strings naming the groups to disallow; see above for the list of acceptable values. You may also specify allow_only, which has the same syntax but performs the reverse action: only the specified tag sets are allowed. If no options are specified, only dynamic content is removed.
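As a minimal sketch of the two option styles, the calls below use only the group names that appear in the synopsis (dynamic, images). The sample input string is made up for illustration, and the exact output depends on the module's internal tag tables, so none is shown here.

```perl
use strict;
use warnings;
use HTML::Detoxifier qw<detoxify>;

# Made-up sample input for illustration.
my $html = '<p>Hello <img src="cat.png"> <script>alert(1)</script></p>';

# No options: only dynamic content is removed.
my $default = detoxify($html);

# disallow: remove the listed groups, keep everything else.
my $no_images = detoxify($html, disallow => [qw(dynamic images)]);

# allow_only: the reverse action, keeping only the listed groups.
my $images_only = detoxify($html, allow_only => [qw(images)]);

print "$_\n" for $default, $no_images, $images_only;
```

The disallow form suits sites that trust most markup and want to block a few categories; allow_only is the stricter choice when only a small, known set of groups should get through.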
If you want to detoxify a document in multiple stages, set the section key in the options hash to 'first' for the first page and to 'next' for every subsequent page. This postpones the tag-closing mechanism until you pass 'last' as the value of the section key.
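A staged run might look like the sketch below. The `@pages` array and its contents are hypothetical, and the loop assumes at least two pages, since the text does not say which value to pass when a document has only one.

```perl
use strict;
use warnings;
use HTML::Detoxifier qw<detoxify>;

# Hypothetical input: one document split across three pages, with tags
# that open on one page and close on a later one.
my @pages = (
    '<p>Page one opens <b>bold text',
    'that continues on page two',
    'and ends on page three.</b></p>',
);

my @clean;
for my $i (0 .. $#pages) {
    # 'first' for the first page, 'last' for the final one, 'next' between.
    my $section = $i == 0       ? 'first'
                : $i == $#pages ? 'last'
                :                 'next';

    push @clean, detoxify($pages[$i], section => $section);
}

print join("\n", @clean), "\n";
```

Passing 'last' on the final page is what lets the module close any tags still open from earlier pages.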
Patrick Walton <firstname.lastname@example.org>
Copyright (c) 2004 Patrick Walton. You may redistribute this module under the same terms as Perl itself. For more information, see the appropriate LICENSE file.