HTML::Valid::Tagset - data tables useful in parsing HTML
    use HTML::Valid::Tagset ':all';
    for my $tag (qw/canvas a li moonshines/) {
        if ($isHTML5{$tag}) {
            print "<$tag> is ok\n";
        }
        else {
            print "<$tag> is not HTML5\n";
        }
    }
produces output
    <canvas> is ok
    <a> is ok
    <li> is ok
    <moonshines> is not HTML5
(This example is included as tagset-synopsis.pl in the distribution.)
This documents HTML::Valid::Tagset version 0.07 corresponding to git commit 9945d9d51a2d7c8a1e6e92e46c77dc93ee02a2ab released on Sat Dec 30 13:41:36 2017 +0900.
This Perl module is built on top of the "HTML Tidy" library version 5.6.0.
This module contains several data tables useful in various kinds of HTML parsing operations.
All tag names used are lowercase.
This is a drop-in replacement for HTML::Tagset. However, HTML::Valid::Tagset is mostly not based on HTML::Tagset. It uses the tables of HTML elements from a C program called "HTML Tidy" (this is not the Perl module HTML::Tidy).
As far as possible, this module tries to be compatible with HTML::Tagset. Incompatibilities with HTML::Tagset are discussed in "Issues with HTML::Tagset".
If you need to validate tags, you should use, for example, "%isHTML5" for HTML 5 tags, or "%isKnown" if you want to check whether a tag is a known one.
In the following documentation, a "hashset" is a hash being used as a set. The actual values associated with the keys are not significant.
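As a minimal sketch of the hashset idea in plain Perl, the following builds a small set and tests membership by key lookup. The %is_primary hash and its contents are made up for illustration and are not part of this module.

```perl
use strict;
use warnings;

# Hypothetical hashset, for illustration only. The value 1 is arbitrary;
# only the presence of the key matters.
my %is_primary = map { $_ => 1 } qw/red green blue/;

for my $colour (qw/red purple/) {
    if ($is_primary{$colour}) {
        print "$colour is in the set\n";
    }
    else {
        print "$colour is not in the set\n";
    }
}
```

The hashsets exported by this module are meant to be queried in the same way, with a tag name as the key.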
None of these variables are exported by default; see "EXPORTS". Each entry notes its compatibility with HTML::Tagset. In all cases, this refers to HTML::Tagset version 3.20.
@allTags contains all the HTML tags that this module knows of, as an array sorted in alphabetical order. Its contents are exactly the same as the keys of "%isKnown".
This is only in HTML::Valid::Tagset, not in HTML::Tagset.
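The relationship between "@allTags" and "%isKnown" described above can be checked with a short script. This is only a sketch: it assumes both variables can be imported by name, as "EXPORTS" suggests, and the helper variable names are illustrative.

```perl
use strict;
use warnings;
use HTML::Valid::Tagset qw/@allTags %isKnown/;

# Tags listed in @allTags that are not keys of %isKnown.
my @missing = grep { !exists $isKnown{$_} } @allTags;

# Keys of %isKnown that are not listed in @allTags.
my %in_all_tags = map { $_ => 1 } @allTags;
my @extra = grep { !$in_all_tags{$_} } keys %isKnown;

printf "%d tags in \@allTags, %d missing from %%isKnown, %d extra\n",
    scalar @allTags, scalar @missing, scalar @extra;
```

If the two really hold the same tags, both the "missing" and "extra" counts should be zero.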
This is copied from HTML::Tagset.
This hashset has as keys the tag names of elements that cannot have content, for example "base", "br", or "hr".
    use HTML::Valid::Tagset '%emptyElement';
    for my $tag (qw/hr dl br snakeeyes/) {
        if ($emptyElement{$tag}) {
            print "<$tag> is empty.\n";
        }
        else {
            print "<$tag> is not empty.\n";
        }
    }
outputs
    <hr> is empty.
    <dl> is not empty.
    <br> is empty.
    <snakeeyes> is not empty.
This is compatible with HTML::Tagset.
The hashset %isBlock contains all block elements.
The hashset %isBodyElement contains all elements that are to be found only in/under the "body" element of an HTML document.
This is compatible with the undocumented %HTML::Tagset::isBodyElement in HTML::Tagset and the documentation for %HTML::Tagset::isBodyMarkup. See also "Issues with HTML::Tagset". %isBodyMarkup is not implemented in HTML::Tagset, so it's not provided for compatibility here.
This hashset includes all elements whose content is CDATA.
This hashset contains all elements that are to be found only in/under a "form" element.
The hashset %isHeadElement contains elements that can be present in the 'head' section of an HTML document.
This is compatible with the contents of %HTML::Tagset::isHeadElement, but not its documentation. See also "Issues with HTML::Tagset".
This hashset includes all elements that can fall either in the head or in the body.
This hashset is true for elements which are part of the "HTML 2.0" standard.
This hashset is true for elements which are part of the "HTML 3.2" standard.
This hashset is true for elements which are part of the "HTML 4.01" standard.
    use utf8;
    use FindBin '$Bin';
    use HTML::Valid::Tagset '%isHTML5';
    if ($isHTML5{canvas}) {
        print "<canvas> is OK.\n";
    }
    if ($isHTML5{a}) {
        print "<a> is OK.\n";
    }
    if ($isHTML5{plaintext}) {
        print "OH NO!";
    }
    else {
        print "<plaintext> went out with scrambled eggs.\n";
    }
outputs
    <canvas> is OK.
    <a> is OK.
    <plaintext> went out with scrambled eggs.
%isHTML5 is true for elements which are valid HTML tags in "HTML5". It is not true for obsolete elements like the <plaintext> tag (see "%isObsolete"), or for proprietary elements such as the <blink> tag, which have never been part of any HTML standard (see "%isProprietary"). Further, some elements marked as neither obsolete nor proprietary are also absent from HTML5. For example, the <isindex> tag is not present in HTML5.
The hashset %isKnown lists all known HTML elements. See also "@allTags".
This hashset contains all elements that can contain "li" elements.
The hashset %isInline contains all inline elements. It is identical to %isPhraseMarkup.
    $isObsolete{canvas};    # Undefined
    $isObsolete{plaintext}; # True
This is true for HTML elements which were once part of HTML standards, like plaintext, but have now been declared obsolete. Note that %isObsolete is not true for elements like the <blink> tag which were never part of any HTML standard. See "%isProprietary" for these tags.
The hashset %isPhraseMarkup contains all inline elements. It is identical to %isInline.
%isProprietary is true for elements which are not part of any HTML standard, but were added by computer companies.
    use utf8;
    use FindBin '$Bin';
    use HTML::Valid::Tagset '%isProprietary';
    my @tags = qw/a blink plaintext marquee/;
    for my $tag (@tags) {
        if ($isProprietary{$tag}) {
            print "<$tag> is proprietary.\n";
        }
        else {
            print "<$tag> is not a proprietary tag.\n";
        }
    }
outputs
    <a> is not a proprietary tag.
    <blink> is proprietary.
    <plaintext> is not a proprietary tag.
    <marquee> is proprietary.
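The hashsets "%isHTML5", "%isObsolete", "%isProprietary" and "%isKnown" can be combined to describe the status of a tag. The following is a rough sketch only: the tag_status routine and the labels it returns are made up for this example and are not part of the module.

```perl
use strict;
use warnings;
use HTML::Valid::Tagset qw/%isHTML5 %isObsolete %isProprietary %isKnown/;

# Hypothetical helper: return a short label describing a tag's status.
# The checks run from the most specific hashset to the most general one.
sub tag_status
{
    my ($tag) = @_;
    return 'HTML5' if $isHTML5{$tag};
    return 'obsolete' if $isObsolete{$tag};
    return 'proprietary' if $isProprietary{$tag};
    return 'known, but not HTML5' if $isKnown{$tag};
    return 'unknown';
}

for my $tag (qw/canvas plaintext blink isindex moonshines/) {
    print "<$tag>: ", tag_status ($tag), "\n";
}
```

The order of the checks matters only if a tag appears in more than one hashset; the documentation above suggests that HTML5, obsolete and proprietary tags do not overlap.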
This hashset contains all elements that are to be found only in/under a "table" element.
Elements in the hashset %optionalEndTag are not empty (see "%emptyElement"), but their end-tags are generally, "safely", omissible.
    use HTML::Valid::Tagset qw/%optionalEndTag %emptyElement/;
    for my $tag (qw/li p a br/) {
        if ($optionalEndTag{$tag}) {
            print "OK to omit </$tag>.\n";
        }
        elsif ($emptyElement{$tag}) {
            print "<$tag> does not ever take '</$tag>'\n";
        }
        else {
            print "Cannot omit </$tag> after <$tag>.\n";
        }
    }
outputs
    OK to omit </li>.
    OK to omit </p>.
    Cannot omit </a> after <a>.
    <br> does not ever take '</br>'
my $attr = all_attributes ();
This returns an array reference containing all known attributes. The attributes are not sorted.
my $attr = attributes ('a');
This returns an array reference containing all valid attributes for the specified tag (as decided by the WWW Consortium). The attributes are not sorted. By default this returns the valid attributes for HTML 5.
It is also possible to choose a value for standard which specifies which standard one wants:
my $attr = attributes ('a', standard => 'html5');
Possible values for standard are
This returns valid attributes for "HTML5". This is the default.
This returns valid attributes for "HTML 4.01".
This returns valid attributes for "HTML 3.2".
This returns valid attributes for "HTML 2.0".
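As a sketch of how the standard parameter might be used, the following compares the attributes of the a tag under the default (HTML 5) with those under HTML 2.0. It assumes that the value 'html2', which appears in the "tag_attr_ok" example, is accepted here too, and it imports everything with :all rather than assuming the routine can be imported by name.

```perl
use strict;
use warnings;
use HTML::Valid::Tagset ':all';

# Attributes valid for <a> under the default standard and under HTML 2.0.
my $html5 = attributes ('a');
my $html2 = attributes ('a', standard => 'html2');

# Build a hashset of the HTML 2.0 attributes, then keep the default-standard
# attributes which are not in it. The results are sorted here because
# attributes () does not sort its return value.
my %in_html2 = map { $_ => 1 } @$html2;
my @added = sort grep { !$in_html2{$_} } @$html5;

print "Valid for <a> in HTML 5 but not in HTML 2.0:\n";
print "    $_\n" for @added;
```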
    my $ok = tag_attr_ok ('a', 'onmouseover');
    # $ok = 1
    $ok = tag_attr_ok ('table', 'cellspacing');
    # $ok = undef, because "cellspacing" is not a valid attribute for
    # table in HTML 5.
This returns a true value if the attribute is allowed for the specified tag. The default version is HTML 5. Another version of HTML can be specified using the parameter standard:
my $ok = tag_attr_ok ('html', 'onload', standard => 'html2');
The possible versions are as in "attributes".
my $type = attr_type ('onmouseover'); # $type = 'script'
This returns a text string containing likely type information for the attribute. This content is extracted from the internals of "HTML Tidy", and it may or may not be correct. This interface is experimental, and likely to change.
These variables are present in this module for compatibility with existing programs which use HTML::Tagset. However, they are fundamentally flawed and should not be used for new projects.
In HTML::Valid::Tagset, "%is_Possible_Strict_P_Content" is identical to "%isInline".
"@p_closure_barriers" is a mistake in HTML::Tagset which is preserved in name only for backwards compatibility. See also "Issues with HTML::Tagset".
In HTML::Valid::Tagset, this resolves to an empty list.
The following parts of HTML::Tagset are not implemented in version 0.07 of HTML::Valid::Tagset.
This is not implemented in HTML::Valid::Tagset.
This is a program and a library in C for improving HTML. It was originally written by Dave Raggett of the W3 Consortium. HTML::Valid is based on this project.
HTML Tidy web page
HTML Tidy git repository
Please note that this is not the Perl module HTML::Tidy by Andy Lester, although that module is also based on the above library.
HTML::Tagset, HTML::Element, HTML::TreeBuilder, HTML::LinkExtor
This section gives links to the HTML standards which HTML::Valid supports.
HTML 2.0 was described in RFC ("Request For Comments") 1866, a standard of the Internet Engineering Task Force. See http://www.ietf.org/rfc/rfc1866.txt.
This was described in the HTML 3.2 Reference Specification. See http://www.w3.org/TR/REC-html32.
This was described in the HTML 4.01 Specification. See http://www.w3.org/TR/html401/.
http://diveintohtml5.info/.
This isn't a standards document, but "Dive into HTML 5" may be good background reading before trying to read the standards documents.
This is at https://developers.whatwg.org/. It says
This specification is intended for authors of documents and scripts that use the features defined in this specification.
This is at http://www.w3.org/TR/html5/. It's the W3 consortium's version of the WHATWG documents.
The hashes and arrays are exported on demand. Everything can be exported with :all:
    use HTML::Valid::Tagset ':all';
There are several problems with HTML::Tagset version 3.20 which mean that it's difficult to be fully compatible with it.
@p_closure_barriers
There is a long-winded argument in the documentation of HTML::Tagset, which has been there since version 3.01, released on Aug 21 2000, about why it's possible for a p element to contain another p element. However, the specification for HTML4.01, which HTML::Tagset seems to be based on, from 1999, states
The P element represents a paragraph. It cannot contain block-level elements (including P itself).
Thus, it is simply not possible for any block element to legally be part of a paragraph, and the mechanism that HTML::Tagset suggests for how a paragraph element can contain a table which can contain a paragraph element, like this:
    <p>
    <table>
is not and was not legal HTML, since <table> itself is a block level element, and the HTML rule is that in the above case, if a new block level element is seen, a </p> is inserted automatically, so it always becomes
    <p>
    </p>
    <table>
anyway. See "%isBlock" for testing for whether an element is a block level element.
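The rule described above, that a block level element implicitly closes an open p element, can be sketched with "%isBlock". This is a deliberately simplified illustration: the implied_p_closes routine is hypothetical, it looks only at a sequence of start tags, and it ignores explicit end tags and every other parsing rule.

```perl
use strict;
use warnings;
use HTML::Valid::Tagset '%isBlock';

# Hypothetical helper: given start tag names in document order, return the
# tags before which a </p> would be implied because a <p> was still open.
sub implied_p_closes
{
    my (@tags) = @_;
    my $p_open = 0;
    my @closed_before;
    for my $tag (@tags) {
        if ($p_open && $isBlock{$tag}) {
            # A block level element ends the open paragraph.
            push @closed_before, $tag;
            $p_open = 0;
        }
        $p_open = 1 if $tag eq 'p';
    }
    return @closed_before;
}

print "</p> implied before <$_>\n" for implied_p_closes (qw/p em table/);
```

Since p is itself a block level element, a second p start tag is handled by the same test as table, which matches the HTML 4.01 statement quoted above.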
So in this module, "@p_closure_barriers" is an empty set.
%is_Possible_Strict_P_Content
The comments for HTML::Tagset version 3.20 read
    # I've no idea why there's these latter exceptions.
    # I'm just following the HTML4.01 DTD.
and following this it lists the form tag in this hash. However, the form tag is a block level element, so the purpose of this hash seems to be misguided. Since, as noted above, a p tag can contain any inline element, in this module, for compatibility, "%is_Possible_Strict_P_Content" is just the same thing as "%isInline".
The documented %isBodyMarkup doesn't exist; %isBodyElement exists in its place.
This is reported as https://rt.cpan.org/Public/Bug/Display.html?id=109024.
%isHeadElement
The documentation of %isHeadElement claims
This hashset contains all elements that elements that should be present only in the 'head' element of an HTML document.
However, it actually contains elements that can be present either only in the head, like <title>, or in both the head and the body, like <script>. In this module, "%isHeadElement" copies the contents of HTML::Tagset rather than its documentation.
The issue in HTML::Tagset is reported as https://rt.cpan.org/Ticket/Display.html?id=109044.
This is reported as https://rt.cpan.org/Public/Bug/Display.html?id=109018.
Portions of this module are taken from HTML::Tagset, which bears the following copyright notice.
Copyright 1995-2000 Gisle Aas.
Copyright 2000-2005 Sean M. Burke.
Copyright 2005-2008 Andy Lester.
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
However, the bulk of HTML::Valid::Tagset is not a fork of HTML::Tagset; it is based on "HTML Tidy".
HTML::Valid is based on HTML Tidy, which is under the following copyright:
Copyright (c) 1998-2008 World Wide Web Consortium (Massachusetts Institute of Technology, European Research Consortium for Informatics and Mathematics, Keio University). All Rights Reserved.
COPYRIGHT NOTICE:
This software and documentation is provided "as is," and the copyright holders and contributing author(s) make no representations or warranties, express or implied, including but not limited to, warranties of merchantability or fitness for any particular purpose or that the use of the software or documentation will not infringe any third party patents, copyrights, trademarks or other rights.
The copyright holders and contributing author(s) will not be held liable for any direct, indirect, special or consequential damages arising out of any use of the software or documentation, even if advised of the possibility of such damage.
Permission is hereby granted to use, copy, modify, and distribute this source code, or portions hereof, documentation and executables, for any purpose, without fee, subject to the following restrictions:
1. The origin of this source code must not be misrepresented.
2. Altered versions must be plainly marked as such and must not be misrepresented as being the original source.
3. This Copyright notice may not be removed or altered from any source or altered source distribution.
The Perl parts of this distribution are copyright (C) 2015-2017 Ben Bullock and may be used under either the above licence terms, or the usual Perl conditions, either the GNU General Public Licence or the Perl Artistic Licence.