
NAME

XML::ParseDTD - parses an XML DTD and provides methods to access the information stored in the DTD.

DEPENDENCIES

Perl Version

        5.004

Standard Modules

        Carp 1.01

Nonstandard Modules

        LWP::UserAgent 0.01
        Cache::Cache 1.02

SYNOPSIS

    use XML::ParseDTD;
    $dtd = XML::ParseDTD->new($dtd);
    $bool = $dtd->child_allowed($tag, $childtag);
    $bool = $dtd->child_list_allowed($tag, @childtags);
    $bool = $dtd->attr_allowed($tag, $attribute);
    $bool = $dtd->attr_list_allowed($tag, @attributes);
    $bool = $dtd->is_empty($tag);
    $bool = $dtd->is_defined($tag);
    $bool = $dtd->is_fixed($tag, $attribute);
    $bool = $dtd->attr_value_allowed($tag, $attribute, $value);
    $bool = $dtd->attr_list_value_allowed($tag, \%attribute_value);
    @tags = $dtd->get_document_tags();
    $regexp = $dtd->get_child_regexp($tag);
    @attributes = $dtd->get_attributes($tag);
    @req_attributes = $dtd->get_req_attributes($tag);
    $value = $dtd->get_allowed_attr_values($tag, $attribute);
    $default_value = $dtd->get_attr_def_value($tag, $attribute);
    $dtd->clear_cache();
    $errormessage = $dtd->errstr;
    $errornumber = $dtd->err;

DESCRIPTION

ParseDTD.pm is a Perl 5 object class which provides methods to access the information stored in an XML DTD.

This module tells you which tags are known by the DTD, which child tags a certain tag may have, which tags are defined as empty, which attributes a certain tag may have, which values are allowed for a certain attribute, which attributes are required, which attributes are fixed, and which attributes have which default value. In short, it tells you everything that is defined in the DTD except the entity definitions (they are on the ToDo list). At least it covers everything I know of; I am not that deep into the topic, so please let me know if I missed something. All this information can be accessed in two different ways: you can simply get it, or you can pass certain data and the module tells you whether that is OK or not.

This package uses Cache::SharedMemoryCache to cache every parsed DTD, so the next time the data structure representing the DTD can be taken straight out of memory. The DTD is then neither refetched nor parsed again, which saves quite some time and work. You can easily modify the module so that it uses Cache::FileCache if you prefer, but I think shared memory is faster.

Every time the constructor is called, it first checks whether the given DTD is already in memory. If so, it compares the last-modified date to the date stored in memory and then decides whether to refetch it. If the DTD lies on the local filesystem, this operation produces no noticeable overhead, but if the DTD is fetched from the internet it may make sense not to check the Last-Modified header every time. You can configure how often it is checked; by default it is checked every third time on average. Since most DTDs don't change, it is usually safe not to check at all.
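The "every third time on average" behaviour can be pictured with a small sketch. This is only an illustration: the text does not say how the module implements the averaging, so the random draw below is an assumption, and should_check_last_modified is a hypothetical helper name, not part of XML::ParseDTD.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of XML::ParseDTD): decide whether the
# Last-Modified header should be checked on this constructor call.
sub should_check_last_modified {
    my ($checklm) = @_;
    $checklm = 3 unless $checklm;       # 0 or undef: fall back to the documented default
    return 0 if $checklm < 0;           # -1: never check
    return 1 if $checklm <= 1;          # 1: always check
    # Assumption: a random draw that succeeds about once in $checklm calls.
    return int(rand($checklm)) == 0 ? 1 : 0;
}
```

With a value of 3 the helper returns true for roughly one call in three, which matches the described default without keeping a counter between calls.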

Internally the parsed DTD data is simply stored in 6 hash structures. Because of this and because of the caching the module should be very fast.

USING XML::ParseDTD

The Constructor

new ($dtd_url, [ %conf ])

This method is the constructor. The first argument must be the location of an XML DTD, given as a local path or as a URL using the file or http protocol. Here are some examples:

  • http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd

  • /home/moritz/xhtml1-strict.dtd

  • file://home/moritz/xhtml1-strict.dtd

The configuration hash can be used to influence the module's behaviour. The known options are:

  • checklm - configures how often the Last-Modified header should be checked if the http protocol is used. The default is 3, which means that it is checked every third time on average (the DTD is refetched and reparsed if it was modified in the meantime). Setting it to 1 forces the module to always check the Last-Modified header; setting it to -1 forces it to never check the header (which is recommended if performance is important and it is more or less certain that the DTD will not change).

  • memkey - identifier for the data structure which is saved to and taken from the cache. Because this module uses shared memory for caching, it is important that this identifier is really unique; otherwise it would probably overwrite data of another program. By default the identifier is XML::ParseDTD. A slash followed by the URL of the parsed DTD is always appended to the value of this option to distinguish the DTDs.

  • timeout - the value of this option is simply passed to LWP::UserAgent as the timeout value. Please see the documentation of LWP::UserAgent for more information. The default is 30. LWP::UserAgent is used to fetch DTDs over the http protocol and to read their Last-Modified header to find out whether they have been modified.

  • cache_expire - the value of this option is passed to Cache::Cache to set the time after which the cache expires and has to be rewritten. By default it never expires. For possible values please read the documentation of Cache::Cache.

Note: You shouldn't set any option to 0, since a value of 0 is not interpreted; the default setting is used instead.
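The constructor call and the note about 0 values can be sketched as follows. This is a hedged example: it assumes the options are passed to new as a flat key/value list (my reading of [ %conf ] in the signature), and clean_conf and open_dtd are hypothetical helper names that do not exist in the module.

```perl
use strict;
use warnings;

# Hypothetical helper: drop options with a false value (0, '' or undef),
# because the module is documented to ignore them and use its defaults.
sub clean_conf {
    my (%conf) = @_;
    my %clean;
    for my $key (sort keys %conf) {
        my $value = $conf{$key};
        unless ($value) {
            warn "option '$key' is set to a false value and would be ignored\n";
            next;
        }
        $clean{$key} = $value;
    }
    return %clean;
}

# Hypothetical wrapper around the documented constructor. The module is
# loaded lazily, so clean_conf can be used even where it is not installed.
sub open_dtd {
    my ($dtd_url, %conf) = @_;
    require XML::ParseDTD;
    # Assumption: the configuration is passed as a flat list after the URL.
    return XML::ParseDTD->new($dtd_url, clean_conf(%conf));
}
```

A call such as open_dtd('/home/moritz/xhtml1-strict.dtd', checklm => -1, timeout => 10) would then skip the Last-Modified check entirely and lower the LWP::UserAgent timeout, while a stray timeout => 0 would be reported instead of silently falling back to 30.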

Check Methods

child_allowed ($tag, $childtag)

Checks whether the given tag can contain the given childtag.

Returns 1 (true) or 0 (false).

child_list_allowed ($tag, @childtags)

Checks whether it is OK for the given tag to contain the given child tags in the given order. This means that the method fails if a certain tag is not allowed, a required tag is not given, or the order is not allowed.

Returns 1 (true) or 0 (false).

attr_allowed ($tag, $attribute)

Checks whether the given attribute is allowed for the given tag.

Returns 1 (true) or 0 (false).

attr_list_allowed ($tag, @attributes)

Checks whether it is OK for the given tag to have the given attributes set. This means that the method fails if a certain attribute is not allowed or a required attribute is not given.

Returns 1 (true) or 0 (false).

is_empty ($tag)

Checks whether the given tag is an empty tag, that is, whether it cannot contain any elements or data.

Returns 1 (true) or 0 (false).

is_any ($tag)

Checks whether the given tag has content model ANY.

Returns 1 (true) or 0 (false).

is_defined ($tag)

Checks whether the given tag is defined in the dtd, that means whether it is allowed in the document.

Returns 1 (true) or 0 (false).

is_fixed ($tag, $attribute)

Checks whether the given attribute for the given tag is a fixed attribute, that means if its value is predefined by the dtd.

If so, you can use get_allowed_attr_values to get the predefined value.

Returns 1 (true) or 0 (false).
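Combining is_fixed with get_allowed_attr_values might look like the sketch below. fixed_value is a hypothetical helper name; $dtd is assumed to be an object that provides the two documented methods.

```perl
use strict;
use warnings;

# Hypothetical helper: return the predefined value of a fixed attribute,
# or undef (in scalar context) if the attribute is not fixed.
sub fixed_value {
    my ($dtd, $tag, $attribute) = @_;
    return unless $dtd->is_fixed($tag, $attribute);
    return $dtd->get_allowed_attr_values($tag, $attribute);
}
```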

attr_value_allowed ($tag, $attribute, $value)

Checks whether the given attribute for the given tag might be set to the given value.

Returns 1 (true) or 0 (false).

attr_list_value_allowed ($tag, \%attribute_value)

Calls attr_list_allowed for the attribute names; if everything is fine, it calls attr_value_allowed for each value.

Returns 1 (true) or 0 (false).
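The check methods can be combined to validate a single element. The following is a minimal sketch, not the author's code: validate_element is a hypothetical helper, $dtd is assumed to be any object offering the check methods documented above, and how child_list_allowed treats an empty child list is not stated in this documentation.

```perl
use strict;
use warnings;

# Hypothetical helper: returns an empty string if no problem was found,
# otherwise a short message describing the first failed check.
sub validate_element {
    my ($dtd, $tag, $attrs, $children) = @_;

    return "unknown tag '$tag'" unless $dtd->is_defined($tag);

    # Checks attribute names first, then each value (see above).
    return "attributes not allowed for '$tag'"
        unless $dtd->attr_list_value_allowed($tag, $attrs);

    # An empty tag must not have any children at all.
    if ($dtd->is_empty($tag)) {
        return @$children ? "'$tag' must be empty" : '';
    }

    return "child sequence not allowed in '$tag'"
        unless $dtd->child_list_allowed($tag, @$children);

    return '';
}
```

A call like validate_element($dtd, 'p', { class => 'intro' }, ['br']) would then report the first rule that the element violates, or an empty string if all checks pass.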

Get Methods

get_document_tags

Returns a list of all tags which are defined in the dtd, that means which are allowed in the document.

get_child_regexp ($tag)

Returns the regular expression, which defines which combinations of child elements are valid for the given tag, as a string.

get_attributes ($tag)

Returns a list of all attributes which are allowed for the given tag.

get_req_attributes ($tag)

Returns a list of all required attributes for the given tag.
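One use of this list is to find out which required attributes are missing when attr_list_allowed fails. A minimal sketch, assuming $dtd provides get_req_attributes as documented; missing_required is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: list the required attributes of $tag that are
# not present as keys in the given attribute hash.
sub missing_required {
    my ($dtd, $tag, $attrs) = @_;
    return grep { !exists $attrs->{$_} } $dtd->get_req_attributes($tag);
}
```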

get_allowed_attr_values ($tag,$attribute)

Returns the allowed values for the given attribute for the given tag.

If only one certain string is allowed to be set as value, this string is returned. If the value must be one string out of a list of strings, a reference to this list is returned. If the value must be of a certain datatype such as PCDATA, ID or NMTOKEN, a reference to a hash with only one element is returned. The key is the name of the datatype and the value is a regular expression string which describes the datatype.

undef is returned if nothing is defined as attribute value. That normally means that the attribute is not known for the given tag, but you can call errstr to get more information.
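Because the return value can take four different shapes, calling code has to look at its reference type. The sketch below shows one way to do that; describe_allowed_values is a hypothetical helper that works purely on the value returned by get_allowed_attr_values.

```perl
use strict;
use warnings;

# Hypothetical helper: describe the return value of
# get_allowed_attr_values in plain words.
sub describe_allowed_values {
    my ($allowed) = @_;

    # undef: nothing is defined for this tag/attribute pair.
    return 'nothing defined (call errstr for details)' unless defined $allowed;

    # Array reference: the value must be one string out of a list.
    if (ref $allowed eq 'ARRAY') {
        return 'one of: ' . join(', ', @$allowed);
    }

    # Hash reference with one element: datatype name => regular expression.
    if (ref $allowed eq 'HASH') {
        my ($type) = keys %$allowed;
        return "datatype $type matching /$allowed->{$type}/";
    }

    # Plain string: exactly this value is allowed.
    return "exactly '$allowed'";
}
```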

get_attr_def_value ($tag,$attribute)

Returns the default value defined for the given attribute of the given tag. In most cases no default value is defined, that means that undef is returned. But undef is also returned if the tag does not exist or if the attribute is not allowed for the given tag. To get more information why undef was returned, you should call errstr.
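As an alternative to parsing the text of errstr, the three undef cases can be told apart with the check methods. This is a sketch under the assumption that $dtd provides get_attr_def_value, is_defined and attr_allowed as documented; explain_default is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: explain why get_attr_def_value returned what it did.
sub explain_default {
    my ($dtd, $tag, $attribute) = @_;

    my $default = $dtd->get_attr_def_value($tag, $attribute);
    return "default is '$default'" if defined $default;

    return "tag '$tag' is not defined" unless $dtd->is_defined($tag);
    return "attribute '$attribute' is not allowed for '$tag'"
        unless $dtd->attr_allowed($tag, $attribute);

    return 'no default value defined';
}
```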

Other Methods

clear_cache ()

Clears the cache, that means that the dtd will be refetched and reparsed next time.

errstr ()

Returns the message of the last error that occurred.

err ()

Returns the number of the last error that occurred.

BUGS

Send bug reports to bug-XML-ParseDTD@rt.cpan.org (if that doesn't work, feel free to send them directly to moritz@freesources.org), or use the web interface at http://rt.cpan.org/NoAuth/Bugs.html?Dist=XML-ParseDTD.

Thanks!

AUTHOR

(c) 2003, Moritz Sinn. This module is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License (see http://www.gnu.org/licenses/gpl.txt) as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

    This module is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.

I am always interested in knowing how my work helps others, so if you put this module to use in any of your own code then please send me the URL. If you make modifications to the module because it doesn't work the way you need, please send me a copy so that I can roll desirable changes into the main release.

Address comments, suggestions, and bug reports to moritz@freesources.org.