NAME

PRANG::XMLSchema::Guide - converting .xsd to PRANG by hand

OVERVIEW

With XMLSchema, you are supplied with a set of .xsd files which define the schema; the schema itself is written in XML. This document walks through the conversion of a real XML Schema document. The example used is the RFC 5732 EPP host mapping.

This man page is structured so that it can be used both as a tutorial and a reference; for use as a reference, scan the headings and look for the one that corresponds to the XML Schema structure you are encountering. For use as a tutorial, read from start to finish, if you like with the history of XML::EPP open in gitk or a similar tool.

EXAMPLE - RFC 5732

MODULE NAMESPACE

The first thing to do is to choose a namespace which your classes will sit in. I like to keep each XML namespace in its own package hierarchy in Perl.

 <schema targetNamespace="urn:ietf:params:xml:ns:host-1.0"
       xmlns:host="urn:ietf:params:xml:ns:host-1.0"
       xmlns:epp="urn:ietf:params:xml:ns:epp-1.0"
       xmlns:eppcom="urn:ietf:params:xml:ns:eppcom-1.0"
       xmlns="http://www.w3.org/2001/XMLSchema"
       elementFormDefault="qualified">

Five namespace-related attributes appear here, naming four distinct namespaces; the last one, xmlns=..., gives the namespace that the schema language itself is in. xmlns:foo declares which namespace nodes in this document with that prefix are in. However, unlike regular use of XML namespaces, these prefixes also affect the interpretation of attribute values, in particular values of the type= attribute of various XML Schema declaration elements.

The important one is targetNamespace - I decide to map to the XML::EPP::Host namespace, and so I create;

 package XML::EPP::Host::Node;
 use Moose::Role;
 sub xmlns { "urn:ietf:params:xml:ns:host-1.0" }
 use XML::EPP::Common;
 1;

Every class I compose this role into will get that XML namespace. This affects the default namespace for has_element definitions that are defined within that class.

I also include the XML::EPP::Common class so that all type definitions are always present. In fact, this had already been built to correspond to the namespace URI associated with the eppcom prefix above.

CONVERT ROOT NODE(S) TO PRANG::Graph CLASSES

If your XML language can have only one main element type, then make "XML::Whatever" a normal class. If the language can have multiple root elements, use roles. The XML Schema appendix of the RFC contains the following top-level element definitions:

 <!--
 Child elements found in EPP commands.
 -->
  <element name="check" type="host:mNameType"/>
  <element name="create" type="host:createType"/>
  <element name="delete" type="host:sNameType"/>
  <element name="info" type="host:sNameType"/>
  <element name="update" type="host:updateType"/>

Then later:

 <!--
 Child response elements.
 -->
  <element name="chkData" type="host:chkDataType"/>
  <element name="creData" type="host:creDataType"/>
  <element name="infData" type="host:infDataType"/>
  <element name="panData" type="host:panDataType"/>

There are 5 request elements and 4 response elements which can be used. In this case, I also want these objects to be types of the XML::EPP::Plugin role. This role requires the is_command function to be defined, so I'll make a couple of convenience roles for that, too:

So, I write:

 package XML::EPP::Host;
 use Moose::Role;
 with qw(XML::EPP::Plugin PRANG::Graph);
 1;

And:

 package XML::EPP::Host::RQ;
 use Moose::Role;
 with qw(XML::EPP::Host);
 sub is_command { 1 }
 1;

 package XML::EPP::Host::RS;
 use Moose::Role;
 with qw(XML::EPP::Host);
 sub is_command { 0 }
 1;

To define an allowed root node, I can then use:

 package XML::EPP::Host::Check;
 use Moose;
 use PRANG::Graph;
 sub root_element { "check" }
 with
      'XML::EPP::Host::RQ',
      'XML::EPP::Host::Node',
      ;

CONVERTING complexType TYPES TO ROLES

After we have the top-level definitions for each root element we can move on to define each sub-type.

Normally complexType definitions will become classes; however, sometimes it is more appropriate to convert them to roles. In this instance we are forced to, because they are used for root elements in this XML Schema, and at the top level of the schema, a single type must correspond with a single root element.

In this case the problematic elements are info and delete, which share sNameType. sNameType seems to indicate a single item, and mNameType a list. I decide to call these "Item" and "List", and to make them both roles because they seem to be generic; see also "CHOOSING GOOD CLASS NAMES".

 package XML::EPP::Host::Item;
 # <!--
 # Child elements of the <delete> and <info> commands.
 # -->
 #  <complexType name="sNameType">
 #    <sequence>
 #      <element name="name" type="eppcom:labelType"/>
 #    </sequence>
 #  </complexType>
 #
 use Moose::Role;
 use PRANG::Graph;
 has_element 'value' =>
     is => "ro",
     isa => "XML::EPP::Common::labelType",
     xml_nodeName => "name",   # the XML element is <name>, not <value>
     ;

 package XML::EPP::Host::List;
 use Moose::Role;
 use PRANG::Graph;
 # <!--
 # Child element of commands that accept multiple names.
 # -->
 # <complexType name="mNameType">
 #   <sequence>
 #     <element name="name" type="eppcom:labelType"
 #      maxOccurs="unbounded"/>
 #   </sequence>
 # </complexType>
 has_element 'members' =>
     is => "ro",
     isa => "ArrayRef[XML::EPP::Common::labelType]",
     xml_nodeName => "name",   # each list member is a <name> element
     ;

Now I can make the check, delete and info messages concrete types, each with a single root_element.

 package XML::EPP::Host::Check;
 use Moose;
 use PRANG::Graph;
 sub root_element { "check" }
 with
      'XML::EPP::Host::RQ',
      'XML::EPP::Host::Node',
      'XML::EPP::Host::List';

 package XML::EPP::Host::Info;
 use Moose;
 use PRANG::Graph;
 sub root_element { "info" }
 with
      'XML::EPP::Host::RQ',
      'XML::EPP::Host::Node',
      'XML::EPP::Host::Item';

 package XML::EPP::Host::Delete;
 use Moose;
 use PRANG::Graph;
 sub root_element { "delete" }
 with
      'XML::EPP::Host::RQ',
      'XML::EPP::Host::Node',
      'XML::EPP::Host::Item';

That should be enough to parse these messages alone.

So we go back and use the message types from the XML::EPP::Host package, and we're done:

 package XML::EPP::Host;
 use Moose::Role;
 use XML::EPP::Host::Check;
 use XML::EPP::Host::Delete;
 use XML::EPP::Host::Info;
 with qw(XML::EPP::Plugin PRANG::Graph);
 1;

Let's try it!

 denix:~/src/XML-EPP$ perl -Mlib=lib t/22-xml-rfc5732-host.t -t "0[137]"
 1..9
 ok 1 - 22-xml-rfc5732-host/rfc-examples/01-check-command.xml - parsed OK
 ok 2 - 22-xml-rfc5732-host/rfc-examples/01-check-command.xml - emitted OK (30ms)
 ok 3 - 22-xml-rfc5732-host/rfc-examples/01-check-command.xml - XML output same
 ok 4 - 22-xml-rfc5732-host/rfc-examples/03-info-command.xml - parsed OK
 ok 5 - 22-xml-rfc5732-host/rfc-examples/03-info-command.xml - emitted OK (17ms)
 ok 6 - 22-xml-rfc5732-host/rfc-examples/03-info-command.xml - XML output same
 ok 7 - 22-xml-rfc5732-host/rfc-examples/07-delete-command.xml - parsed OK
 ok 8 - 22-xml-rfc5732-host/rfc-examples/07-delete-command.xml - emitted OK (18ms)
 ok 9 - 22-xml-rfc5732-host/rfc-examples/07-delete-command.xml - XML output same
 denix:~/src/XML-EPP$ 

Win! This point corresponds to the git commit called;

 rfc5732: implement <check>, <info>, <delete> commands

In XML-EPP.git

Now I will deal with the createType;

 <!--
 Child elements of the <create> command.
 -->
  <complexType name="createType">
    <sequence>
      <element name="name" type="eppcom:labelType"/>
      <element name="addr" type="host:addrType"
       minOccurs="0" maxOccurs="unbounded"/>
    </sequence>
  </complexType>

eppcom:labelType is already defined. So, I can refer to it:

 package XML::EPP::Host::Create;
 use Moose;
 use PRANG::Graph;
 sub root_element { "create" }

 with
      'XML::EPP::Host::RQ',
      'XML::EPP::Host::Node',
      ;

 has_element 'name' =>
    is => "ro",
    isa => "XML::EPP::Common::labelType",
    ;

addrType is not yet there to be added to XML::EPP::Host::Create. So, convert it first; here is the definition:

 <complexType name="addrType">
   <simpleContent>
     <extension base="host:addrStringType">
       <attribute name="ip" type="host:ipType"
        default="v4"/>
     </extension>
   </simpleContent>
 </complexType>

Ah! simpleContent! But wait! It has an attribute, so we can't use that mapping. Blast.

CONVERTING simpleType TYPES

Let's do addrStringType first.

  <simpleType name="addrStringType">
    <restriction base="token">
      <minLength value="3"/>
      <maxLength value="45"/>
    </restriction>
  </simpleType>

Ok, this one is a simpler case - takes the token type defined in the XML Schema spec, and restricts its... length... to 45? really? Oh well, whatever - if that's what it says that's what it says.

The convention I use is to put all simpletypes in the namespace of the entire module, with the type from the XSD file after it. So, in XML::EPP::Host we write:

 use Moose::Util::TypeConstraints;
 use PRANG::XMLSchema::Types;
 subtype "XML::EPP::Host::addrStringType"
    => as "PRANG::XMLSchema::token"
    => where { length $_ >= 3 and length $_ <= 45 };

Note the use of the 'token' type from the XML Schema type library, available in PRANG::XMLSchema::Types. A number of core XML Schema types are defined in this library; if you find any which are missing, please send a patch.

I also do the host:ipType:

 <simpleType name="ipType">
   <restriction base="token">
     <enumeration value="v4"/>
     <enumeration value="v6"/>
   </restriction>
 </simpleType>

Can become simply:

 enum "XML::EPP::Host::ipType" => qw(v4 v6);

Depending on the exact ordering of loading your classes, you might find that you need to lift your subtype and enum definitions into BEGIN blocks. They must always be defined before an attribute which uses them is defined, otherwise Moose will create an ->isa type constraint, not a Str sub-type type constraint.
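The ordering requirement can be seen with plain Moose, without PRANG. The following is only a minimal sketch: the My::Example::* names and the shortToken type are hypothetical, chosen for illustration. The subtype is declared in a BEGIN block, so it is already registered when the attribute that refers to it by name is defined.

```perl
use strict;
use warnings;

package My::Example::Types;    # hypothetical package name
use Moose::Util::TypeConstraints;

# BEGIN runs the definition at compile time, before any class loaded
# later can define an attribute that refers to the type by name.
BEGIN {
    subtype "My::Example::shortToken",
        as "Str",
        where { length($_) >= 3 and length($_) <= 5 };
}

package My::Example::Thing;    # hypothetical class
use Moose;

has 'code' => (
    is  => "ro",
    isa => "My::Example::shortToken",
);

package main;

# Look the name up again to confirm it resolved to the Str sub-type.
my $type = Moose::Util::TypeConstraints::find_type_constraint(
    "My::Example::shortToken");
print "Str sub-type: ", ( $type->is_subtype_of("Str") ? "yes" : "no" ), "\n";

# The constructor dies on a constraint failure, so trap it with eval.
my $ok  = eval { My::Example::Thing->new( code => "abcd" ) };
my $bad = eval { My::Example::Thing->new( code => "much too long" ) };
print "short value accepted: ", ( $ok  ? "yes" : "no" ), "\n";
print "long value accepted:  ", ( $bad ? "yes" : "no" ), "\n";
```

If the subtype were only declared after the has line had run, Moose would already have created a class-based constraint under that name, and plain strings would be rejected.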

Now, with these definitions in place I can get back to the addrType.

CONVERTING simpleContent TYPES TO CLASSES

simpleContent means a node with no element children, however it can have attributes and textual content. These must be converted to classes.

 <complexType name="addrType">
   <simpleContent>
     <extension base="host:addrStringType">
       <attribute name="ip" type="host:ipType"
        default="v4"/>
     </extension>
   </simpleContent>
 </complexType>

Becomes the following:

 package XML::EPP::Host::Address;
 use Moose;
 use PRANG::Graph;
 with 'XML::EPP::Host::Node';
 has_element "value" =>
     is => "ro",
     isa => "XML::EPP::Host::addrStringType",
     xml_nodeName => "",
     coerce => 1,
     ;

 has_attr "ip" =>
      is => "ro",
      isa => "XML::EPP::Host::ipType",
      default => "v4",
      ;

Specifying the empty string as the xml_nodeName of an element means that the attribute holds the node's text content. Specifying the coerce => 1 option allows values which can safely be transformed to the correct type to be converted transparently. In this case, it picks up the default rule in PRANG::XMLSchema::Types which trims whitespace from PRANG::XMLSchema::token nodes.

Right, now I can pop the stack and finish the Create class:

 <element name="addr" type="host:addrType"
          minOccurs="0" maxOccurs="unbounded"/>

Becomes:

 has_element 'addr' =>
      is => "ro",
      isa => "ArrayRef[XML::EPP::Host::Address]",
      xml_min => 0,
      ;

With that, we should hopefully be able to parse the RFC create command:

 denix:~/src/XML-EPP$ perl -Mlib=lib t/22-xml-rfc5732-host.t -t 05
 1..3
 ok 1 - 22-xml-rfc5732-host/rfc-examples/05-create-command.xml - parsed OK
 ok 2 - 22-xml-rfc5732-host/rfc-examples/05-create-command.xml - emitted OK (36ms)
 ok 3 - 22-xml-rfc5732-host/rfc-examples/05-create-command.xml - XML output same
 denix:~/src/XML-EPP$ 

It worked!

This point is the git commit:

 rfc5732: implement <create> host message

USING COERCIONS FOR SIMPLER CONSTRUCTION

The Moose::Util::TypeConstraints functions for coercing between types are invaluable for writing easy to use classes.

For instance, if we write this in our Address class, we can pass plain strings or Perl hashes instead of already constructed objects; here is a recipe:

 use Moose::Util::TypeConstraints;
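 # note the plain comma after __PACKAGE__: a fat comma (=>) may quote
 # it as a literal string instead of the package name
 coerce __PACKAGE__,
         from "Str",
         => via { __PACKAGE__->new(value => $_) },
         ;
 coerce __PACKAGE__,
         from "HashRef",
         => via { __PACKAGE__->new($_) },
         ;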

Here it is assumed that if you pass a plain string, an object should be created with that string in the value field. Similarly, if a HASH reference is passed (HashRef to Moose), an object is created by passing it to the ->new constructor.

Sadly, this does not imply that ArrayRef[XML::EPP::Host::Address] with coercion enabled will happily work. Moose should know this, but doesn't currently, so we have to declare a specific rule:

 coerce "ArrayRef[XML::EPP::Host::Address]"
     => from "ArrayRef[XML::EPP::Host::Address|HashRef|Str]"
     => via {
         my @rv = @$_;
         for ( @rv ) {
             if ( ref $_ eq "HASH" ) {
                 $_ = XML::EPP::Host::Address->new($_);
             }
             elsif ( !blessed $_ ) {
                 $_ = XML::EPP::Host::Address->new(
                     value => $_,
                        );
             }
         }
         \@rv;
     },
     ;

We are now up to

 rfc5732: useful coercions for Address

CHOOSING GOOD CLASS NAMES

It's good to be somewhat systematic about the conversion of type names from your schema to class names. But frequently the type names in the XML Schema will not be adequate for such use.

So, the rules I tend to use for sanitising type names are;

  • Remove any redundant Type or similar prefix/suffix

  • If the type name contains an abbreviation, consider un-abbreviating it if it is short enough.

  • If the resulting class ends up in CamelCase, consider if there is an alternative single word which summarises the notion better, or use deeper namespaces where it makes sense (see next rule).

    If the type is only ever used with a single element name, then the name of that element is also a candidate for a good name for the class.

  • Consider making types which correspond to actions on entities live in Namespace::Entity::Action, rather than Namespace::ActionEntity.

One of the things to remember is that XML Schema type names are not normally visible to people working with the XML directly. Less care may be placed on making them understandable, than merely making them unique tokens useful enough for a standards maintainer to work with.

Where it does not cause any conflicts, consider exporting aliases to the raw XML Schema type names into the package which you are using, using subtype (from Moose::Util::TypeConstraints). For the example schema, I know this will always be safe as the types always end in Type.

This is normally as simple as using something like this at the end of your class definition:

 use Moose::Util::TypeConstraints;
 subtype "XML::EPP::Host::chgType" => as __PACKAGE__;

The above I used in the class I called XML::EPP::Host::Change.

In general, you can use these subtypes for type constraints (i.e., the isa => field on attributes) instead of the Perl package name. However, currently this does not work for xml_nodeName maps, and the real class name must be given. This may be fixed in a later release.

In any case, you need to make sure that the class which defines the subtype is loaded before you define an attribute which uses that type; or Moose will convert it to an ->isa type constraint.

Also, don't use this trick for types which got converted into roles (as mNameType and sNameType were in the example) or you might get yourself into trouble later; if roles are used in type definitions, they imply plug-in like operation where all of the classes which implement that role are allowed (and they all have to be PRANG::Graph consumers).

This point corresponds to git commit:

 rfc5732: add subtype definitions at the end of corresponding files

CHOOSING GOOD ATTRIBUTE NAMES

Attribute and element names are more visible and well-known than XML Schema type names, so the bar for renaming them to Moose attribute names is slightly higher, as renaming can actually confuse people.

That being said, sometimes the attribute names are just awful. Case in point: statusType

  <attribute name="s" type="host:statusValueType"
             use="required"/>

s ? That's ... special? silly?

I'll call it status:

 has_attr "status" =>
     is => "ro",
     isa => "XML::EPP::Host::statusValueType",
     required => 1,
     xml_name => "s",
     ;

I used xml_name there to refer to the name it gets in the XML. This works with has_attr; for has_element, you must use xml_nodeName.

I personally will also convert to keep in line with perl-ish conventions:

  XML form      Perl attribute
  Capitalized   capitalized
  CamelCase     camel_case
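The renaming convention in the table can be applied mechanically. The helper below is a hypothetical sketch, not part of PRANG: perl_attr_name is an invented name, and it covers only the two cases in the table. The original XML name still has to be supplied through xml_name or xml_nodeName.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of PRANG): derive a Perl-ish attribute
# name from the name used in the XML.
sub perl_attr_name {
    my ($xml_name) = @_;

    # Insert an underscore wherever a lower-case letter or digit is
    # directly followed by an upper-case letter, then lower-case it all.
    ( my $name = $xml_name ) =~ s/(?<=[a-z0-9])(?=[A-Z])/_/g;
    return lc $name;
}

for my $xml_name (qw(Capitalized CamelCase chkData)) {
    printf "%-12s %s\n", $xml_name, perl_attr_name($xml_name);
}
```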

HANDLING minOccurs AND maxOccurs

In the addRemType, the following definition appears:

   <sequence>
     <element name="addr" type="host:addrType"
      minOccurs="0" maxOccurs="unbounded"/>
     <element name="status" type="host:statusType"
      minOccurs="0" maxOccurs="7"/>
   </sequence>

The default for both minOccurs and maxOccurs is 1, making the element compulsory. This is the regular case, mapped to a single item type.

If maxOccurs is more than 1, then the slot in the sequence needs to be represented with an ArrayRef. Setting the type to be an ArrayRef type automatically makes PRANG default the xml_max to be unlimited, so the above definition is implemented in the corresponding class with:

 use XML::EPP::Host::Address;
 has_element 'addr' =>
     is => "ro",
     isa => "ArrayRef[XML::EPP::Host::Address]",
     xml_min => 0,
     ;
 use XML::EPP::Host::Status;
 has_element 'status' =>
     is => "ro",
     isa => "ArrayRef[XML::EPP::Host::Status]",
     xml_min => 0,
     xml_max => 7,
     ;
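The mapping from the occurrence attributes to has_element options can be written down as a small function. This is a hypothetical sketch rather than anything PRANG provides: occurs_to_options is an invented name, and it assumes that xml_min only needs to be stated when minOccurs differs from the XML Schema default of 1.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of PRANG): turn minOccurs/maxOccurs
# from the schema into the has_element options used in this guide.
sub occurs_to_options {
    my ( $class, %occurs ) = @_;

    # XML Schema defaults both attributes to 1.
    my $min = $occurs{minOccurs} // 1;
    my $max = $occurs{maxOccurs} // 1;

    # More than one occurrence means the slot must hold a list.
    my $is_list = $max eq "unbounded" || $max > 1;

    my %opt = ( isa => $is_list ? "ArrayRef[$class]" : $class );
    $opt{xml_min} = $min if $min != 1;

    # An ArrayRef type is unlimited by default, so only a finite
    # upper bound needs to be stated.
    $opt{xml_max} = $max if $is_list && $max ne "unbounded";
    return %opt;
}

my %opt = occurs_to_options(
    "XML::EPP::Host::Status",
    minOccurs => 0,
    maxOccurs => 7,
);
print join( ", ", map {"$_ => $opt{$_}"} sort keys %opt ), "\n";
```

The resulting options could then be passed on to has_element along with is => "ro".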

HANDLING MULTIPLE ELEMENTS WITH THE SAME TYPE

As the implementation of RFC5732 continues, we encounter the updateType, which contains multiple element definitions with the same type;

 <!--
 Child elements of the <update> command.
 -->
  <complexType name="updateType">
    <sequence>
      <element name="name" type="eppcom:labelType"/>
      <element name="add" type="host:addRemType"
       minOccurs="0"/>
      <element name="rem" type="host:addRemType"
       minOccurs="0"/>
      <element name="chg" type="host:chgType"
       minOccurs="0"/>
    </sequence>
  </complexType>

However, this is no problem! In fact, this is why types are mapped to classes and not elements.

I decided to name addRemType the simpler Delta (full name: XML::EPP::Host::Delta).

So, the definitions become:

 use XML::EPP::Host::Delta;
 has_element 'add' =>
    is => "ro",
    isa => "XML::EPP::Host::Delta",
    predicate => "has_add",
    coerce => 1,
    ;
 has_element 'remove' =>
    is => "ro",
    isa => "XML::EPP::Host::Delta",
    predicate => "has_remove",
    xml_nodeName => "rem",
    coerce => 1,
    ;

Also of passing note: setting predicate as in the above automatically implies xml_min => 0.

We are now up to git commit:

 rfc5732: implement <update> command

Here is the parser in action:

 denix:~/src/XML-EPP$ perl -Mlib=lib t/22-xml-rfc5732-host.t -t 09
 1..3
 ok 1 - 22-xml-rfc5732-host/rfc-examples/09-update-command.xml - parsed OK
 ok 2 - 22-xml-rfc5732-host/rfc-examples/09-update-command.xml - emitted OK (47ms)
 ok 3 - 22-xml-rfc5732-host/rfc-examples/09-update-command.xml - XML output same
 denix:~/src/XML-EPP$ 

The remaining types were implemented in three further commits;

  rfc5732 - implement <check> response (<chkData>)
  rfc5732 - implement <info> and <create> responses
  rfc5732 - implement pending action notifications (<panData>) message

All of the methods used for these commits have already been touched on in this guide.

OTHER XMLSchema CONSTRUCTS TO BE WARY OF

CONVERT <any namespace="##other"/> TO ROLES

The XML::EPP::SubCommand class is a conversion of the readWriteType XML Schema definition;

 <!--
 All other object-centric commands.  EPP doesn't specify the syntax or
 semantics of object-centric command elements.  The elements MUST be
 described in detail in another schema specific to the object.
 -->
   <complexType name="readWriteType">
     <sequence>
       <any namespace="##other"/>
     </sequence>
   </complexType>

Each point of other-schema inclusion like this is really a type; XML Schema does not have the semantics to specify this. In this case, XML::EPP::Plugin becomes the type:

 package XML::EPP::SubCommand;
 use Moose;
 use Moose::Util::TypeConstraints;
 use PRANG::Graph;

 use XML::EPP::Plugin;
 has_element "payload" =>
    is => "rw",
    isa => "XML::EPP::Plugin",
    ;

 with "XML::EPP::Node";

 subtype "XML::EPP::readWriteType"
     => as __PACKAGE__;

 1;

Once that facility is there, classes can specify that they may be included at that point by consuming the XML::EPP::Plugin role.

CONVERT attributeGroup TO ROLES

No examples yet, but a grouping of attributes as in an attributeGroup can be implemented with a role that includes lots of has_attr attributes.

CONVERT restriction TO TYPE CONSTRAINTS

Stuff like this;

  <simpleType name="pwType">
    <restriction base="token">
      <minLength value="6"/>
      <maxLength value="16"/>
    </restriction>
  </simpleType>

You can implement as:

  subtype "XML::EPP::pwType"
      => as "PRANG::XMLSchema::token",
      => where { length($_) >= 6 and length($_) <= 16 };

Enumerations such as this;

  <simpleType name="transferOpType">
    <restriction base="token">
      <enumeration value="approve"/>
      <enumeration value="cancel"/>
      <enumeration value="query"/>
      <enumeration value="reject"/>
      <enumeration value="request"/>
    </restriction>
  </simpleType>

Are much more succinctly described using the enum keyword from Moose::Util::TypeConstraints;

 enum "XML::EPP::transferOpType" =>
     qw(approve cancel query reject request);

The downside of this is that you don't get the default coerce behaviour of PRANG::XMLSchema::token, which strips leading and trailing whitespace. To keep it, use a real constraint;

 # a hash lookup avoids the smartmatch operator (~~), which is
 # deprecated in recent versions of Perl
 my %tot = map { $_ => 1 } qw(approve cancel query reject request);
 subtype "XML::EPP::transferOpType"
     => as "PRANG::XMLSchema::token",
     => where { exists $tot{$_} };

CONVERT complexType TO CLASSES

This really belongs earlier in this document; it has been implied since the first or second section. But here are the guidelines;

  • Any <attribute> sections get converted to has_attr attributes. If the attribute specifies use="required", set the Moose required => 1 attribute property.

  • The <sequence> portion is encapsulated by the order of definition of has_element attributes in the class. Be warned that when consuming roles, the order of addition of the attributes is not currently defined (as of Moose version 1.00 or so). So, be careful when putting more than one has_element in a role.

    This may be fixed in a later PRANG release (if possible) or a later Moose version.

  • Convert each <element> node to a has_element attribute. The type of the attribute as passed to isa => on the attribute should correspond to the type matching the XML Schema type definition.

    If the element has minOccurs or maxOccurs set, set predicate, xml_min and/or xml_max appropriately, and possibly make the attribute an ArrayRef type; see "HANDLING minOccurs AND maxOccurs", above.

MORE?

There are probably more XML Schema constructs than this. If you encounter difficulties, please ask for help - see PRANG for appropriate channels for this.

AUTHOR AND LICENCE

This documentation was written by Sam Vilain samv@cpan.org.

Development commissioned by NZ Registry Services, and carried out by Catalyst IT - http://www.catalyst.net.nz/

Copyright 2010, NZ Registry Services. This module is licensed under the Artistic License v2.0, which permits relicensing under other Free Software licenses.