The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

NAME

XML::Table2XML - Generic conversion of tabular data to XML by reverting Excel's flattener methodology.

SYNOPSIS

        use XML::Table2XML;
        my $outXML = "";
        # first parse column path headers for attribute names, id columns and special common sibling mark ("//")
        parseHeaderForXML("rootNodeName", ['/@id','/@name2','/a']);
        # then walk through the whole data to build the actual XML string into $outXML
        my @datarows = ([1,"testName","testA"],
                                        [1,"testName","testB"],
                                        [1,"testName","testC"]);
        for my $lineData (@datarows) {
                $outXML.=addXMLLine($lineData);
        }
        #finally finish the XML and reset the static vars
        $outXML.=addXMLLine(undef);
        print $outXML;
        # yields:
        # <?xml version="1.0"?>
        # <rootNodeName id="1" name2="testName"><a>testA</a><a>testB</a><a>testC</a></rootNodeName>

DESCRIPTION

table2xml is an algorithm having two functions that allow the conversion of tabular data to XML without using XSLT. This is achieved by reverting the "Flattener" methodology used by Microsoft Excel to convert the XML tree format to a two-dimensional table (see Opening XML Files in Excel and INFO: Microsoft Excel 2002 and XML).

This reversion is achieved by:

1. (possibly) modifying the flattened table a bit to enable a simpler processing of the data,

2. sequentially processing the data column- and row wise.

The whole algorithm is done without the aid of any XML library, so it lends itself to easy translation into other environments and languages.

Producing the XML:

1. invoke parseHeaderForXML, using a line with the rootnode and path information.

2. After parsing the header info, the table data can be processed row by row by calling addXMLLine. The current data row is provided in the single argument lineData, the built XML is string returned and can be collected/written.

3. A final call to addXMLLine with lineData == undef restores the static variables and finalizes the XML string (closes any still open tags).

Public Functions

parseHeaderForXML ($rootNodeName,\@header,$LINEBREAKS,$XMLDIRECTIVE,$ENCODING)

rootNodeName is the name of the common root node. Any /@rootAttributes and /#text will be placed under respectively after this root node.

header is a list of paths denoting the "place" of the data in the targeted XML. Following special cases are allowed:

  • Plain elements

    are denoted by /node/subnode/subsubnode/etc.../elementName

  • Attributes

    are denoted by /node/subnode/subsubnode/etc.../@attributeName

  • "ID" nodes

    are denoted by /node/subnode/subsubnode/etc.../#id (they are not being ouptut)

  • special common sibling nodes

    are denoted by a leading double slash (//) special common sibling nodes are used for nested common sibling nodes (e.g., <root><a><b>test</b></a><otherData>...<root> or <root><a><b>test1</b><z>test2</z></a><otherData>...<root> ) must be located at the beginning of the last node within the nested sibling.

  • a root text element

    is denoted by /#text

  • root attributes

    are given as /@rootNodeAttribute

LINEBREAKS specifies whether '\n' should be added after each datarow for easier readablity, default is no linebreaks

XMLDIRECTIVE specifies any header being inserted before the root element, default is '<?xml version="1.0"?'>.

ENCODING denotes the Unicode Codification used to encode the string(s) returned by addXMLLine(), default is 'iso-8859-1'

$returnedXML = addXMLLine(\@lineData)

lineData is a list of data elements that are converted to XML following the parsed header information.

The produced XML is returned as a function value which can be concatenated or written to a file. Bear in mind that the returned XML is just a part of a larger structure, so only after the last line has been processed and addXMLLine(undef) has been called, the XML structure is finished.

Prerequisites for column order and data layout

The layout of the columns (header = "data paths" and respective column data below) has to follow a certain layout:

  • child nodes always have to follow their parent nodes (i.e. /a/b/c is after /a or /a/b).

  • "#id" columns and attributes belong to the same element node (e.g. /a/b/#id, /a/b/@att1 and /a/b/@att2) and therefore have to be given consecutively and with the "#id" column first (attributes and element node order is not important).

  • related subnodes have to be grouped together (i.e. /a/b, /a/c, /a/x, /a/x/@att, ...), other subnodes have to follow.

    The layout of the data below has to be as follows (recursively similar within the blocks for any sub-blocks):

     Block1PathHeaders            Block2PathHeaders       Block3PathHeaders
     Block1Data                   EMPTY                   EMPTY
     ...                          EMPTY                   EMPTY
     Block1Data                   EMPTY                   EMPTY
     EMPTY                        Block2Data              EMPTY
     EMPTY                        ...                     EMPTY
     EMPTY                        Block2Data              EMPTY
     EMPTY                        EMPTY                   Block3Data
     EMPTY                        EMPTY                   ...
     EMPTY                        EMPTY                   Block3Data
     

    where the corresponding XML would then look like:

     <root>
            <Block1>
                    <Block1subnode>
                    ...
                    </Block1subnode>
                    <Block1subnode>
                    ...
                    </Block1subnode>
                    ..
            </Block1>
            <Block2>
                    <Block2subnode>
                    ...
                    </Block2subnode>
                    <Block2subnode>
                    ...
                    </Block2subnode>
                    ..
            </Block2>
            <Block3>
                    <Block3subnode>
                    ...
                    </Block3subnode>
                    <Block3subnode>
                    ...
                    </Block3subnode>
                    ..
            </Block3>
     </root>
  • Sibling nodes that are "common" to a whole subnode (e.g. <subnode><commonSibling>value1</commonSibling><otherNodes>...</otherNodes></subnode>) have to be first in the subnode and need to "span" the data there.

    Example for subnode <a>:

     <?xml version="1.0"?>
     <root>
      <a>
       <z>TestB</z>
       <c>TestA1</c>
       <c>TestA2</c>
       <c>TestA3</c>
       <c>TestA4</c>
      </a>
     </root>
     
     /root
     /a/z/@x        /a/c
     TestB          TestA1
     TestB          TestA2
     TestB          TestA3
     TestB          TestA4
  • In case you happen to own MS Excel, the easiest way to get that layout is to follow the steps below:

    1. Open Target XML File in Excel (don't forget the XML directive there: "<?xml version="1.0"?>" !!!!)
    2. remove any "#agg" columns (used to differentiate between numerical common siblings and "real" data)
    3. move the common root (or the common subnode) siblings leftmost of the root (or resp. Subnode)

    Examples:

     <?xml version="1.0"?>
     <root>
     <x z="testAttX">testX</x>
     <a><b><c>TestA1</c>
     <c>TestA2</c>
     <c>TestA3</c>
     <c>TestA4</c></b></a>
     </root>
     
     /root                                          /root
     /a/b/c /x      /x/@z                           /x      /x/@z           /a/b/c
     TestA1 testX   testAttX                        testX   testAttX        TestA1
     TestA2 testX   testAttX        modify to->     testX   testAttX        TestA2
     TestA3 testX   testAttX                        testX   testAttX        TestA3
     TestA4 testX   testAttX                        testX   testAttX        TestA4
     
     <?xml version="1.0"?>
     <root>
     <a><z x="TestB"></z><b><c>TestA1</c>
     <c>TestA2</c>
     <c>TestA3</c>
     <c>TestA4</c></b></a>
     </root>
     
     /root                          /root
     /a/b/c /a/z/@x                 /a/z/@x /a/b/c
     TestA1 TestB                   TestB   TestA1
     TestA2 TestB   modify to->     TestB   TestA2
     TestA3 TestB                   TestB   TestA3
     TestA4 TestB                   TestB   TestA4
    4. For nested common sibling nodes (e.g., <root><a><b>test</b></a><otherData>...<root> or <root><a><b>test1</b><c>test2</c></a><otherData>...<root>), write a double slash at the beginning of the last node within the nested sibling.

    Example (also includes column moving as in the examples above):

     <?xml version="1.0"?>
     <root>
     <a n=""CW""><l c=""oalp""><p v=""A1""></p></l>
     <f c=""oalvl""><p v=""W""></p></f>
     <p n=""target""></p></a>
     <a n=""CD""><l c=""oalp""><p v=""A1""></p></l>
     <f c=""oalvl""><p v=""D""></p></f></a>
     <r><pr v=""TEST""></pr>
     <ar r=""test2""></ar>
     <ar r=""test4""></ar></r>
     </root>
     
     /root
     /a/@n  /a/f/@c /a/f/p/@v       /a/l/@c /a/l/p/@v       /a/p/@n /r/ar/@r        /r/pr/@v
     CW     oalvl   W               oalp    A1              target          
     CD     oalvl   D               oalp    A1                      
                                                                    test2           TEST
                                                                    test4           TEST
     modify to -->
     
     /root
     /a/@n  /a/l/@c //a/l/p/@v      /a/f/@c //a/f/p/@v      /a/p/@n /r/pr/@v        /r/ar/@r
     CW     oalp    A1              oalvl   W               target          
     CD     oalp    A1              oalvl   D                       
                                                                    TEST            test2
                                                                    TEST            test4
    5. For a first column of a subnode list that is not being a "primary key" column (i.e., having empty cells or continuous equal values), introduce an artificial #id column.

    Examples:

     <?xml version="1.0"?>
     <root>
     <a x="test1">testA</a>
     <a x="test2"></a>
     </root>
     
     /root                          /root                           
     /a     /a/@x   modify to->     /a/#id  /a      /a/@x           
     testA  test1                   1       testA   test1           
            test2                   2               test2           
     
     <?xml version="1.0"?>
     <root>
     <co><f><a>Numeric</a></f></co>
     <co><f><a>VarChar</a></f></co>
     <co><f><a>VarChar</a></f></co>
     <co><f><a>VarChar</a></f></co>
     <co><f><a>VarChar</a></f></co>
     <co><f><a>DBTimeStamp</a></f></co>
     <co><f><a>VarChar</a><fk>JOB_ID</fk><fl>JOB_TITLE</fl></f></co>
     <co><f><a>Numeric</a><fk>TESTID</fk><fl>TESTn</fl></f></co>
     <co><f><a>Numeric</a></f></co>
     <co><f><a>Numeric</a><fk>EMPLOYEE_ID</fk><fl>FIRST_n</fl></f></co>
     <co><f><a>Numeric</a><fk>DEPARTMENT_ID</fk><fl>DEPARTMENT_n</fl></f></co>
     </root>
     
     /root                                  modify to->     /root                   
     /co/f/a        /co/f/fk        /co/f/fl                /co/#id /co/f/a /co/f/fk        /co/f/fl
     Numeric                                                1       Numeric         
     VarChar                                                2       VarChar         
     VarChar                                                3       VarChar         
     VarChar                                                4       VarChar         
     VarChar                                                5       VarChar         
     DBTimeStamp                                            6       DBTimeStamp     
     VarChar        JOB_ID  JOB_TITLE                       7       VarChar JOB_ID          JOB_TITLE
     Numeric        TESTID  TESTn                           8       Numeric TESTID          TESTn
     Numeric                                                9       Numeric         
     Numeric        EMPLOYEE_ID     FIRST_n                 10      Numeric EMPLOYEE_ID     FIRST_n
     Numeric        DEPARTMENT_ID   DEPARTMENT_n            11      Numeric DEPARTMENT_ID   DEPARTMENT_n
    6. Use the header row and rootnodeName for your data layout.

LIMITATIONS

Generally, pay close attention to the ordering of columns and constraints on the data as described above, since the algorithm in writeLine doesn't check for validity, thus producing invalid XML in case of failing to follow preparation steps correctly.

In mixed content nodes, the only way to correctly (re)produce the XML is for ONE content being right after the node name. There's currently no way to produce mixed content nodes with more than one text node (e.g., <node>text1<subnode>Test</subnode>text2</node> and the like).

Same sequential parent nodes are "factored" out by the flattener, so the unflattening algorithm treats them as being factored out, which means there is no way to exactly reproduce (<a><b>test1</b></a><a><b>test2</b></a>, this would be processed as <a><b>test1</b><b>test2</b></a>, which is semantically equal, but not the same...).

REFERENCE

for a detailed discussion of the flattening algorithmm in Excel see https://web.archive.org/web/20041124175116/http://support.microsoft.com/kb/282161/EN-US/ and https://web.archive.org/web/20050210015617/http://support.microsoft.com/kb/288215/EN-US/

AUTHOR

Roland Kapl, rkapl@cpan.org

COPYRIGHT AND LICENSE

Copyright (C) 2006 by Roland Kapl

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.8 or, at your option, any later version of Perl 5 you may have available.