NAME

XML::UM - Convert UTF-8 strings to any encoding supported by XML::Encoding

SYNOPSIS

 use XML::UM;

 # Set the directory with the .xml files that come with the XML::Encoding distribution
 # Always include the trailing slash!
 $XML::UM::ENCDIR = '/home1/enno/perlModules/XML-Encoding-1.01/maps/';

 # Create the encoding routine
 my $encode = XML::UM::get_encode (
        Encoding => 'ISO-8859-2',
        EncodeUnmapped => \&XML::UM::encode_unmapped_dec);

 # Convert a string from UTF-8 to the specified Encoding
 my $encoded_str = $encode->($utf8_str);

 # Remove circular references for garbage collection
 XML::UM::dispose_encoding ('ISO-8859-2');

DESCRIPTION

This module provides functions to convert UTF-8 strings to any XML encoding that XML::Encoding supports. It creates mapping routines from the .xml files found in the maps/ directory of the XML::Encoding distribution. Note that the XML::Encoding distribution installs the .enc files in your perl directory, but not the .xml files they were created from. That is why you have to set $XML::UM::ENCDIR as in the SYNOPSIS.

This implementation uses the XML::Encoding class to parse the .xml file and creates a hash that maps UTF-8 characters (each consisting of up to 4 bytes) to their equivalent byte sequence in the specified encoding. Note that large mappings may consume a lot of memory!
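The shape of such a hash can be sketched as follows. This is only an illustration, not the module's code: the three-entry ISO-8859-2 excerpt and the variable names %codepoint_to_byte and %utf8_to_encoded are assumptions made for the example, whereas the real table is derived from the XML::Encoding .xml map files.

```perl
use strict;
use warnings;

# Hypothetical excerpt of a Unicode -> ISO-8859-2 table (code point => byte).
# The real module builds its table from the XML::Encoding .xml map files.
my %codepoint_to_byte = (
    0x0104 => 0xA1,    # LATIN CAPITAL LETTER A WITH OGONEK
    0x0141 => 0xA3,    # LATIN CAPITAL LETTER L WITH STROKE
    0x017C => 0xBF,    # LATIN SMALL LETTER Z WITH DOT ABOVE
);

# Key:   the UTF-8 byte sequence of the character (1 to 4 bytes)
# Value: the equivalent byte sequence in the target encoding
my %utf8_to_encoded;
for my $cp (keys %codepoint_to_byte) {
    my $key = chr($cp);
    utf8::encode($key);    # in-place: character string -> UTF-8 bytes
    $utf8_to_encoded{$key} = chr($codepoint_to_byte{$cp});
}
```

Every mapped character costs one hash entry, which is why a large encoding can take a noticeable amount of memory.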

Future implementations may parse the .enc files directly, or do the conversions entirely in XS (i.e. C code.)

get_encode (Encoding => STRING, EncodeUnmapped => SUB)

The central entry point to this module is the XML::UM::get_encode() method. It forwards the call to the global $XML::UM::FACTORY, which is defined as an instance of XML::UM::SlowMapperFactory by default. Override this variable to plug in your own mapper factory.
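A replacement factory might look roughly like the sketch below. The class name My::IdentityMapperFactory is hypothetical, and the interface is an assumption based only on the description above: the factory is taken to be an object whose get_encode() method receives the same name/value pairs and returns a code reference. A real factory may need more methods than this, so check XML/UM.pm before relying on it.

```perl
use strict;
use warnings;

package My::IdentityMapperFactory;    # hypothetical class name

sub new { return bless {}, shift }

# Assumed interface: takes the same name/value pairs as
# XML::UM::get_encode() and returns an encoding subroutine.
sub get_encode {
    my ($self, %args) = @_;
    die "Encoding parameter is required\n" unless defined $args{Encoding};

    # Trivial mapper for illustration: returns the string unchanged.
    return sub { return $_[0] };
}

package main;

# In a real program, do this after 'use XML::UM;' so that loading the
# module does not overwrite the variable again.
$XML::UM::FACTORY = My::IdentityMapperFactory->new;

# Calling the factory directly keeps this sketch runnable without XML::UM.
my $encode = $XML::UM::FACTORY->get_encode(Encoding => 'UTF-8');
my $result = $encode->('abc');
```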

The XML::UM::SlowMapperFactory creates an instance of XML::UM::SlowMapper (and caches it for subsequent use) that reads in the .xml encoding file and creates a hash that maps UTF-8 characters to encoded characters.

Finally, the get_encode() method of XML::UM::SlowMapper is called, which generates an anonymous subroutine that uses the hash to convert multi-byte UTF-8 sequences to the proper encoding.
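The general technique, a closure over a lookup hash, can be sketched like this. The helpers make_encoder and unmapped_dec are hypothetical names, not part of XML::UM, and the fallback assumes that unmapped characters should become decimal character references; how the module's own encode_unmapped_dec is called may differ.

```perl
use strict;
use warnings;

# Hypothetical helper: returns an anonymous subroutine that closes over
# the mapping hash and the fallback for unmapped characters.
sub make_encoder {
    my ($map, $encode_unmapped) = @_;
    return sub {
        my ($utf8_str) = @_;
        # Replace each multi-byte UTF-8 sequence (a lead byte followed by
        # continuation bytes); plain ASCII bytes pass through untouched.
        $utf8_str =~ s{([\xC0-\xF7][\x80-\xBF]+)}
                      {exists $map->{$1} ? $map->{$1} : $encode_unmapped->($1)}ge;
        return $utf8_str;
    };
}

# Assumed fallback: emit a decimal character reference such as '&#8364;'.
sub unmapped_dec {
    my ($bytes) = @_;
    utf8::decode($bytes);    # UTF-8 bytes -> a single character
    return '&#' . ord($bytes) . ';';
}

# Two ISO-8859-2 entries, enough for the demonstration.
my %map    = ("\xC4\x84" => "\xA1", "\xC5\x81" => "\xA3");
my $encode = make_encoder(\%map, \&unmapped_dec);

# 'A', U+0104 (mapped), a space, and U+20AC (not in the table)
my $out = $encode->("A\xC4\x84 \xE2\x82\xAC");
```

Because the subroutine holds a reference to the hash, the hash stays alive for as long as the subroutine does.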

dispose_encoding ($encoding_name)

Call this to free the memory used by the SlowMapper for a specific encoding. Note that in order to free the big conversion hash, the user should no longer have references to the subroutines generated by get_encode().
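The effect of a lingering subroutine reference can be seen in a small standalone sketch. It does not use XML::UM: the Tracked::Map class, the %cache hash, and the local get_encode and dispose_encoding functions are stand-ins invented for the example, and it shows only plain reference counting, not the circular references mentioned in the SYNOPSIS.

```perl
use strict;
use warnings;

my $destroyed = 0;    # counts how many maps have been freed

package Tracked::Map;    # hypothetical stand-in for the big conversion hash
sub new     { return bless { "\xC4\x84" => "\xA1" }, shift }
sub DESTROY { $destroyed++ }

package main;

my %cache;    # stand-in for the factory's per-encoding cache

sub get_encode {
    my ($name) = @_;
    my $map = $cache{$name} ||= Tracked::Map->new;
    # The closure captures $map, so it keeps the hash alive.
    return sub { my ($s) = @_; $s =~ s/(\xC4\x84)/$map->{$1}/g; return $s };
}

sub dispose_encoding {
    my ($name) = @_;
    delete $cache{$name};    # drop the cache's reference only
    return;
}

my $encode = get_encode('ISO-8859-2');

dispose_encoding('ISO-8859-2');
my $freed_after_dispose = $destroyed;    # still 0: $encode references the map

undef $encode;                           # drop the last reference
my $freed_after_release = $destroyed;    # now 1: the map has been freed
```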

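The parameters to the get_encode() method are passed as name/value pairs, as shown in the SYNOPSIS: Encoding (a string naming the target encoding, e.g. 'ISO-8859-2') and EncodeUnmapped (a subroutine reference, e.g. \&XML::UM::encode_unmapped_dec).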

CAVEATS

I'm not exactly sure which Unicode characters in the range (0 .. 127) should be mapped to themselves. See the comments in XML/UM.pm near %DEFAULT_ASCII_MAPPINGS.

The encodings that expat supports by default (e.g. UTF-16, ISO-8859-1) are currently not supported, because there are no .enc files available for these encodings. This module needs some more work. If you have the time, please help!

AUTHOR

Send bug reports, hints, tips, suggestions to Enno Derksen at <enno@att.com>.