Transform.pm - metacpan.org

package Unicode::Transform;

require 5.006001;

use strict;
no warnings 'utf8';

require Exporter;
require DynaLoader;

our $VERSION = '0.51';

our @ISA = qw(Exporter DynaLoader);

our @UTF_names = qw(utf16le utf16be utf32le utf32be utf8 utf8mod utfcp1047);
our @Codenames = ("unicode", @UTF_names);

our %EXPORT_TAGS = (
    'from' => [ map('unicode_to_'.$_, @UTF_names) ],
    'to'   => [ map($_.'_to_unicode', @UTF_names) ],
    'chr'  => [ map('chr_'.$_,        @Codenames) ],
    'ord'  => [ map('ord_'.$_,        @Codenames) ],
);

for my $a (@Codenames) {
    for my $b (@Codenames) {
	push @{ $EXPORT_TAGS{'conv'} }, "${a}_to_${b}";
    }
}

our @EXPORT    = (map @$_, @EXPORT_TAGS{qw(from to)});
our @EXPORT_OK = (map @$_, @EXPORT_TAGS{qw(conv chr ord)});
$EXPORT_TAGS{'all'} = \ @EXPORT_OK;

bootstrap Unicode::Transform $VERSION;

1;
__END__

=head1 NAME

Unicode::Transform - conversion among Unicode Transformation Formats

=head1 SYNOPSIS

    use Unicode::Transform ':all';

    $unicode_string = utf16be_to_unicode($utf16be_string);
    $utf16le_string = unicode_to_utf16le($unicode_string);
    $utf8_string    = utf32be_to_utf8   ($utf32be_string);

    $utf8_string    = utf32be_to_utf8(\&chr_utf8, $utf32be_string);
         # ill-formed octet sequences are allowed.

=head1 DESCRIPTION

This module provides some functions to convert a string
among some Unicode Transformation Formats (UTF).

=head2 Conversion Between UTF

(Exporting: C<use Unicode::Transform ':conv';>)

=over 4

=item C<E<lt>SRC_UTF_NAMEE<gt>_to_E<lt>DST_UTF_NAMEE<gt>([CALLBACK,] STRING)>

=back

Returns a string in DST_UTF_NAME corresponding to STRING in SRC_UTF_NAME.

B<Function names>

A function name consists of SRC_UTF_NAME, a string '_to_',
and DST_UTF_NAME. SRC_UTF_NAME and DST_UTF_NAME must be one
in the list of hyphen-removed and lowercased names following:

    unicode    (for Perl internal Unicode encoding; see perlunicode)
    utf16le    (for UTF-16LE)
    utf16be    (for UTF-16BE)
    utf32le    (for UTF-32LE)
    utf32be    (for UTF-32BE)
    utf8       (for UTF-8)
    utf8mod    (for UTF-8-Mod)
    utfcp1047  (for CP1047-oriented UTF-EBCDIC).

In all, 64 (i.e. 8 times 8) functions are available. Available function names
include C<utf16be_to_utf32le()> and C<utf8_to_unicode()>.
DST_UTF_NAME may be same as SRC_UTF_NAME like C<utf8_to_utf8()>.

Conversions where both SRC_UTF_NAME and DST_UTF_NAME begin at 'utf' are
defined well and stably. In contrast to these UTF, the Perl internal Unicode
encoding is influenced by the platform-dependent features (e.g. 32bit/64bit,
ASCII/EBCDIC).

B<Parameters>

If the first parameter is a reference, that is regarded as  the CALLBACK.
Any reference will not allowed as STRING. If CALLBACK is given,
the second parameter is STRING; otherwise the first is.
Currently, only code references are allowed as CALLBACK.

If CALLBACK is omitted, only Unicode scalar values (C<0x0000..0xD7FF>
and C<0xE000..0x10FFFF>) are allowed. Ill-formed octet sequences
(corresponding to a code point outside the range of Unicode scalar values)
and partial octets (which does not correspond to any code point) are deleted,
as if a code reference constantly returning an empty string,
C<sub {''}>, was used as CALLBACK.

Examples of partial octets: the first octet without following octets in UTF-8
like C<"\xC2">; the last octet in UTF-16BE,LE with odd number of octets.

If CALLBACK is specified, the appearance of an ill-formed octet sequences
or a partial octet calls the code reference. The first parameter for CALLBACK
is the unsigned integer value of its code point;
if the value is lesser than 256, that is a partial octet.

The return value from CALLBACK will be inserted there.
You may use C<chr_E<lt>DST_UTF_NAMEE<gt>()> as CALLBACK (see below).
Return value from CALLBACK should be in UTF of DST_UTF_NAME.

You can call C<die> or C<croak> in CALLBACK when you want to stop
the operation if the whole STRING would not be well-formed.

=head2 Conversion from Code Point to String

(Exporting: C<use Unicode::Transform ':chr';>)

=over 4

=item chr_C<E<lt>DST_UTF_NAMEE<gt>(CODEPOINT)>

=back

Returns a string in DST_UTF_NAME corresponding to CODEPOINT.
CODEPOINT should be an unsigned integer. If CODEPOINT is outside
the range of Unicode scalar values, a corresponding ill-formed
octet sequence will be returned.

If CODEPOINT is greater than the maximum value, returns C<undef>.
The maximum value of CODEPOINT is:

    0x0010_FFFF for chr_utf16le() and chr_utf16be()
    0x7FFF_FFFF for chr_utf8(), chr_utf8mod(), chr_utfcp1047()
    0xFFFF_FFFF for chr_utf32le(), chr_utf32be()

The maximum value of CODEPOINT for C<chr_unicode()> depends
on the platform features (e.g. 32bit/64bit, ASCII/EBCDIC).

B<Function names>

The full list of functions provided:

=over 4

=item C<chr_unicode(CODEPOINT)>

=item C<chr_utf16le(CODEPOINT)>

=item C<chr_utf16be(CODEPOINT)>

=item C<chr_utf32le(CODEPOINT)>

=item C<chr_utf32be(CODEPOINT)>

=item C<chr_utf8(CODEPOINT)>

=item C<chr_utf8mod(CODEPOINT)>

=item C<chr_utfcp1047(CODEPOINT)>

=back

=head2 Numeric Value of the First Character

(Exporting: C<use Unicode::Transform ':ord';>)

=over 4

=item ord_C<E<lt>SRC_UTF_NAMEE<gt>(STRING)>

=back

Returns an unsigned integer value of the first character of STRING
in SRC_UTF_NAME. STRING may begin at an ill-formed octet sequence
corresponding to a surrogate code point (C<0xD800..0xDFFF>)
or an out-of-range code point (C<0x110000> and greater). If STRING
is empty or begins at a partial octet, returns C<undef>.

B<Function names>

The full list of functions provided:

=over 4

=item C<ord_unicode(STRING)>

=item C<ord_utf16le(STRING)>

=item C<ord_utf16be(STRING)>

=item C<ord_utf32le(STRING)>

=item C<ord_utf32be(STRING)>

=item C<ord_utf8(STRING)>

=item C<ord_utf8mod(STRING)>

=item C<ord_utfcp1047(STRING)>

=back

=head1 AUTHOR

SADAHIRO Tomoyuki <SADAHIRO@cpan.org>

Copyright(C) 2002-2005, SADAHIRO Tomoyuki. Japan. All rights reserved.

This module is free software; you can redistribute it
and/or modify it under the same terms as Perl itself.

=head1 SEE ALSO

=over 4

=item http://www.unicode.org/reports/tr16/

UTF-EBCDIC (and UTF-8-Mod)

=back

=cut
	Global
`s`	Focus search bar
`?`	Bring up this help dialog
	GitHub
`g` `p`	Go to pull requests
`g` `i`	go to github issues (only if github is preferred repository)
	POD
`g` `a`	Go to author
`g` `c`	Go to changes
`g` `i`	Go to issues
`g` `d`	Go to dist
`g` `r`	Go to repository/SCM
`g` `s`	Go to source
`g` `b`	Go to file browse
	Search terms
module: (e.g. module:Plugin)
distribution: (e.g. distribution:Dancer auth)
author: (e.g. author:SONGMU Redis)
version: (e.g. version:1.00)