
Unicode::Japanese - Japanese Character Encoding Handler

use Unicode::Japanese;
# convert utf8 -> sjis
print Unicode::Japanese->new($str)->sjis;
# convert sjis -> utf8
print Unicode::Japanese->new($str,'sjis')->get;
# convert sjis (imode_EMOJI) -> utf8
print Unicode::Japanese->new($str,'sjis-imode')->get;
# convert ZENKAKU (utf8) -> HANKAKU (utf8)
print Unicode::Japanese->new($str)->z2h->get;

Module for conversion among Japanese character encodings.
The get() method returns a UTF-8 `bytes' string in the current release. This behavior of get() may change in a future release.
The sjis(), jis(), utf8(), and similar methods return a bytes string. The new, set, and getcode methods accept input as either a utf8 (character) string or a bytes string.

Creates a new instance of Unicode::Japanese.
If arguments are specified, they are passed through to the set method.
Sets a string in the instance. If '$icode' is omitted, the string is treated as UTF-8.
To specify an encoding, choose from the following: 'jis', 'sjis', 'euc', 'utf8', 'ucs2', 'ucs4', 'utf16', 'utf16-be', 'utf16-le', 'utf32', 'utf32-be', 'utf32-le', 'ascii', 'binary', 'sjis-imode', 'sjis-doti', 'sjis-jsky'.
When 'sjis-imode' or 'sjis-doti' is specified, '&#dddd' is converted to "EMOJI".
For automatic encoding detection, you MUST specify 'auto', which calls the getcode() method automatically.
For the ASCII encoding, only 'base64' may be specified. When it is given, the string is decoded from base64 before it is stored.
To decode binary, specify 'binary' as the encoding.
Gets the string as UTF-8.
Returns a `bytes' string in the current release; this behavior may change in a future release.
Prefer the utf8() method when you need a `bytes' string, or the getu() method when you need a `character' string.
Gets the string as UTF-8.
On perl-5.8.0 and later, the return value has the utf-8 flag set.
Detects the character encodings of $str.
Notice: This method detects the encoding of $str, NOT the encoding of the string held in the instance.
Character encodings are distinguished by the following algorithm:
(In case of PurePerl)
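(In case of XS)
The string is checked against each of the following encodings:
ascii / euc-jp / sjis / jis / utf8 / utf32-be / utf32-le / sjis-jsky / sjis-imode / sjis-doti
If more than one encoding matches, the first match in the following order is used:
utf32-be / utf32-le / ascii / jis / euc-jp / sjis / sjis-jsky / sjis-imode / sjis-doti / utf8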
Regarding the algorithm, pay attention to the following:
Because XS and PurePerl use different algorithms, the detection results may differ. For example, a string that is SJIS with escape characters is detected as SJIS on PurePerl, but it cannot be detected as SJIS on XS, because the XS algorithm cannot distinguish SJIS from SJIS-Jsky in that case. This exclusion of escape characters from detection on XS is supposed to apply to EUC-JP as well.
Gets a string converted to $ocode.
For ASCII encoding, only 'base64' may be specified. With it, the string encoded in base64 will be returned.
On perl-5.8.0 and later, the return value does not have the utf-8 flag set and is a bytes string.
Replaces each substring of the form "&#dddd;" in the string with the character it represents.
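The replacement of "&#dddd;" can be pictured with a small core-Perl sketch. This is only an illustration of the idea, not the module's implementation: the helper name expand_numeric_refs is hypothetical, and it assumes "dddd" is a decimal Unicode code point.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Unicode::Japanese): replaces each
# decimal reference "&#dddd;" with the character at that code point.
sub expand_numeric_refs {
    my ($text) = @_;
    $text =~ s/&#(\d+);/chr($1)/ge;
    return $text;
}

# 12354 is U+3042 (HIRAGANA LETTER A).
my $expanded = expand_numeric_refs("A&#12354;B");

# Show the resulting code points instead of raw characters.
print join(' ', map { sprintf 'U+%04X', ord } split //, $expanded), "\n";
# prints "U+0041 U+3042 U+0042"
```

Text without any "&#dddd;" sequence passes through unchanged.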
Converts ZENKAKU to HANKAKU.
Converts HANKAKU to ZENKAKU.
Converts HIRAGANA to KATAKANA.
Converts KATAKANA to HIRAGANA.
$str: string (JIS)
Gets the string converted to ISO-2022-JP(JIS).
$str: string (EUC-JP)
Gets the string converted to EUC-JP.
$str: `bytes' string (UTF-8)
Gets the string converted to UTF-8.
On perl-5.8.0 and later, the return value does not have the utf-8 flag set and is a bytes string.
$str: string (UCS2)
Gets the string converted to UCS2.
$str: string (UCS4)
Gets the string converted to UCS4.
$str: string (UTF-16)
Gets the string converted to UTF-16(big-endian). BOM is not added.
$str: string (SJIS)
Gets the string converted to Shift_JIS(MS-SJIS/MS-CP932).
$str: string (SJIS/imode_EMOJI)
Gets the string converted to SJIS for i-mode. In VERSION 0.15, this method is an alias of sjis_imode2.
$str: string (SJIS/imode_EMOJI)
Gets the string converted to SJIS for i-mode. $str includes only basic pictographs, without extended pictographs.
$str: string (SJIS/imode_EMOJI)
Gets the string converted to SJIS for i-mode. $str includes both basic and extended pictographs.
$str: string (SJIS/dot-i_EMOJI)
Gets the string converted to SJIS for dot-i.
$str: string (SJIS/J-SKY_EMOJI)
Gets the string converted to SJIS for j-sky. In VERSION 0.15, this method is an alias of sjis_jsky2.
$str: string (SJIS/J-SKY_EMOJI)
Gets the string converted to SJIS for j-sky. $str includes Page 1 through Page 3.
$str: string (SJIS/J-SKY_EMOJI)
Gets the string converted to SJIS for j-sky. $str includes Page 1 through Page 6.
Splits the string by length($len).
On perl-5.8.0 and later, each element of the returned array has the utf-8 flag set.
$len: `visual width' of the string
Gets the length of the string. This method is offered as a substitute for perl's built-in length(). ZENKAKU characters are assumed to have a length of 2, regardless of whether the coding is SJIS or UTF-8.
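The `visual width' rule can be sketched in core Perl as follows. This is a rough illustration only: visual_width is a hypothetical helper, and the character ranges treated as ZENKAKU here (CJK punctuation and kana, common kanji, fullwidth forms) are a simplifying assumption, not the module's actual table.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Unicode::Japanese): counts each
# character in the assumed ZENKAKU ranges as width 2, all others as 1.
sub visual_width {
    my ($chars) = @_;    # a character string, not a bytes string
    my $width = 0;
    for my $ch (split //, $chars) {
        my $is_zenkaku =
            $ch =~ /[\x{3000}-\x{30FF}\x{4E00}-\x{9FFF}\x{FF01}-\x{FF60}]/;
        $width += $is_zenkaku ? 2 : 1;
    }
    return $width;
}

# "AB", two hiragana (U+3042, U+3044), one HANKAKU katakana (U+FF71).
print visual_width("AB\x{3042}\x{3044}\x{FF71}"), "\n";    # prints 7
```

Perl's built-in length() would report 5 for the same character string, which is why a width-aware count is useful when aligning or truncating text for display.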
@values: data array
Converts the array to a string in CSV format, then stores it in the instance. A newline ("\n") is added at the end of the string.
@values: data array
Splits the string, treating it as CSV format. Each newline ("\n") is removed before the split.
On perl-5.8.0 and later, the utf-8 flag of the return value depends on the icode given to the set method. If $s contains binary, the return value is bytes too. If $s contains any string, the return value has the utf-8 flag set.

Mapped as MS-CP932. Mapping table in the following URL is used.
ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT
If a character cannot be mapped to SJIS from Unicode, it will be converted to &#dddd; format.
Also, any unmapped character will be converted into "?" when converting to SJIS for mobile phones.
Converted to SJIS and then mapped to Unicode. Any non-SJIS character in the string will not be mapped correctly.
The portion involving "EMOJI" in F800 - F9FF is mapped to U+0FF800 - U+0FF9FF.
The portion involving "EMOJI" in F000 - F4FF is mapped to U+0FF000 - U+0FF4FF.
"J-SKY EMOJI" are encoded as follows: the "\e\$" (\x1b\x24) escape sequence, the first byte, the second byte, and "\x0f". A run of consecutive "EMOJI" with identical first bytes may be compressed by listing only the second bytes.
4500 - 47FF is mapped to U+0FFB00 - U+0FFDFF, treating the first and second bytes together as one EMOJI character.
Unicode::Japanese compresses "J-SKY_EMOJI" automatically when the first bytes in a sequence of "EMOJI" are identical.
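The three "EMOJI" ranges given above each map onto a Unicode range of the same size, so the mapping can be sketched as a fixed offset per range. The sketch below is an assumption drawn from those range endpoints (a linear, one-to-one mapping inside each range); emoji_to_unicode is a hypothetical helper, not a function of this module.

```perl
use strict;
use warnings;

# Each entry: [first code, last code, offset added to reach Unicode].
# The offsets are derived from the range endpoints stated above.
my @EMOJI_RANGES = (
    [ 0xF800, 0xF9FF, 0x0F0000 ],    # F800-F9FF -> U+0FF800-U+0FF9FF
    [ 0xF000, 0xF4FF, 0x0F0000 ],    # F000-F4FF -> U+0FF000-U+0FF4FF
    [ 0x4500, 0x47FF, 0x0FB600 ],    # 4500-47FF -> U+0FFB00-U+0FFDFF
);

# Hypothetical helper: takes a two-byte EMOJI code as an integer and
# returns the mapped code point, or nothing if no range applies.
sub emoji_to_unicode {
    my ($code) = @_;
    for my $range (@EMOJI_RANGES) {
        my ($first, $last, $offset) = @$range;
        return $code + $offset if $code >= $first && $code <= $last;
    }
    return;
}

printf "U+%06X\n", emoji_to_unicode(0xF89F);    # prints "U+0FF89F"
printf "U+%06X\n", emoji_to_unicode(0x4501);    # prints "U+0FFB01"
```

For the 4500 - 47FF range, the integer is formed from the first and second bytes of one EMOJI character, as described above.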

use Unicode::Japanese qw(PurePerl);
If the module is loaded with the 'PurePerl' keyword, it works in non-XS mode.

When a string includes such non-standard Shift_JIS characters, it will not be detected as SJIS. Also, getcode() and all conversion methods will not work correctly.

Copyright 2001-2004 SANO Taku (SAWATARI Mikage) and YAMASHINA Hio. All rights reserved.
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
Bug reports and comments to: mikage@cpan.org. Thank you.

Thanks very much to:
NAKAYAMA Nao
SUGIURA Tatsuki & Debian JP Project