Convert::Translit, transliterate, build_substitutes - Perl module for string conversion among numerous character sets
    $translator = new Convert::Translit($result_chset);
    $translator = new Convert::Translit($orig_chset, $result_chset);
    $translator = new Convert::Translit($orig_chset, $result_chset, $verbose);

    $result_st = $translator->transliterate($orig_st);
    $result_st = Convert::Translit::transliterate($orig_st);

    build_substitutes Convert::Translit();
    Convert::Translit::build_substitutes();
This module converts strings among the 8-bit character sets defined by IETF RFC 1345 (about 128 sets). The RFC document is included so you can look up character set names and aliases; the module also reads it when composing conversion maps. A function or constructor that fails returns undef.
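Looking up names and aliases in the RFC file can be scripted. The sketch below assumes the file marks each set with a "&charset" line followed by "&alias" lines; the excerpt is an abbreviated illustration in that style rather than a verbatim copy, and charset_aliases() is a hypothetical helper, not part of this module.

```perl
use strict;
use warnings;

# Abbreviated, illustrative excerpt in the style of the rfc1345 file.
my $excerpt = <<'END';
  &charset ISO_8859-2:1987
  &rem source: ECMA registry
  &alias iso-ir-101
  &alias ISO_8859-2
  &alias latin2
  &alias l2
  &charset ISO_8859-5:1988
  &alias cyrillic
END

# Hypothetical helper: returns a hash ref of charset name => [aliases].
sub charset_aliases {
    my ($text) = @_;
    my (%aliases, $current);
    for my $line (split /\n/, $text) {
        if ($line =~ /^\s*&charset\s+(\S+)/) {
            $current = $1;
            $aliases{$current} = [];
        }
        elsif (defined $current && $line =~ /^\s*&alias\s+(\S+)/) {
            push @{ $aliases{$current} }, $1;
        }
    }
    return \%aliases;
}

my $aliases = charset_aliases($excerpt);
for my $name (sort keys %$aliases) {
    print "$name: @{ $aliases->{$name} }\n";
}
```

To scan the real file, read Convert/Translit/rfc1345 into a string and pass it to the helper instead of the excerpt.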
transliterate() returns a string in $result_chset for an argument string in $orig_chset, transliterating by a map composed by new().
build_substitutes() rebuilds the file "substitutes", which contains character definitions and the approximate substitutions used when a character in $orig_chset isn't defined in $result_chset. For example, "Latin capital A" may be substituted for "Latin capital A with ogonek". It takes a long time to rebuild this file, but you should never need to. Its only source of information is the file "rfc1345".
new() creates a new object for converting from $orig_chset to $result_chset, which are names (or aliases) of 8-bit character sets defined in RFC 1345. With only one argument, $orig_chset is assumed to be "ascii". With three arguments, the third is a verbosity flag. Verbose output lists approximate substitutions and other compromises.
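Because a failed constructor returns undef, it is worth checking the result before using it. The following is a minimal sketch: make_translator() is a hypothetical wrapper (not part of the module) that also returns undef when Convert::Translit is not installed, and "no-such-charset" is a deliberately invalid name used only for illustration.

```perl
use strict;
use warnings;

# Hypothetical wrapper: returns a translator object, or undef if the
# module cannot be loaded or the constructor fails.
sub make_translator {
    my @chsets = @_;
    return unless eval { require Convert::Translit; 1 };
    my $translator = eval { Convert::Translit->new(@chsets) };
    return $translator;
}

my %wanted = (
    'ascii to Latin2' => [ 'Latin2' ],            # one argument: from "ascii"
    'Latin2 to ascii' => [ 'Latin2', 'ascii' ],   # explicit origin and result
    'invalid name'    => [ 'no-such-charset' ],   # expected to fail
);

for my $label (sort keys %wanted) {
    my $translator = make_translator(@{ $wanted{$label} });
    print "$label: ", (defined $translator ? "ready" : "unavailable"), "\n";
}
```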
The transliterate() method is the same as the function of that name.
The build_substitutes() method is the same as the function of that name.
    Convert/Translit/rfc1345 (IETF RFC 1345, June 1992)
    Convert/Translit/substitutes
Only one-to-one character mapping is done, so characters with diacritics (like A-ogonek) are never converted to (letter character, diacritic character) pairs; they are simplified instead. If no approximate substitute is available, then an unrelated substitute is chosen, preferably with the same code value. Undefined $orig_chset characters are translated to a chosen indicator character. Transliteration is not guaranteed to be reversible when substitutions were required. An $orig_chset defined as 7-bit is assumed to be repeated to make an 8-bit set (in the style of "extended ascii"); no such adjustment is made for $result_chset. The few mistakes in the RFC document are corrected in the module.
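The one-to-one mapping with approximate substitutes can be pictured with a toy model. This is a sketch of the idea only, not the module's implementation: the tables, the code values in them, and the names build_map() and toy_transliterate() are all hypothetical, and the "unrelated substitute" fallback is not modeled (anything unmapped becomes the indicator character).

```perl
use strict;
use warnings;

# Hypothetical toy tables: code value => character name.
my %orig_set = (
    0x41 => 'LATIN CAPITAL LETTER A',
    0xA1 => 'LATIN CAPITAL LETTER A WITH OGONEK',
    0xA3 => 'LATIN CAPITAL LETTER L WITH STROKE',
);
my %result_set = (
    0x41 => 'LATIN CAPITAL LETTER A',
    0x4C => 'LATIN CAPITAL LETTER L',
    0x3F => 'QUESTION MARK',
);

# Approximate substitutes: name missing from the result set => simpler name.
my %substitute = (
    'LATIN CAPITAL LETTER A WITH OGONEK' => 'LATIN CAPITAL LETTER A',
    'LATIN CAPITAL LETTER L WITH STROKE' => 'LATIN CAPITAL LETTER L',
);

my $indicator = 0x3F;   # assumed indicator for undefined characters: "?"

# Compose a 256-entry map from origin code value to result code value.
sub build_map {
    my %code_of = reverse %result_set;     # character name => result code
    my @map = ($indicator) x 256;          # default: indicator character
    for my $code (keys %orig_set) {
        my $name = $orig_set{$code};
        $name = $substitute{$name}
            if !exists $code_of{$name} && exists $substitute{$name};
        $map[$code] = $code_of{$name} if exists $code_of{$name};
    }
    return \@map;
}

# Apply the map one character at a time.
sub toy_transliterate {
    my ($map, $st) = @_;
    return join '', map { chr $map->[ ord $_ ] } split //, $st;
}

my $map = build_map();
print toy_transliterate($map, "\x41\xA1\xA3\x7F"), "\n";   # prints "AAL?"
```

Both A and A-ogonek land on the same result character, which is why a later conversion in the opposite direction cannot restore the original string.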
Convert Russian language text from IBM to ASCII encoding:

    $xxx = new Convert::Translit("EBCDIC-Cyrillic", "Cyrillic");
    $ascii_cyr_st = $xxx->transliterate($ibm_cyr_st);

Convert from plain ASCII (default $orig_chset) to Latin2 (Central European):

    $yyy = new Convert::Translit("Latin2");
    $cnt_eur_st = $yyy->transliterate($ascii_st);

Since plain ASCII is a subset of Latin2, nothing is lost in transliteration. But going the other direction requires numerous simplifications:

    $zzz = new Convert::Translit("Latin2", "ascii");
    $ascii_st = $zzz->transliterate($cnt_eur_st);

Back to Latin2 again, although substitutions probably mean ($again ne $cnt_eur_st):

    $again = $yyy->transliterate($ascii_st);

The example.pl script converts a Polish language phrase from Latin2 to EBCDIC-US.
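The loss in a Latin2 to ASCII round trip can be seen without the module at all. The sketch below stands in for the two translators with a hand-written tr/// table covering only three characters; simplify_to_ascii() is a hypothetical helper, and the byte values are the ISO 8859-2 codes for the letters in the Polish place name used as sample input.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a Latin2-to-ASCII translator: maps
# L-with-stroke (0xA3), o-acute (0xF3) and z-acute (0xBC) to plain letters.
sub simplify_to_ascii {
    my ($st) = @_;
    (my $out = $st) =~ tr/\xA3\xF3\xBC/Loz/;
    return $out;
}

# The Polish city name Lodz, with its diacritics, as ISO 8859-2 bytes.
my $cnt_eur_st = pack 'C*', 0xA3, 0xF3, 0x64, 0xBC;

my $ascii_st = simplify_to_ascii($cnt_eur_st);

# ASCII is a subset of Latin2, so converting back changes nothing.
my $again = $ascii_st;

print "ascii form: $ascii_st\n";
print "round trip ", ($again ne $cnt_eur_st ? "lost" : "kept"),
      " the diacritics\n";
```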
Requires Perl version 5. Developed with MacPerl on Macintosh 68040 OS 7.6.1. Tested on Sun Unix 4.1.3.
Genji Schmeder <email@example.com>
Enjoy in good health. Cieszcie sie dobrym zdrowiem. Que gozen con salud. Benutze es heilsam gern! Genki dewa, yorokobi nasai.
Version 1.03 dated 5 November 1997. Copyright (c) 1997 Genji Schmeder. All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
Chris Leach, author of EBCDIC.pm
Keld Simonsen, author of RFC 1345