
NAME

Convert::Translit, transliterate, build_substitutes - Perl module for string conversion among numerous character sets

SYNOPSIS

use Convert::Translit;

  $translator = new Convert::Translit($result_chset);
  $translator = new Convert::Translit($orig_chset, $result_chset);
  $translator = new Convert::Translit($orig_chset, $result_chset, $verbose);

  $result_st = $translator->transliterate($orig_st);
  $result_st = Convert::Translit::transliterate($orig_st);

  build_substitutes Convert::Translit();

  Convert::Translit::build_substitutes();

DESCRIPTION

This module converts strings among the 8-bit character sets defined by IETF RFC 1345 (about 128 sets). The RFC document is included so you can look up character set names and aliases; the module also reads it when composing conversion maps. Functions and constructors that fail return undef.

Export_OK Functions:

transliterate()

returns a string in $result_chset for an argument string in $orig_chset, transliterating by a map composed by new().

build_substitutes()

rebuilds the file "substitutes", which contains character definitions and the approximate substitutions used when a character in $orig_chset isn't defined in $result_chset. For example, "Latin capital A" may be substituted for "Latin capital A with ogonek". Rebuilding this file takes a long time, but you should never need to do it. Its only source of information is the file "rfc1345".

Object methods:

new()

creates a new object for converting from $orig_chset to $result_chset, both being names (or aliases) of 8-bit character sets defined in RFC 1345. If only one argument is given, $orig_chset is assumed to be "ascii". If three arguments are given, the third is a verbosity flag. Verbose output lists approximate substitutions and other compromises.

transliterate()

is the same as the function of that name.

build_substitutes()

is the same as the function of that name.
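Since failing functions and objects return undef, a caller may want to check each result before using it. The following is a minimal sketch, assuming the module is installed and that "Latin2" and "ascii" are accepted set names, as in the examples in this document. The sample bytes are a Polish word in ISO 8859-2 chosen for illustration, and the error messages are this sketch's own, not the module's.

```perl
use strict;
use warnings;

# Run everything inside eval so that a missing install or a failed call
# is reported instead of aborting the script.
my $status = eval {
    require Convert::Translit;

    # The third argument is the verbosity flag described under new().
    my $to_ascii = Convert::Translit->new("Latin2", "ascii", 1);
    defined $to_ascii
        or die "could not build a Latin2 -> ascii map\n";

    # A Polish word with diacritics, written as Latin2 byte escapes.
    my $result_st = $to_ascii->transliterate("Za\xBF\xF3\xB3\xE6");
    defined $result_st
        or die "transliterate() returned undef\n";

    print "$result_st\n";
    "ok";
} || "not run: $@";

print STDERR $status, "\n" unless $status eq "ok";
```

The require is done at run time only so that the sketch degrades gracefully; a normal script would simply say "use Convert::Translit;" as in the synopsis.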

FILES

 Convert/Translit/rfc1345  (IETF RFC 1345, June 1992)
 Convert/Translit/substitutes

METHODOLOGY

Only one-to-one character mapping is done, so characters with diacritics (like A-ogonek) are never converted to (letter character, diacritic character) pairs; instead they are simplified. If no approximate substitute is available, an unrelated substitute is chosen, preferably one with the same code value. Characters undefined in $orig_chset are translated to a chosen indicator character. Transliteration is not guaranteed commutative (a round trip may not restore the original string) when substitutions were required. An $orig_chset defined as 7-bit is assumed to be repeated to make an 8-bit set (in the style of "extended ascii"); no such adjustment is made for $result_chset. The few mistakes in the RFC document are corrected in the module.
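The mapping rules above can be illustrated with a small stand-alone sketch. It does not use the module or the rfc1345 file: the tables %orig_set, %result_set and %substitute, the helpers build_map() and apply_map(), and the choice of "?" as indicator are all hypothetical toy data invented here. It also simplifies one rule: when neither an exact match nor an approximate substitute exists, it only tries the same code value and otherwise falls back to the indicator, whereas the module may pick some other unrelated substitute.

```perl
use strict;
use warnings;

# Toy character sets (code value => character name); not the real tables.
my %orig_set = (
    0x23 => 'POUND SIGN',                          # toy entry
    0x41 => 'LATIN CAPITAL LETTER A',
    0xA1 => 'LATIN CAPITAL LETTER A WITH OGONEK',
    0xA4 => 'CURRENCY SIGN',
);
my %result_set = (
    0x23 => 'NUMBER SIGN',
    0x3F => 'QUESTION MARK',
    0x41 => 'LATIN CAPITAL LETTER A',
);

# Hypothetical approximate substitutions (name => simpler name).
my %substitute = (
    'LATIN CAPITAL LETTER A WITH OGONEK' => 'LATIN CAPITAL LETTER A',
);

my $indicator = 0x3F;    # assumed indicator character: "?"

# Build a 256-entry table mapping each original code to one result code.
sub build_map {
    my ($orig, $result, $subst, $indicator) = @_;
    my %code_of = reverse %$result;          # name => code in the result set
    my @map = map { $indicator } 0 .. 255;   # undefined codes -> indicator

    for my $code (keys %$orig) {
        my $name = $orig->{$code};
        if (exists $code_of{$name}) {
            $map[$code] = $code_of{$name};               # exact match
        }
        elsif (exists $subst->{$name}
               && exists $code_of{ $subst->{$name} }) {
            $map[$code] = $code_of{ $subst->{$name} };   # approximate substitute
        }
        elsif (exists $result->{$code}) {
            $map[$code] = $code;     # unrelated substitute, same code value
        }
        # otherwise the indicator set above is kept (a simplification)
    }
    return \@map;
}

# Apply the table one character at a time: strictly one-to-one.
sub apply_map {
    my ($map, $string) = @_;
    return join '', map { chr $map->[ ord $_ ] } split //, $string;
}

my $map = build_map(\%orig_set, \%result_set, \%substitute, $indicator);
print apply_map($map, "A\xA1\xA4#"), "\n";    # prints "AA?#"
```

In this toy run A-ogonek is simplified to A, the currency sign has no counterpart and becomes the indicator, and the pound sign is replaced by the unrelated number sign that shares its code value.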

EXAMPLES

  Convert Russian language text from IBM to ASCII encoding:
  $xxx = new Convert::Translit("EBCDIC-Cyrillic", "Cyrillic");
  $ascii_cyr_st = $xxx->transliterate($ibm_cyr_st);

  Convert from plain ASCII (default $orig_chset) to Latin2 (Central European):
  $yyy = new Convert::Translit("Latin2");
  $cnt_eur_st = $yyy->transliterate($ascii_st);

  Since plain ASCII is a subset of Latin2, nothing is lost in transliteration.
  But going the other direction requires numerous simplifications:
  $zzz = new Convert::Translit("Latin2", "ascii");
  $ascii_st = $zzz->transliterate($cnt_eur_st);

  Back to Latin2 again, although substitutions probably mean ($again ne $cnt_eur_st):
  $again = $yyy->transliterate($ascii_st);

  The example.pl script converts a Polish language phrase from Latin2 to EBCDIC-US.
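Why the round trip in the examples above is lossy can be seen without the module. In the following sketch a hand-written tr/// table stands in for the Latin2-to-ascii map; the four substitutions are assumptions chosen for illustration and may differ from those the module actually selects.

```perl
use strict;
use warnings;

# A Polish word with diacritics, written as Latin2 byte escapes.
my $cnt_eur_st = "Za\xBF\xF3\xB3\xE6";

# Latin2 -> ascii: stand-in simplifications for four letters.
(my $ascii_st = $cnt_eur_st) =~ tr/\xBF\xF3\xB3\xE6/zolc/;

# ascii -> Latin2: every ASCII code keeps its value, so nothing changes.
my $again = $ascii_st;

my $verdict = ($again ne $cnt_eur_st) ? "changed" : "preserved";
print "round trip $verdict the string\n";
```

Once the diacritics have been simplified away, the map back to Latin2 has no information from which to restore them.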

PORTABILITY

Requires Perl version 5. Developed with MacPerl on Macintosh 68040 OS 7.6.1. Tested on Sun Unix 4.1.3.

AUTHOR

Genji Schmeder <genji@community.net>

  Enjoy in good health.
  Cieszcie sie dobrym zdrowiem.
  Que gozen con salud.
  Benutze es heilsam gern!
  Genki dewa, yorokobi nasai.

COPYRIGHT

Version 1.03 dated 5 November 1997. Copyright (c) 1997 Genji Schmeder. All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

ACKNOWLEDGEMENTS

  Chris Leach, author of EBCDIC.pm
  Keld Simonsen, author of RFC 1345