Martin Hosken > SIL-Shoe-1.37 > Scripts/sh2sh

TITLE

sh2sh - convert Shoebox data to Unicode

SYNOPSIS

    sh2sh -s settings_dir [-c codepage] [-e encs] infile [outfile]

Converts a Shoebox database to another Shoebox database, converting the data to Unicode as it goes.

OPTIONS

    -b              Delete empty fields
    -c codepage     Set default codepage conversion, otherwise none
    -e enc,enc      Add Encode:: subsets in Perl 5.8.1
    -f type         Force database type
    -n normalform   Normalize Unicode text to D, C, KD or KC form
    -s dir          Directory to find .typ files in [.]
    -t type         Generate Toolbox database of given type

If outfile is omitted, the output file takes the name of the input file with its extension replaced by .db1. This allows a user to drop a data file onto a shortcut.
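
The naming rule above can be illustrated with a short Python sketch. It is not taken from sh2sh itself (which is a Perl script); the function name default_outfile is hypothetical, and the treatment of input names that have no extension is an assumption.

```python
from pathlib import PurePath

def default_outfile(infile: str) -> str:
    """Return the input path with its final extension replaced by .db1."""
    # with_suffix() swaps the last extension; when the name has no extension
    # it appends .db1 instead (an assumption about what sh2sh would do).
    return str(PurePath(infile).with_suffix(".db1"))

print(default_outfile("lexicon.txt"))  # lexicon.db1
```

Under this rule an input file that already ends in .db1 would map onto itself, so in that case it seems prudent to give an explicit outfile.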

DESCRIPTION

sh2sh converts a Shoebox (or Toolbox) database to Unicode.

Using sh2sh involves two aspects: preparing for conversion by supplying information about encoding conversion, and running the program with the appropriate command line options.

Running sh2sh

Here we list the various command line options and give further details on each.

-b

Any empty fields in the input file will be deleted.

-c

Specifies the default codepage to be used when converting data. In effect it specifies that sh2sh should act as though it were running on a system with the given default codepage. This means that data in languages with no given encoding conversion will be converted using this codepage.
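
The fallback behaviour described for -c can be pictured with the following Python sketch. It is only an illustration, not the sh2sh implementation: the function to_unicode, the mappings dictionary and the language names are hypothetical, and treating unconverted data as UTF-8 when no codepage is available is an assumption.

```python
def to_unicode(raw: bytes, language: str, mappings: dict, default_codepage=None) -> str:
    """Decode raw field data using the language's own mapping if it has one,
    otherwise the default codepage given with -c."""
    codec = mappings.get(language) or default_codepage
    if codec is None:
        # No mapping and no -c: no conversion (assumed here to mean UTF-8 data).
        return raw.decode("utf-8")
    return raw.decode(codec)

mappings = {"Greek": "cp1253"}  # hypothetical per-language conversions

print(to_unicode(b"\xe1", "Greek", mappings, "cp1252"))      # α (own mapping wins)
print(to_unicode(b"caf\xe9", "French", mappings, "cp1252"))  # café (falls back to -c)
```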

-e

Perl has internal support for a large number of industry standard encodings. This option specifies which additional sets to load beyond the default set. Values include:

  Byte - standard ISO 8859 type single byte encodings
  CN   - Mainland China encodings including cp 936, GB 12345 and GB 2312
  JP   - Japanese encodings including cp 932 and ISO 2022
  KR   - Korean encodings including cp 949
  TW   - Taiwanese encodings including cp 950
  HanExtra - more Chinese encodings including GB 18030
  JIS2K - More Japanese encodings
  Ebcdic - surely not!
  Symbols - various symbol encodings

See man Encode::Supported or the corresponding module documentation for details of what is supported on your Perl installation.

-f

Rather than analysing the data in the file using the database type specified in the database, it is possible to specify that a different one should be used.

-n

Particularly with respect to Roman script languages involving letters with diacritics, there are two options as to how these are to be stored. They can either be stored as a single code (if such exists in Unicode) in which case the form to be asked for is C (composed), otherwise they can be stored using separate codes for base and diacritic and the normal form is D (decomposed). There are other normal forms which should only be used if you really know what you are doing (and then you will know why they shouldn't be used).
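
The difference between the C and D forms can be seen in a small Python example using the standard unicodedata module. The helper normalize_text and the table mapping the -n values onto Python's form names are illustrative assumptions, not part of sh2sh.

```python
import unicodedata

# Map the -n option values onto the names Python uses for the same forms.
FORMS = {"C": "NFC", "D": "NFD", "KC": "NFKC", "KD": "NFKD"}

def normalize_text(text: str, form: str) -> str:
    """Normalize text to the Unicode normal form named by a -n style value."""
    return unicodedata.normalize(FORMS[form], text)

composed = normalize_text("e\u0301", "C")    # e + combining acute -> single code
decomposed = normalize_text("\u00e9", "D")   # single code -> base + diacritic

print(len(composed), len(decomposed))  # 1 2
```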

-s

sh2sh requires access to information about the structure of the database and language information. This is held in files in the same directory as the .prj project file used when running Shoebox/Toolbox.

-t

Gives the name of the database type to be given to the output file. Since the encoding has changed, the old database type is no longer appropriate for the output data. If a new database type has already been created that refers to the appropriate Unicode-based languages, give its name here. In order to use the old database type name as part of the new name, all occurrences of the string %T in the -t option will be replaced with the old database type name.
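
The %T replacement amounts to a plain string substitution, as in this Python sketch. The function name and the database type names used are hypothetical examples.

```python
def output_type_name(option: str, old_type: str) -> str:
    """Expand every %T in the -t option value to the old database type name."""
    return option.replace("%T", old_type)

# Hypothetical type names, for illustration only.
print(output_type_name("%T Unicode", "Dictionary"))  # Dictionary Unicode
print(output_type_name("Lexicon", "Dictionary"))     # Lexicon
```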

Preparing for Conversion

The basic need is to be able to specify how to convert text in a particular language into Unicode. This can be done by specifying a conversion mapping in each language file. Shoebox and Toolbox do not have a UI for specifying such conversion information, so we add information to the options/description field. The codepage specification takes the form:

  \codepage = value

The specification needs to be on a line on its own. The value can take a number of forms.

name

A mapping name either from the set of names supported by the Perl Encode module, or specified in an SIL Converters repository.

filename.tec

The path and filename of a TECkit binary mapping file. The path is relative to the settings directory.

none

No mapping should be done. The data is assumed to be in UTF-8 encoding.
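
To make the three forms of the value concrete, here is a Python sketch that finds a codepage specification in a description field and classifies it. It is an illustration only: the function names are hypothetical, and the tolerance for whitespace around the equals sign and the case-insensitive .tec test are assumptions rather than documented sh2sh behaviour.

```python
import re

# A \codepage specification on a line of its own.
CODEPAGE_LINE = re.compile(r"^[ \t]*\\codepage[ \t]*=[ \t]*(.+?)[ \t]*$", re.MULTILINE)

def find_codepage(description: str):
    """Return the value of the first \\codepage line, or None if there is none."""
    match = CODEPAGE_LINE.search(description)
    return match.group(1).strip() if match else None

def classify(value: str):
    """Classify a codepage value as 'none', 'teckit' (a .tec file) or 'name'."""
    if value == "none":
        return ("none", None)       # data is assumed to be UTF-8 already
    if value.lower().endswith(".tec"):
        return ("teckit", value)    # path is relative to the settings directory
    return ("name", value)          # Encode or SIL Converters mapping name

description = "Legacy Greek font data\n\\codepage = cp1253\n"
print(classify(find_codepage(description)))  # ('name', 'cp1253')
```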
