{\rtf1\ansi\ansicpg1252 {\fonttbl\f0\fswiss\fcharset0 Arial;\f1\fswiss\fcharset0 Arial;\f2\froman\fcharset0 Times New Roman;\f3\froman\fcharset0 Times New Roman;\f4\fmodern\fcharset0 Courier;\f5\ftech\fcharset0 Symbol;\f6\fmodern\fcharset0 Courier-Oblique;\f7\froman\fcharset0 Times New Roman;}\pard\plain\ql\f0\fs20 {\fs16 Typesetting Interlinear Text }{\b\f1\fs16 Page 1 of 6\par }{\fs12 Martin Hosken June 13, 2000 Rev: 5\par }{\b\f1\fs34 Typesetting Interlinear Text\par }{\fs34 An Interlinear text to RTF converter\par }{\i\f2\fs22 Martin Hosken,\par SIL Non-Roman Script Initiative (NRSI)\par }{\b\f1\fs28 Introduction\par }{\f3\fs22 The laying out and typesetting of interlinear text is one of the more problematic aspects of\par linguistics. It has been one that I have been addressing, on and off, for quite a few years, and have\par only really seen any success now. This program is designed to aid with that task by automating\par the process of converting interlinear text, held in Shoebox, into RTF for inclusion in a Word\par document. There are actually two programs as part of this system. The first converts the\par interlinear text into an intermediate form of standard format and the second is an enhanced\par Shoebox Standard Format to RTF converter.\par This paper will discuss the installation and use of the programs and then go on to a discussion of\par the complexity of the interlinear typesetting problem and the approach taken in this system. This\par may be of use for those pushing the extremes of interlinearising.\par One pre-requisite is that users have a program for handling }{\f4\fs18 .tar.gz }{\f3\fs22 files on their system. WinZip\par is amply sufficient for the task, as are many of the modern zip handling type programs. 
You will\par also need a 32-bit Windows environment with support for long filenames.\par }{\b\f1\fs28 Installation\par }{\f3\fs22 The software is written in Perl, an interpreted language designed with text processing and system\par interaction in mind. This section will lead you through installing Perl, pmake and then the\par interlinear processing system.\par }{\b\f1\fs24 Installing Perl\par }{\f3\fs22 The company ActiveState have helped the Perl user community greatly by producing a very nice\par installation kit for Perl. It is free and is called ActivePerl. It is available from their website:\par http://www.activestate.com/ActivePerl. Download the installation kit and run it to install Perl.\par Feel free to follow the defaults, unless you have other plans.\par A basic Perl installation takes up about 30 MB, most of it support libraries and help\par files.}\par {\b\f1\fs24 Installing the Interlinear system\par }{\f3\fs22 Installing the interlinear system is not difficult. You extract everything from a .tar.gz file, run\par }{\f4\fs18 perl Makefile.PL}{\f3\fs22 , run }{\f4\fs18 pmake install}{\f3\fs22 . 
The instructions are in the readme.txt file, but we will\par include them here:\par }{\f5\fs22 • }{\f3\fs22 Open the installation file using WinZip, or the like.\par }{\f5\fs22 • }{\f3\fs22 Extract the files into a temporary directory, open a DOS box (Command prompt) and\par change to that directory.\par }{\f5\fs22 • }{\f3\fs22 Follow the instructions in the readme file, adjusting }{\f4\fs18 make }{\f3\fs22 to }{\f4\fs18 pmake}{\f3\fs22 :\par }{\f4\fs22 o }{\f3\fs22 Type: }{\f4\fs18 perl Makefile.PL\par }{\f4\fs22 o }{\f3\fs22 Type: }{\f4\fs18 pmake install\par }{\f4\fs22 o }{\f3\fs22 Type: }{\f4\fs18 pmake realclean\par }{\f3\fs22 This process will add various files to your Perl support libraries and also add two batch files to\par your Perl binary directory (thus making them available in your PATH).\par Notice that this whole installation process works across platforms. In the case of a Mac you\par should be able to install by installing the CPAN module first and then dropping the .tar.gz file on\par the install_me droplet. On Unix, you install as per above, but can use }{\f4\fs18 make }{\f3\fs22 instead of }{\f4\fs18 pmake}{\f3\fs22 .\par }{\b\f1\fs28 Usage\par }{\f3\fs22 In this section we examine how to use this typesetting system and give an example from the\par Shoebox sample files.\par The programs use knowledge of the database structure to work out what to do with each marker\par in a database. For this reason, every program must be told where the appropriate Shoebox settings\par directory is for this database. This is the directory which holds the project file (}{\f4\fs18 .prj}{\f3\fs22 ) that\par you use when working with your interlinear database.\par Processing an interlinear file to convert it to an intermediate }{\f4\fs18 .sfm }{\f3\fs22 format is straightforward. But\par the conversion of that }{\f4\fs18 .sfm }{\f3\fs22 file to RTF needs a modicum of care. 
The RTF conversion process\par uses the information in the database type file (}{\f4\fs18 .typ}{\f3\fs22 ) to indicate how various markers should be\par converted to styles in the final output. There are three things that need to be right about this\par information, otherwise you may get some very odd-looking output.\par }{\f5\fs22 • }{\f3\fs22 Add a marker: }{\f4\fs18 _RTF }{\f3\fs22 with a marker name of }{\f4\fs18 _RTFONLY_ }{\f3\fs22 (case is important), which is used\par by the system to pass pure RTF from the interlinear converter through to the output.\par }{\f5\fs22 • }{\f3\fs22 Add a marker: }{\f4\fs18 _INT }{\f3\fs22 with a marker name of }{\f4\fs18 Interlinear Block}{\f3\fs22 . This should be a\par paragraph marker. The marker name (and so style name) is unimportant and indicates the\par name of the paragraph style used for interlinear text blocks.\par }{\f5\fs22 • }{\f3\fs22 All the markers used in the interlinear process (e.g. }{\f4\fs18 \\t}{\f3\fs22 , }{\f4\fs18 \\m}{\f3\fs22 , }{\f4\fs18 \\ps}{\f3\fs22 , etc.) should be set to be\par }{\i\f2\fs22 character }{\f3\fs22 styles rather than paragraph styles. This is important to ensure that you don’t end up\par with paragraph marks in the middle of the interlinear blocks.\par Nearly all problems with garbage output are due to not following these instructions in the\par database properties before converting to RTF.\par Once the database type file is correct, we can start to process some data. There are two programs\par used:}\par {\f4\fs18 shintr -s }{\i\f6\fs18 directory }{\i\f2\fs22 inputfile outputfile\par }{\f3\fs22 which converts the }{\i\f2\fs22 inputfile }{\f3\fs22 interlinear database into a temporary }{\i\f2\fs22 outputfile }{\f3\fs22 ready for RTF\par conversion. }{\i\f6\fs18 directory }{\f3\fs22 indicates where the Shoebox settings may be found (and so the database\par type file). 
A quick summary help may be found by typing: }{\f4\fs18 shintr }{\f3\fs22 on its own.\par The next process takes the temporary file from }{\f4\fs18 shintr }{\f3\fs22 and creates an RTF file from it:\par }{\f4\fs18 sh_rtf -s }{\i\f6\fs18 directory }{\i\f2\fs22 inputfile outputfile\par }{\f3\fs22 The command line is of the same structure as the other program. This takes the }{\i\f2\fs22 inputfile }{\f3\fs22 and\par converts it to an }{\i\f2\fs22 outputfile }{\f3\fs22 in RTF format (so it would help for it to have a }{\f4\fs18 .rtf }{\f3\fs22 extension). The\par data need not be output from }{\f4\fs18 shintr}{\f3\fs22 , but output from }{\f4\fs18 shintr }{\f3\fs22 must be processed using }{\f4\fs18 sh_rtf\par }{\f3\fs22 rather than the normal RTF output from Shoebox, due to the enhancements provided in }{\f4\fs18 sh_rtf}{\f3\fs22 .\par }{\b\f1\fs24 Walkthrough\par }{\f3\fs22 The following is a walkthrough example based on the directory\par }{\f4\fs18 ...\\Shoebox\\Samples\\Adapt\\Adapt2b }{\f3\fs22 which contains an example database which we can play\par with for output to RTF.\par }{\f5\fs22 • }{\f3\fs22 Launch the }{\f4\fs18 .prj }{\f3\fs22 file (}{\f4\fs18 EngAdapt.prj}{\f3\fs22 ) which will open the various databases involved in\par this example.\par }{\f5\fs22 • }{\f3\fs22 Go to }{\b\f7\fs22 P}{\f3\fs22 rojects/}{\b\f7\fs22 D}{\f3\fs22 atabase Types... (from the Projects menu), and select the Interlinear\par database type and press }{\b\f7\fs22 M}{\f3\fs22 odify...\par }{\f5\fs22 • }{\f3\fs22 Click }{\b\f7\fs22 A}{\f3\fs22 dd... and add a new marker: }{\f4\fs18 _INT }{\f3\fs22 with a field name of }{\f4\fs18 Interlinear Block}{\f3\fs22 . You\par may leave everything else about the marker as you find it. Click }{\b\f7\fs22 O}{\f3\fs22 k.\par }{\f5\fs22 • }{\f3\fs22 Click }{\b\f7\fs22 A}{\f3\fs22 dd... and add a new marker: }{\f4\fs18 _RTF }{\f3\fs22 with a field name of }{\f4\fs18 _RTFONLY_ }{\f3\fs22 (one word, all\par in capitals and the underscore letter). 
Again, you may leave everything else as you find it.\par Click }{\b\f7\fs22 O}{\f3\fs22 k.\par }{\f5\fs22 • }{\f3\fs22 For each of the markers: }{\f4\fs18 e}{\f3\fs22 , }{\f4\fs18 e1}{\f3\fs22 , }{\f4\fs18 e2}{\f3\fs22 , }{\f4\fs18 mb}{\f3\fs22 , }{\f4\fs18 ps}{\f3\fs22 , }{\f4\fs18 t }{\f3\fs22 repeat the following process:\par }{\f4\fs22 o }{\f3\fs22 Click on the marker\par }{\f4\fs22 o }{\f3\fs22 Press Modify\par }{\f4\fs22 o }{\f3\fs22 Press the }{\f4\fs18 Character }{\f3\fs22 radio button under Style to Export\par }{\f4\fs22 o }{\f3\fs22 Press }{\b\f7\fs22 O}{\f3\fs22 k.\par }{\f5\fs22 • }{\f3\fs22 Press }{\b\f7\fs22 O}{\f3\fs22 k to exit the database type properties dialog, and click }{\b\f7\fs22 C}{\f3\fs22 lose to exit the database\par types dialog.\par }{\f5\fs22 • }{\f3\fs22 Exit Shoebox (}{\b\f7\fs22 F}{\f3\fs22 ile/}{\b\f7\fs22 E}{\f3\fs22 xit)\par Now that we have set up the database type file for this project, we can convert interlinear data as\par often as we want without having to go through that process again.\par }{\f5\fs22 • }{\f3\fs22 Open a DOS box (Command Prompt) and change to the directory we are working in\par (}{\f4\fs18 .../Shoebox/Samples/Adapt/Adapt2b}{\f3\fs22 )\par }{\f5\fs22 • }{\f3\fs22 Type: }{\f4\fs18 shintr -s . mark14b.txt temp.sfm\par }{\f5\fs22 • }{\f3\fs22 Type: }{\f4\fs18 sh_rtf -s . temp.sfm mark14b.rtf\par }{\f5\fs22 • }{\f3\fs22 Type: }{\f4\fs18 start mark14b.rtf }{\f3\fs22 to start Word and open the RTF file.}\par {\f3\fs22 Assuming everything has gone to plan, you should see various blocks of interlinear text\par interspersed with comments about the various rearrangement and generation rules used in the\par example.\par }{\b\f1\fs28 Approach Used\par }{\f3\fs22 There are three main approaches to rendering interlinear text in RTF. 
They are:\par }{\f5\fs22 • }{\f3\fs22 Use monospaced fonts and use the monospacing to line things up\par }{\f5\fs22 • }{\f3\fs22 Use tables with each interlinear element being a cell\par }{\f5\fs22 • }{\f3\fs22 Use equation (EQ) fields which allow the stacking of elements, with each interlinear block\par appearing as one letter.\par We use the last approach in this system for the following reasons:\par }{\f5\fs22 • }{\f3\fs22 The interlinear blocks automatically wrap without having to measure any widths.\par }{\f5\fs22 • }{\f3\fs22 They are much more manipulable when laying out text.\par }{\f5\fs22 • }{\f3\fs22 They can be used within tables (as opposed to being tables in their own right)\par Therefore, if you were to click in the example, on any interlinear element, the whole block would\par be selected, but only that one block. Also, if you were to change the viewing options to show field\par codes, you would see a horrific mess, which is the underlying work that the interlinear converter\par has done for you. Perhaps it would be better to switch back to normal view fairly smartly!\par }{\b\f1\fs24 Problems\par }{\f3\fs22 Due to a bug in Word}{\f3\fs14 1}{\f3\fs22 , whereby it is not possible to store commas or parentheses in an EQ field\par (despite the instructions in Word’s help to the contrary), }{\f4\fs18 shintr }{\f3\fs22 deletes any commas and parentheses it finds.\par You may indicate that it should not do this by using the }{\f4\fs18 -c }{\f3\fs22 option, whereupon, if you are not\par using Word 2000, all the commas and parentheses should reappear, hopefully not messing up the\par layout.\par If you look carefully at the example output, for example under the word }{\b\f7\fs22 de }{\f3\fs22 in the first paragraph,\par you will see that a whole phrase has been rendered under one word rather than across the whole\par phrase. 
While this is not ideal, it is an artifact of the way that EQ fields work and the complexity\par of phrase level interlinear conversion, a subject we will address in a later section.\par }{\f3\fs12 1 }{\f3\fs19 I have only seen this bug in Word 2000.\par }{\b\f1\fs28 Advanced Usage\par }{\f3\fs22 In this section we look at the various ways in which the tools we have discussed may be\par configured.\par }{\b\f1\fs24 shintr\par }{\f4\fs18 shintr }{\f3\fs22 has various command line options to control such things as inter-column spacing between\par interlinear blocks and inter-line spacing between the elements in a block. Both of these values are\par specified as an integer number of points (1/72 of an inch). Both have defaults, which may be\par overridden using the command line options }{\f4\fs18 -h }{\f3\fs22 and }{\f4\fs18 -l }{\f3\fs22 respectively. Thus, the command line:\par }{\f4\fs18 shintr -h6 -l 8 -s. }{\i\f6\fs18 inputfile outputfile\par }{\f3\fs22 does nothing beyond the normal, since it simply restates those defaults. Notice that command line options can have a space between the\par option and the value, or not. It is unimportant.\par }{\f4\fs18 shintr }{\f3\fs22 also allows for various markers to be ignored (so that you only display the interlinear\par lines of interest). This is done using the }{\f4\fs18 -x }{\f3\fs22 option, which is followed by a list of markers\par separated by any non-word-forming characters (commas will do!).\par }{\b\f1\fs24 sh_rtf\par }{\f4\fs18 sh_rtf }{\f3\fs22 is a powerful program for converting Shoebox data into RTF. It emulates the Shoebox\par output process and enhances it in a couple of areas.\par The primary enhancement, apart from the magic style }{\f4\fs18 _RTFONLY_}{\f3\fs22 , is to provide some rudimentary\par support for tables. There are two ways in which this is done. 
The first applies when there is a style\par called }{\f4\fs18 Free Translation }{\f3\fs22 as well as }{\f4\fs18 Interlinear Block}{\f3\fs22 : the }{\f4\fs18 -c }{\f3\fs22 option then lets you\par specify the width of the free translation column, which will appear as column 2, and the width\par of column 1 for the interlinear block will be set such that the total table width is 7 inches.\par The second approach is to embed extra styling information into the marker specification. Since\par Shoebox merrily ignores and removes any markers it does not understand in a database type file,\par we embed the information in the marker description. Embedded information is stored in the\par description in the form of XML empty tags. For example, to indicate which column of a table a\par marker’s paragraph should be inserted into, add the following to the description:\par }{\f4\fs18 <col num="}{\i\f2 number}{\f4\fs18 "/>\par }{\f3\fs22 Two other tags are also required to help in the specification:\par }{\f4\fs18 <totalcols num="}{\i\f2 number}{\f4\fs18 "/>\par <colwidths num="}{\i\f2 number}{\f4\fs18 , }{\i\f2 number}{\f4\fs18 , ..."/>\par totalcols }{\f3\fs22 indicates the number of columns in the table. This is necessary so that for any\par particular column, the converter can work out what to do at the end of the cell, whether this is the\par end of a row, and whether and how many blank cells to insert if intervening cells are not filled.\par }{\f4\fs18 colwidths }{\f3\fs22 is also needed for every marker in a table. It specifies the widths (in inches, as floating point\par numbers) of all the columns. Again this is needed in case the column we are on has to start a new\par table. 
Since the styles are not linked, all the information is needed for every column: that is,\par assuming you cannot guarantee perfectly structured data.\par Using this approach it should be possible to typeset interlinear text with diglot free translation and\par intervening notes.\par }{\b\f1\fs24 Other Options\par }{\f3\fs22 Other options for }{\f4\fs18 sh_rtf }{\f3\fs22 are: }{\f4\fs18 -d }{\f3\fs22 allows the user to specify a document template that this\par document should attach to and update styles from on loading into Word. If you need to use SDF,\par then the }{\f4\fs18 -p }{\f3\fs22 option is used to specify the directory in which to find }{\f4\fs18 render.dll}{\f3\fs22 . This is an\par experimental feature and you will need the }{\f4\fs18 Win32::API }{\f3\fs22 module for it to have any hope of working!}\par {\b\f1\fs28 Interlinear Text\par }{\f3\fs22 This section is a long and somewhat involved explanation of why }{\f4\fs18 shintr }{\f3\fs22 works the way it does.\par It is of only ancillary interest and may be skipped.\par Interlinear text is fun stuff. The traditional approach to interlinearising is that the root of an\par interlinear block is a root word in the source language. This is then transformed into each of the\par forms on each line. The transformations are related in such a way that their paths back to the\par root form a tree with no overlapping branches. If we\par compare the hierarchy of transformations along with the lines on which they occur, we find that\par we have a clean tree. 
There needs to be agreement between the order in which interlinear\par lines are rendered and the transformational hierarchy of the interlineariser.\par It is this relationship between hierarchies that a tool such as }{\f4\fs18 shintr }{\f3\fs22 has to address and resolve.\par Technologies such as EQ fields in Word, or XML, require data to be in a hierarchy. And if it is\par also necessary to keep line ordering, then the two hierarchies need to be resolved in some way.\par And for the most part, for normal interlinearising, rather than analysis and generation, such a\par model is perfectly sufficient (especially if care is taken over the order of the lines in the\par interlinear block).\par Shoebox uses a different model. Rather than seeing each interlinear element as a node in a\par transformation tree, it sees each transformation as its own independent transformation from one\par line to another, without any relationship to any other transformation. This is a highly flexible\par approach allowing lines to be in any order.\par Apart from the line ordering problem, the two models were easy to integrate for Shoebox versions\par 1-3 where only single word analysis was possible. With the advent of Shoebox 4, things became\par much more complicated due to the ability to do string-to-string mapping in addition to word-to-word\par mapping. The result is that we can no longer say that there is any root upon which a\par hierarchy may be built. How a word in a line is generated becomes\par divorced from how a word on another line is generated from that word. For example, a string\par which is used as a single node in generating a subsequent line may itself be made up of multiple\par nodes in its creation.\par The tying of branches together, in this way, is another way in which the hierarchy is broken. Thus\par we have two ways in which the hierarchy can be broken and which such a tool as }{\f4\fs18 shintr }{\f3\fs22 must\par resolve. 
The first (overlapping branches) is resolved by, effectively, making a lower branch a\par child of the lowest common ancestor branch such that there is no overlap. This is not ideal, and\par some line-up errors may occur, but at least the data is there, and with some careful re-ordering of\par lines to be output, the problem can be resolved. The second problem (conflating branches by\par string replacement) is resolved by making the string replacement relate to the first node in the\par string in the hierarchy. Due to a technique of node conflation (nodes with no children and a\par common parent are conflated into a previous node with the same parent with children), the results\par are not as bad as might first be expected. But there are occasions, and the sample text is one of\par them, where the problem is visually evident.\par These problems could be resolved if true tables were used where cell widths can vary and even\par cells be merged. The problem with a table based output approach is that it is necessary to measure\par the width of strings in order to find out how wide the table is for each block and to split it\par appropriately for line wrapping. On the other hand, EQ fields have the problem that Microsoft\par really don’t want to have to support them, and so they will probably get more buggy as time goes\par on. I’m sure the debate will run and run. In the meantime, I offer this humble tool to perhaps help\par a few people along the way.\par}\par }