Module Version: 1.8.1


Text::TEI::Markup - a transcription markup syntax for TEI XML


 use Text::TEI::Markup qw( to_xml );
 my $xml_string = to_xml( file => $markup_file, 
        template => $template_xml_string,
        %opts );  # see below for available options

 use Text::TEI::Markup qw( word_tag_wrap );
 my $word_wrapped_xml = word_tag_wrap( $tei_xml_string );


TEI XML is a wonderful thing. The elements defined therein allow a transcriber to record and represent just about any feature of a text that he or she encounters.

The problem is the transcription itself. When I am transcribing a manuscript, especially if that manuscript is in a bunch of funny characters on the keymap for another language, I do not want to be switching back and forth between keyboard layouts in order to type "<tag attr="attr">" arrow-arrow-arrow-arrow-arrow "</tag>" every six seconds. It's prone to typos, it's astonishingly slow, and it makes my wrists hurt just to think about it. I also don't really want to fire up an XML editor, select the words or characters that need to be tagged, and click a lot. That way is not prone to typos, but it's still pretty darn slow, and it makes my wrists hurt even more to think about.

Text::TEI::Markup is my solution to that problem. It defines a bunch of single- or double-character sigils that represent tags. These are a lot faster and easier to type; I don't have to worry about typos; and I can do it all with a plain text editor, thus minimizing use of the mouse.

I have tried to pick sigils that don't conflict with characters that are found in manuscripts. I have succeeded for my particular set of manuscripts, but I have not succeeded for the general case. If you like the idea behind this module, you are still almost guaranteed to hate the sigils I've picked. That's okay; you can re-define them.

Extra bonus solution: word wrapping with <w/> and <seg/>

Even if you are happy as a clam in the graphical XML editor of your choice, this module exports a function that may be useful to you. The TEI P5 guidelines include a module called "analysis", which allows the user to tag sentences, clauses, words, morphemes, or any other sort of semantic segment of a text. This is really good for programmatic applications, but very boring and repetitive to have to tag.

The function word_tag_wrap solves part of this problem for you. It takes an XML string as input, looks for words (defined by whitespace separation) and returns an XML string with each of these words wrapped in an appropriate tag. If the word has complex elements (e.g. editorial expansion), it will be wrapped in a <seg type="word"/> tag. If not, it will be in a simple <w/> tag. It handles line breaks and page breaks within words, as long as there is no whitespace before the <lb/> (or <pb/>) tag, and as long as the whitespace after the tag contains a carriage return.


The input file has a header and a body. The header begins with a '=HEAD' tag and consists of a list of colon-separated key:value pairs, one per line. The keys, which are case-insensitive, are substituted directly into an XML template: your TEI header won't change very much between files, so you write it once with template values, pass it to to_xml, and the substitution happens as if by magic. The keyword /MAIN/i is reserved for the content between the <body></body> tags - that is, all the content that will be generated after the '=BODY' tag.

A very simple template looks like this:

 <?xml version="1.0" encoding="UTF-8"?>
                 <respStmt xml:id="#__MYINITIALS__">
                   <resp>Transcription by</resp>

Your input file should then begin something like this:

 =HEAD
 title:My Summer Vacation: a novel
 author:John Smith
 myname:Tara L Andrews
 =BODY
 The ^real^ text b\e\gins +(above)t+here.
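
To make the header mechanism concrete, here is a minimal sketch of the substitution idea. It is not the module's own code: the helper names parse_input and fill_template are made up for illustration, and the __KEY__ placeholder form is an assumption generalized from the __MYINITIALS__ placeholder in the template above.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Text::TEI::Markup: split the marked-up
# input into header fields and body text.
sub parse_input {
    my ($text) = @_;
    my ( %fields, @body );
    my $section = '';
    for my $line ( split /\n/, $text ) {
        if ( $line =~ /^=(HEAD|BODY)\s*$/ ) {
            $section = $1;
        }
        elsif ( $section eq 'HEAD' && $line =~ /^([^:]+):(.*)$/ ) {
            # Split on the first colon only, so values may contain colons.
            # Keys are case-insensitive, so store them in upper case.
            $fields{ uc $1 } = $2;
        }
        elsif ( $section eq 'BODY' ) {
            push @body, $line;
        }
    }
    return ( \%fields, join( "\n", @body ) );
}

# Hypothetical helper: replace each __KEY__ placeholder with the matching
# header value, leaving unknown placeholders untouched.
sub fill_template {
    my ( $template, $fields ) = @_;
    $template =~ s/__([A-Z]+)__/exists $fields->{$1} ? $fields->{$1} : "__${1}__"/ge;
    return $template;
}

my $input = join "\n",
    '=HEAD',
    'title:My Summer Vacation: a novel',
    'author:John Smith',
    '=BODY',
    'Some text.';

my ( $fields, $body ) = parse_input($input);
print fill_template( '<title>__TITLE__</title><author>__AUTHOR__</author>', $fields ), "\n";
```

Splitting on the first colon only is what lets a title such as "My Summer Vacation: a novel" survive intact.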

The real work begins after the '=BODY' tag. The currently-defined sigil list is:

 %SIGILS = (
        'comment'  => '##',
        'add'      => '+',
        'del'      => '-',
        'subst'    => "\x{b1}",      # Unicode PLUS-MINUS SIGN
        'div'      => "\x{a7}",      # Unicode SECTION SIGN
        'p'        => "\x{b6}",      # Unicode PILCROW SIGN
        'ex'       => '\\',
        'expan'    => '^',
        'supplied' => '@',
        'abbr'     => [ '{', '}' ],
        'num'      => '%',
        'pb'       => [ '[', ']' ],
        'cb'       => '|',
        'hi'       => '*',
        'unclear'  => '?',
        'q'        => "\x{2020}",    # Unicode DAGGER
        );

Non-identical matched sets of sigla (e.g. '{}' for abbreviations) should be specified in a listref, as seen here.
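
If you want to re-define a sigil, one way to build the hashref is to copy the defaults and override only what you need. The sketch below is an illustration under assumptions: it copies just a few entries from the list above, the guillemet pair is an arbitrary choice, and sigil_pair is a made-up helper that shows how a plain string and a listref can be treated uniformly.

```perl
use strict;
use warnings;

# A few defaults copied from the list above (not the full table).
my %default_sigils = (
    'add'  => '+',
    'del'  => '-',
    'abbr' => [ '{', '}' ],
    'hi'   => '*',
);

# Hypothetical override: suppose '*' occurs in the manuscript text itself,
# so highlighting gets a matched pair of guillemets instead.
my %my_sigils = (
    %default_sigils,
    'hi' => [ "\x{ab}", "\x{bb}" ],
);

# Hypothetical helper: return the (open, close) pair for a tag, whether the
# sigil was given as a plain string or as a listref of two strings.
sub sigil_pair {
    my ( $sigils, $tag ) = @_;
    my $sigil = $sigils->{$tag};
    return ref($sigil) eq 'ARRAY' ? @$sigil : ( $sigil, $sigil );
}

my ( $open, $close ) = sigil_pair( \%my_sigils, 'add' );
```

A reference such as \%my_sigils is the kind of value the sigils option of to_xml expects. Whether the module merges a partial hash with its defaults is not stated here, which is why this sketch starts from a copy of the defaults.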

Whitespace is only significant at the end of lines. If a line which contains non-tag text (i.e. words) ends in whitespace, it is assumed that the previous word is a complete word. If the line ends with a non-whitespace character, it is assumed that the word continues onto the next line.

All the sigils must be balanced, and they must nest properly. Remember that this is a shorthand for XML. I could be convinced to try to autocorrect some unbalanced sigils, but it would be worth at least a few pints of cider (or, of course, a patch).

Tag arguments

Certain of the tags can be passed extra arguments:

add / del

Anything that appears in parentheses immediately after the add/del opening sigil ( + or - in the examples above) will get added as an attribute. If the string in parentheses has no '=' sign in it, the attribute for the "add" tag will be "place", and the attribute for the "del" tag will be "type". Ergo:

 +(margin)This is an addition+
 -(overwrite)and a deletion- to the sentence.

will get translated to

 <add place="margin">This is an addition</add> 
 <del type="overwrite">and a deletion</del> to the sentence.

This behavior ought to be more configurable and/or flexible; make it worth my while.
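
The add/del rule can be sketched with a single substitution. This is only an illustration of the rule, not the module's parser: translate_tag is a made-up name, nesting and multi-line spans are ignored, and copying an argument that contains '=' verbatim into the tag is an assumption, since the text above does not show that case.

```perl
use strict;
use warnings;

# Default attribute names when the argument has no '=' sign.
my %default_attr = ( add => 'place', del => 'type' );
my %sigil_for    = ( add => '+',     del => '-' );

# Hypothetical helper: translate one kind of sigil (add or del) in a line.
# It does not handle nesting or sigils that span several lines.
sub translate_tag {
    my ( $tag, $line ) = @_;
    my $s = quotemeta $sigil_for{$tag};
    $line =~ s{$s(?:\(([^)]*)\))?(.*?)$s}{
        my ( $arg, $content ) = ( $1, $2 );
        # No argument: bare tag. Argument with '=': copied verbatim
        # (an assumption). Otherwise: the default attribute name.
        my $attr = !defined($arg) ? ''
                 : $arg =~ /=/    ? " $arg"
                 :                  qq{ $default_attr{$tag}="$arg"};
        "<$tag$attr>$content</$tag>";
    }ge;
    return $line;
}

print translate_tag( 'add', '+(margin)This is an addition+' ), "\n";
print translate_tag( 'del', '-(overwrite)and a deletion- to the sentence.' ), "\n";
```

Note that a '-' sigil handled this naively would also catch ordinary hyphens in the text, which is one reason you may want different sigils for your own manuscripts.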


num

A number value can be calculated using a number_conversion function, or it can simply be specified. It is also possible to specify the type of number being represented (cardinal, ordinal, fraction, percentage). The arguments are separated by a comma and given in the order "value", "type". So for example:

 The lead was taken by the Exeter %(8)VIII%. This was their 
 %(13,ord)thirteenth% straight win.

will become:

 The lead was taken by the Exeter <num value="8">VIII</num>. This was their 
 <num value="13" type="ordinal">thirteenth</num> straight win.
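
For manuscripts full of Latin numerals, a conversion routine saves typing the value by hand each time. The following is a sketch of what such a routine might look like. The name roman_to_int is made up, and the calling convention (one string in, one number or undef out) is an assumption about what the module expects from its conversion subroutine.

```perl
use strict;
use warnings;

my %roman = ( I => 1, V => 5, X => 10, L => 50, C => 100, D => 500, M => 1000 );

# Hypothetical conversion routine: returns the integer value of a Roman
# numeral, or undef if the string contains any other character.
sub roman_to_int {
    my ($str) = @_;
    $str = uc $str;
    return unless $str =~ /^[IVXLCDM]+$/;
    my @values = map { $roman{$_} } split //, $str;
    my $total = 0;
    for my $i ( 0 .. $#values ) {
        # A smaller value before a larger one is subtracted (IV = 4).
        if ( $i < $#values && $values[$i] < $values[ $i + 1 ] ) {
            $total -= $values[$i];
        }
        else {
            $total += $values[$i];
        }
    }
    return $total;
}

print roman_to_int('VIII'), "\n";    # prints 8
```

A reference to the routine (\&roman_to_int) is the sort of thing that can be handed to to_xml as the number conversion subroutine. The sketch does not reject malformed numerals such as "IIII" or "VX"; it just adds them up.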

hi

When text highlighting is encoded, it is almost always a good idea to say something about how the highlight was rendered. This information can be passed as an argument:

 *(red)IN the beginning* was the word

will become

 <hi rend="red">IN the beginning</hi> was the word


to_xml( file => $filename, %opts );

Takes the name of a file that holds a marked-up version of text. Returns a TEI XML string to represent that text. Options include:


template

a string containing the XML template that you want to use for the markup. If none is specified, there is a default. That default is useful for me, but is very unlikely to be useful for you.

fileopen_mode

a mode string to pass to the open() call on the file. Default "<:utf8".

number_conversion

a subroutine ref that will calculate the value of number representations. Useful for, e.g., Latin numerals. This is optional - if nothing is passed, no number value calculation will be attempted.

sigils

a hashref containing the preferred sigil representations of TEI tags. Defaults to the list above.


Defaults to "true". If you pass a false value, the word wrapping will be skipped.


Defaults to 0. Controls whether rudimentary formatting is applied to the XML returned. Possible values are 0, 1, and "more than 1". See XML::LibXML::Document::serialize for more information. (Personally I just xmllint it separately.)

The return string is run through the basic formatting mechanism provided by XML::LibXML. You may wish to pass it through a pretty printer more to your taste.

word_tag_wrap( $xml_string )

Takes a string containing a TEI XML document, and returns that document with all its words wrapped in <w/> (or <seg/>) tags. A "word" is defined as a series of text characters separated by whitespace. A word can have a line break, or even a page break, in the middle; if this is the case, there must not be any whitespace between the end of the first word segment and the <lb/> (or <pb/>) tag. Conversely, there must be whitespace separating the <lb/> (or <pb/>) from a complete word.
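
The choice between <w/> and <seg/> can be illustrated on a small fragment. This is a rough sketch and not how word_tag_wrap is implemented: wrap_words is a made-up helper that works on a bare string, assumes no tag contains whitespace, and ignores <lb/> and <pb/> handling entirely.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap each whitespace-separated token. Tokens that
# contain markup become <seg type="word">, plain tokens become <w>.
sub wrap_words {
    my ($fragment) = @_;
    my @wrapped;
    for my $word ( split ' ', $fragment ) {
        push @wrapped, $word =~ /</
            ? qq{<seg type="word">$word</seg>}
            : "<w>$word</w>";
    }
    return join ' ', @wrapped;
}

print wrap_words('in d<ex>omi</ex>no'), "\n";
```

Here "in" is a simple word and gets a <w/> tag, while the word with an editorial expansion inside it gets a <seg type="word"/> tag.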


The XML is not currently validated against a schema. This is mostly because I have been unable to get RelaxNG validation to work against certain TEI schemas.

This module is currently in a state that I know to be useful to me. If it looks like it might be useful to you, but something is bugging you about it, report it!


This package is free software and is provided "as is" without express or implied warranty. You can redistribute it and/or modify it under the same terms as Perl itself.


Tara L Andrews, aurum@cpan.org
