Module Version: 1.0 (latest release: Text-TEI-Markup-1.9)


Text::TEI::Markup - a transcription markup syntax for TEI XML


 use Text::TEI::Markup qw( to_xml );
 my $xml_string = to_xml( file => $markup_file, 
    template => $template_xml_string,
    %opts );  # see below for available options

 use Text::TEI::Markup qw( word_tag_wrap );
 my $word_wrapped_xml = word_tag_wrap( $tei_xml_string );


TEI XML is a wonderful thing. The elements defined therein allow a transcriber to record and represent just about any feature of a text that he or she encounters.

The problem is the transcription itself. When I am transcribing a manuscript, especially if that manuscript is in a bunch of funny characters on the keymap for another language, I do not want to be switching back and forth between keyboard layouts in order to type "<tag attr="attr">" arrow-arrow-arrow-arrow-arrow "</tag>" every six seconds. It's prone to typos, it's astonishingly slow, and it makes my wrists hurt just to think about it. I also don't really want to fire up an XML editor, select the words or characters that need to be tagged, and click a lot. That way is not prone to typos, but it's still pretty darn slow, and it makes my wrists hurt even more to think about.

Text::TEI::Markup is my solution to that problem. It defines a bunch of single- or double-character sigils that represent tags. These are a lot faster and easier to type; I don't have to worry about typos; and I can do it all with a plain text editor, thus minimizing use of the mouse.

I have tried to pick sigils that don't conflict with characters that are found in manuscripts. I have succeeded for my particular set of manuscripts, but I have not succeeded for the general case. If you like the idea behind this module, you are still almost guaranteed to hate the sigils I've picked. That's okay; you can re-define them.

Extra bonus solution: word wrapping with <w/> and <seg/>

Even if you are happy as a clam in the graphical XML editor of your choice, this module exports a function that may be useful to you. The TEI P5 guidelines include a module called "analysis", which allows the user to tag sentences, clauses, words, morphemes, or any other sort of semantic segment of a text. This is really good for programmatic applications, but very boring and repetitive to have to tag.

The function word_tag_wrap solves part of this problem for you. It takes an XML string as input, looks for words (defined by whitespace separation) and returns an XML string with each of these words wrapped in an appropriate tag. If the word has complex elements (e.g. editorial expansion), it will be wrapped in a <seg type="word"/> tag. If not, it will be in a simple <w/> tag. It handles line breaks and page breaks within words, as long as there is no whitespace before the <lb/> (or <pb/>) tag, and as long as the whitespace after the tag contains a carriage return.


The input file has a header and a body. The header begins with a '=HEAD' tag and consists of key:value pairs, one per line, with a colon separating each key from its value. These keys, which are case insensitive, get directly substituted into an XML template wherever the matching __KEY__ placeholder appears; the idea is that your TEI header won't change very much between files, so you write it once with template values, pass it to &to_xml, and the substitution happens as if by magic. The keyword /MAIN/i is reserved for the content between the <body></body> tags - that is, all the content that will be generated after the '=BODY' tag.

A very simple template looks like this:

 <?xml version="1.0" encoding="UTF-8"?>
 <TEI>
   <teiHeader>
     <fileDesc>
       <titleStmt>
         <title>__TITLE__</title>
         <author>__AUTHOR__</author>
         <respStmt xml:id="#__MYINITIALS__">
           <resp>Transcription by</resp>
           <name>__MYNAME__</name>
         </respStmt>
       </titleStmt>
     </fileDesc>
   </teiHeader>
   <text>
     <body>
 __MAIN__
     </body>
   </text>
 </TEI>

Your input file should then begin something like this:

 =HEAD
 title:My Summer Vacation: a novel
 author:John Smith
 myname:Tara L Andrews
 =BODY
 The real text begins here.
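
To make the header substitution concrete, here is a minimal sketch of how key:value lines could be turned into placeholder replacements. It is an illustration only, not the module's actual code: the helper name fill_template is hypothetical, and the sketch assumes that keys are split from values at the first colon (so a value may itself contain colons) and that unknown placeholders are left untouched.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Text::TEI::Markup): fill __KEY__
# placeholders in a template from "key:value" header lines.
sub fill_template {
    my ( $template, @header_lines ) = @_;
    my %values;
    for my $line (@header_lines) {
        # Split on the first colon only, so values may contain colons.
        my ( $key, $value ) = split /:/, $line, 2;
        next unless defined $value;
        $values{ uc $key } = $value;    # keys are case insensitive
    }
    # Replace each known placeholder; leave unknown ones as they are.
    $template =~ s/__([A-Za-z]+)__/exists $values{ uc $1 } ? $values{ uc $1 } : "__$1__"/ge;
    return $template;
}

my $filled = fill_template(
    '<title>__TITLE__</title><author>__AUTHOR__</author>',
    'title:My Summer Vacation: a novel',
    'author:John Smith',
);
print "$filled\n";
# prints: <title>My Summer Vacation: a novel</title><author>John Smith</author>
```

A real implementation would also need to XML-escape the values; the sketch leaves that out to keep the substitution step visible.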

The real work begins after the '=BODY' tag. The currently-defined sigil list is:

 %SIGILS = ( 
    'comment' => '##',
    'add' => '+',
    'del' => '-',
    'subst' => "\x{b1}",   # Unicode PLUS-MINUS SIGN
    'div' => "\x{a7}",     # Unicode SECTION SIGN
    'p' => "\x{b6}",       # Unicode PILCROW SIGN
    'ex' => '\\',
    'expan' => '^',
    'abbr' => [ '{', '}' ],
    'num' => '%',
    'pb' => [ '[', ']' ],
    'hi' => '*',
 );

Non-identical matched sets of sigla (e.g. '{}' for abbreviations) should be specified in a listref, as seen here.
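
As a rough sketch of what a redefined sigil table might look like, the snippet below copies a few of the defaults listed above and overrides two of them. The replacement characters are arbitrary examples, and whether the module expects a complete table or accepts a partial one is an assumption left open here; the sketch builds a table that keeps every default it does not override.

```perl
use strict;
use warnings;

# Excerpt of the default table shown above, copied by hand for illustration.
my %default_sigils = (
    'add'  => '+',
    'del'  => '-',
    'abbr' => [ '{', '}' ],
    'hi'   => '*',
);

# Later keys win, so this keeps the defaults and overrides two entries.
# A matched pair of different sigla goes in a listref, as with 'abbr'.
my %my_sigils = (
    %default_sigils,
    'add'  => '@',                 # arbitrary example replacement
    'abbr' => [ '<<', '>>' ],      # arbitrary example pair
);

my $sigil_ref = \%my_sigils;       # a hashref, as the sigil option expects
```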

Whitespace is only significant at the end of lines. If a line which contains non-tag text (i.e. words) ends in whitespace, it is assumed that the previous word is a complete word. If the line ends with a non-whitespace character, it is assumed that the word continues onto the next line.
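
The end-of-line rule can be sketched in a few lines of Perl. This is a simplified illustration, not the module's parser: the helper name join_lines is hypothetical, the input lines are assumed to have had their newlines removed already, and the line-break markup that a real transcription would carry is ignored so that only the "complete word or continued word" decision is shown.

```perl
use strict;
use warnings;

# Hypothetical helper: join transcription lines according to the
# trailing-whitespace rule. Lines must already be chomped.
sub join_lines {
    my @lines = @_;
    my $text  = '';
    for my $line (@lines) {
        if ( $line =~ /\s$/ ) {
            # Trailing whitespace: the last word on this line is complete.
            $line =~ s/\s+$//;
            $text .= "$line ";
        }
        else {
            # No trailing whitespace: the word continues on the next line.
            $text .= $line;
        }
    }
    $text =~ s/\s+$//;
    return $text;
}

my $joined = join_lines( "the quick bro", "wn fox ", "jumps" );
print "$joined\n";
# prints: the quick brown fox jumps
```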

All the sigils must be balanced, and they must nest properly. Remember that this is a shorthand for XML. I could be convinced to try to autocorrect some unbalanced sigils, but it would be worth at least a few pints of cider (or, of course, a patch.)


to_xml( file => $filename, %opts );

Takes the name of a file that holds a marked-up version of text. Returns a TEI XML string to represent that text.

Options include:


a string containing the XML template that you want to use for the markup. If none is specified, there is a default. That default is useful for me, but is very unlikely to be useful for you.


a mode string to pass to the open() call on the file. Default "<:utf8".


a subroutine ref that will calculate the value of number representations. Useful for, e.g., Latin numerals. This is optional - if nothing is passed, no number value calculation will be attempted.


a hashref containing the preferred sigil representations of TEI tags. Defaults to the list above.


Defaults to "true". If you pass a false value, the word wrapping will be skipped.


Defaults to 0. Controls whether rudimentary formatting is applied to the XML returned. Possible values are 0, 1, and "more than 1". See XML::LibXML::Document::toString for more information. (Personally I just xmllint it separately.)

The return string is run through the basic formatting mechanism provided by XML::LibXML. You may wish to pass it through a pretty printer more to your taste.
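
For the number-value subroutine option described above, a callback for Roman numerals might look like the sketch below. The calling convention is an assumption: the sketch supposes the subroutine receives the text of the number as its only argument and returns the numeric value, or undef when the text is not a Roman numeral. Check the module source before relying on that.

```perl
use strict;
use warnings;

my %ROMAN = ( I => 1, V => 5, X => 10, L => 50, C => 100, D => 500, M => 1000 );

# Hypothetical callback: convert Roman numeral text to an integer.
my $roman_to_int = sub {
    my ($text) = @_;
    my @digits = map { $ROMAN{$_} } split //, uc $text;
    # Give up on empty input or on any character that is not a numeral.
    return undef if !@digits || grep { !defined $_ } @digits;
    my $total = 0;
    for my $i ( 0 .. $#digits ) {
        # A smaller digit before a larger one is subtracted (e.g. IV = 4).
        if ( $i < $#digits && $digits[$i] < $digits[ $i + 1 ] ) {
            $total -= $digits[$i];
        }
        else {
            $total += $digits[$i];
        }
    }
    return $total;
};

print $roman_to_int->('xiv'), "\n";    # prints: 14
```

Additive forms that turn up in manuscripts, such as "iiii", also come out as expected (4), because every digit is simply added unless a larger one follows it.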

word_tag_wrap( $xml_string )

Takes a string containing a TEI XML document, and returns that document with all its words wrapped in <w/> (or <seg/>) tags. A "word" is defined as a series of text characters separated by whitespace. A word can have a line break, or even a page break, in the middle; if this is the case, there must not be any whitespace between the end of the first word segment and the <lb/> (or <pb/>) tag. Conversely, there must be whitespace separating the <lb/> (or <pb/>) from a complete word.
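
To illustrate only the definition of a "word" used here, the following sketch wraps whitespace-separated tokens of a plain text string in <w> tags. It is deliberately much simpler than word_tag_wrap: it takes plain text rather than an XML document, never produces <seg/>, and knows nothing about <lb/> or <pb/>. The helper name wrap_plain_words is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap each whitespace-separated token in <w> tags.
# Works on plain text only; it does not parse or preserve XML markup.
sub wrap_plain_words {
    my ($text) = @_;
    my @words = grep { length $_ } split /\s+/, $text;
    return join ' ', map { "<w>$_</w>" } @words;
}

my $wrapped = wrap_plain_words("  In the   beginning ");
print "$wrapped\n";
# prints: <w>In</w> <w>the</w> <w>beginning</w>
```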


The XML is not currently validated against a schema. This is mostly because I have been unable to get RelaxNG validation to work against certain TEI schemas.

This module is currently in a state that I know to be useful to me. If it looks like it might be useful to you, but something is bugging you about it, report it!


This package is free software and is provided "as is" without express or implied warranty. You can redistribute it and/or modify it under the same terms as Perl itself.


Tara L Andrews, aurum@cpan.org
