
NAME

Text::Shoebox - read and write SIL Shoebox Standard Format (.sf) files

SYNOPSIS

  use Text::Shoebox;
  my $lex = [];
  foreach my $file (@ARGV) {
    read_sf(
      from_file => $file, into => $lex,
    ) or warn "read from $file failed\n";
  }
  print scalar(@$lex), " entries read.\n";
  
  die "hw field-names differ\n"
   unless are_hw_keys_uniform($lex);
  warn "hw field-values aren't unique\n"
   unless are_hw_values_unique($lex);
  
  write_sf(from => $lex, to_file => "merged.sf")
   or die "Couldn't write to merged.sf: $!";

DESCRIPTION

The Summer Institute of Linguistics (http://www.sil.org/) makes a piece of free software called "the Linguist's Shoebox", or just "Shoebox" for short. It's a simple database program generally used for making lexicon databases (although it can also be used for databases of field notes, etc.).

Shoebox can export its databases to SF (Standard Format) files, a simple text format. Reading and writing those SF files is what this Perl module, Text::Shoebox, is for. (I have heard that Standard Format predates Shoebox quite a bit, and is used by other programs. If you use SF files with something other than Shoebox, I'd be interested in hearing about it, particularly about whether such files and Text::Shoebox are happily compatible.)

OBJECT-ORIENTED INTERFACE

This module provides a functional interface. If you want an object-oriented interface, with a bit more convenience, then see the classes Text::Shoebox::Lexicon and Text::Shoebox::Entry.

FUNCTIONS

$lex_lol = read_sf(...options...)

Reads entries in Standard Format from the source specified. If no entries were read, returns false. Otherwise, returns a reference to the array that the entries were added to (which will be a new array, unless the "into" option is set). If there's an I/O error while reading (for example, if you specify an unreadable file), then this routine dies.

The options are:

from_file => STRING

This specifies that the source of the SF data is a file, whose filespec is given.

from_handle => FILEHANDLE

This specifies that the source of the SF data is a given filehandle. (Examples of filehandle values: a global filehandle passed either like *MYFH{IO} or *MYFH; or an object value from an IO class like IO::Socket or IO::Handle.)

The filehandle isn't closed when all its data is read.

rs => STRING

This specifies that the given string should be used as the record separator (newline string) for the data source being read from.

If the SF source is specified by a "from_file" option, and you don't specify an "rs" option, then Text::Shoebox will try guessing the line format of the file by reading the first 2K of the file and looking for a CRLF ("\cm\cj"), an LF ("\cj"), or a CR ("\cm"). If you need to stop it from trying to guess, just stipulate an "rs" value of $/.

If the SF source is specified by a "from_handle" option, and you don't specify an "rs" option, then Text::Shoebox will just use the value in the Perl variable $/ (the global RS value).
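The guessing rule described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the module's actual code: guess_rs is a hypothetical helper name, it works on a string rather than a file, and falling back to $/ when no newline is found is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Text::Shoebox's documented interface):
# guess the newline string by looking at the first 2K of some data.
sub guess_rs {
    my ($data) = @_;
    my $head = substr($data, 0, 2048);
    # The alternation tries CRLF before a bare LF or CR, so a CRLF
    # is never mistaken for a lone CR.
    return $1 if $head =~ m/(\cm\cj|\cj|\cm)/;
    return $/;    # assumption: fall back to the global RS if nothing is found
}

# Show the guessed separator as hex bytes for three sample inputs.
for my $sample ("a\cm\cjb", "a\cjb", "a\cmb") {
    my $rs = guess_rs($sample);
    print join(' ', map { sprintf '%02X', ord } split //, $rs), "\n";
}
```

The sketch returns whichever newline sequence appears first in the sample, which is enough for files that use one line format throughout.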

into => ARRAYREF

If this option is stipulated, then entries read will be pushed to the end of the array specified. Otherwise the entries will be put into a new array.

Example use:

  use Text::Shoebox;
  my $lexicon = read_sf(from_file => 'foo.sf')
   || die "No entries?";
  print scalar(@$lexicon), " entries read.\n";
  print "First entry has ",
   @{ $lexicon->[0] } / 2 , " fields.\n";

write_sf(...options...)

This writes the given lexicon, in Standard Format, to the destination specified. If all entries were written, returns true; otherwise (in case of an I/O error), returns false, in which case you should check $!. Note that this routine doesn't die in the case of an I/O error, so you should always check the return value of this function, as with:

  write_sf(...) || die "Couldn't write: $!";

The options are:

from => ARRAYREF

This option must be present, to specify the lexicon that you want to write out.

to_file => STRING

This specifies that the SF data is to be written to the file specified. (Note that the file is opened in overwrite mode, not append mode.)

to_handle => FILEHANDLE

This specifies that the destination for the SF data is the given filehandle.

The filehandle isn't closed when all the data is written to it.

rs => STRING

This specifies that the given string should be used as the record separator (newline string) for the SF data written.

If not specified, defaults to "\n".

are_hw_keys_uniform($lex_lol)

This function returns true iff all the entries in the lexicon have the same key for their headword fields (i.e., the first field per record). This will always be true if you read the lexicon from one file; but if you read it from several, it's possible that the different files have different keys marking headword fields.
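To make the check concrete, here is a minimal sketch of the same idea in plain Perl. It is an illustration, not the module's implementation: hw_keys_uniform is a hypothetical name, the sample data is made up, and it assumes each entry is an array reference laid out as [key1, value1, key2, value2, ...], so that the headword key is element 0.

```perl
use strict;
use warnings;

# Hypothetical re-implementation, for illustration only.
# Assumes each entry is [key1, value1, key2, value2, ...].
sub hw_keys_uniform {
    my ($lex) = @_;
    my %seen;
    for my $entry (@$lex) {
        next unless @$entry;         # assumption: empty entries are ignored
        $seen{ $entry->[0] } = 1;    # record this entry's headword key
    }
    return keys(%seen) <= 1;         # true if at most one distinct key was seen
}

# Two made-up lexicons, as if read from files that mark headwords differently.
my $from_a = [ [ lx => 'kato',  ge => 'cat' ] ];
my $from_b = [ [ hw => 'hundo', ge => 'dog' ] ];

print hw_keys_uniform([ @$from_a, @$from_b ])
    ? "headword keys are uniform\n"
    : "headword keys differ\n";
```

Merging lexicons read from several files is exactly the case where a check like this is worth running before writing the result out.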

are_hw_values_unique($lex_lol)

This function returns true iff all the headword values in all non-null entries in the lexicon $lex_lol are unique -- i.e., if no two (or more) entries have the same values for their headword fields. I don't know if uniqueness is a requirement for SF lexicons that you'd want to import into Shoebox, but some tasks you put lexicons to might require it.
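When the uniqueness check fails, it is often useful to know which headwords are repeated. The following is a sketch of one way to list them, not something the module provides: duplicate_headwords is a hypothetical name, "non-null" is read here as "has at least a key and a value", and each entry is assumed to be laid out as [key1, value1, key2, value2, ...], so that the headword value is element 1.

```perl
use strict;
use warnings;

# Hypothetical helper, for illustration only.
# Assumes each entry is [key1, value1, key2, value2, ...].
sub duplicate_headwords {
    my ($lex) = @_;
    my %count;
    for my $entry (@$lex) {
        next unless @$entry >= 2;    # skip entries with no headword value
        $count{ $entry->[1] }++;     # tally this headword value
    }
    my @dups = sort grep { $count{$_} > 1 } keys %count;
    return @dups;                    # in scalar context: how many are repeated
}

# Made-up sample data with one repeated headword and one empty entry.
my $lex = [
    [ lx => 'kato',  ge => 'cat' ],
    [ lx => 'hundo', ge => 'dog' ],
    [ lx => 'kato',  ge => 'tomcat' ],
    [],
];
print "repeated headword: $_\n" for duplicate_headwords($lex);
```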

A NOTE ABOUT VALIDITY

I make very few assumptions about what characters can be in a field key in SF files. Just now, I happen to assume they can't start with an underscore (lest they be considered comments), and can't contain any whitespace characters.
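The two key restrictions just mentioned can be expressed as a small predicate. This is a sketch for checking your own data before writing it out; looks_like_valid_key is a hypothetical name, and rejecting undefined or empty keys is an extra assumption not stated above.

```perl
use strict;
use warnings;

# Hypothetical helper, for illustration only.
sub looks_like_valid_key {
    my ($key) = @_;
    return 0 unless defined $key and length $key;   # assumption: empty keys are rejected
    return 0 if $key =~ m/^_/;                      # a leading underscore would read as a comment
    return 0 if $key =~ m/\s/;                      # no whitespace anywhere in the key
    return 1;
}

for my $key ('lx', '_note', 'part of speech') {
    printf "%-16s %s\n", "'$key'", looks_like_valid_key($key) ? 'ok' : 'rejected';
}
```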

I make essentially no assumptions about what can be in a field value, except that there can be no newline followed immediately by a backslash. (Any newline-backslash sequence is turned into newline-space-backslash.)
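The newline-backslash rule amounts to a single substitution. Here is a minimal sketch of it, assuming the newline in question is "\n"; escape_field_value is a hypothetical name, and the module's own code may handle this differently.

```perl
use strict;
use warnings;

# Hypothetical helper, for illustration only: keep a backslash at the
# start of a continuation line from looking like a new field marker.
sub escape_field_value {
    my ($value) = @_;
    $value =~ s/\n\\/\n \\/g;    # newline-backslash becomes newline-space-backslash
    return $value;
}

my $raw = "first line\n\\looks like a field key";
print escape_field_value($raw), "\n";
```

A backslash that does not directly follow a newline is left alone, since only a line-initial backslash could be misread when the file is parsed again.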

You should be aware that Shoebox, or whatever other programs use SF files, may have a much more restricted view of what can be in a field key or value.

SEE ALSO

Text::Shoebox::Lexicon

Text::Shoebox::Entry

COPYRIGHT

Copyright 2000-2004, Sean M. Burke sburke@cpan.org, all rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

AUTHOR

Sean M. Burke, sburke@cpan.org

Please contact me if you find that this module is not behaving correctly. I've tested it only on Shoebox files I generate on my own.

I hasten to point out, incidentally, that I am not in any way affiliated with the Summer Institute of Linguistics.