NAME

NLP::GATE::Document - Class for manipulating GATE-like documents

VERSION

Version 0.6

SYNOPSIS

  use NLP::GATE::Document;
  my $doc = NLP::GATE::Document->new();
  $doc->setText($text);
  $doc->setFeature($name,"featvalue");
  $doc->setFeatureType($name,$type);
  $annset = $doc->getAnnotationSet($setname);
  $doc->setAnnotationSet($set,$setname);
  $feature = $doc->getFeature($name);
  $type = $doc->getFeatureType($name);
  $xml = $doc->toXML();
  $doc->fromXMLFile($filename);
  $doc->fromXML($string);

DESCRIPTION

This is a simple class representing a document with annotations and features similar to how documents are represented in GATE. The class can produce a string representation of the document that is in XML format and should be readable by GATE.

All setter functions return the original Document object.
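
Because each setter returns the document, calls can be chained. The following is a minimal sketch that uses only methods documented below; the text and the feature value are arbitrary illustrations.

    use strict;
    use warnings;
    use NLP::GATE::Document;

    # Each setter returns the document, so the calls compose left to right.
    my $doc = NLP::GATE::Document->new()
        ->setText("GATE is a text engineering toolkit.")
        ->setFeature('gate.SourceURL', 'created from String');

    print $doc->getText(), "\n";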

METHODS

new()

Create a new document. Currently this method can only be called without parameters and always creates a new, empty document.

setText($text)

Set the text of the document. Note that existing annotations remain unchanged unless you explicitly remove them (see setAnnotationSet), so after the text is changed they might point to nonexistent or incorrect text.

appendText($theText)

Append text to the current text content of the document. In scalar context, returns the document object. In list context, returns the from and to offsets of the newly added text, which makes it easier to add annotations for that text snippet.
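
As a sketch of how the offsets might be used when appendText is called in list (array) context: the snippet below captures them and recovers the appended text with substr. The length arithmetic assumes the "to" offset is end-exclusive, as offsets are in GATE itself; that convention is an assumption here and is not stated in this documentation.

    use strict;
    use warnings;
    use NLP::GATE::Document;

    my $doc = NLP::GATE::Document->new();
    $doc->setText("Title. ");

    # List context: capture the offsets of the newly appended text.
    my ($from, $to) = $doc->appendText("Body text.");

    # Assumption: $to is end-exclusive, so the length is $to - $from.
    my $snippet = substr($doc->getText(), $from, $to - $from);
    print "offsets $from..$to: $snippet\n";

If the printed snippet is one character short, the offsets are end-inclusive and the length should be $to - $from + 1 instead.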

getText()

Get the plain text of the document.

getTextForAnnotation($annotation)

Get the text spanned by the given annotation.

TODO: no sanity checks yet!

getAnnotationSet ($name)

Return the annotation set with that name. Return undef if no set with such a name is found.

This is more straightforward than the original Java implementation in GATE: passing an empty string or undef as $name will return the default annotation set.

getAnnotationSetNames

Return a list of known annotation set names. The list includes an empty-string entry, which stands for the default annotation set.
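
The two lookup methods can be combined to walk over every annotation set in a document. This is a minimal sketch using only getAnnotationSetNames and getAnnotationSet; the "(default)" label is purely illustrative, and nothing is assumed about the set objects beyond their being Perl references.

    use strict;
    use warnings;
    use NLP::GATE::Document;

    my $doc = NLP::GATE::Document->new();
    $doc->setText("Some text.");

    for my $name ($doc->getAnnotationSetNames()) {
        # The empty string names the default annotation set.
        my $label = (defined $name && $name ne '') ? $name : '(default)';
        my $set   = $doc->getAnnotationSet($name);
        print "$label: ", (defined $set ? ref($set) : 'undef'), "\n";
    }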

setAnnotationSet ($set[,$name])

Store the annotation set object with the document under the given annotation set name. If the name is the empty string or undef, the default annotation set is stored or replaced. Any existing annotation set with that name will be destroyed (unless the object to replace it is the original set object).

setFeature($name,$value)

Add or replace the document feature of the given name with the new value. Make sure you add at least the usual GATE standard features to a document:

   $doc->setFeature('gate.SourceURL','created from String');

getFeature($name)

Return the value of the document feature with that name.

setFeatureType($name,$type)

Set the Java type for the feature.

getFeatureType($name)

Return the Java type for a feature. If the type has never been set, the default is java.lang.String.
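
A short sketch of how a feature and its Java type fit together. The feature name pageCount is hypothetical, and java.lang.Integer is simply one example of a Java class name; which types GATE accepts when reading the resulting XML is not covered here.

    use strict;
    use warnings;
    use NLP::GATE::Document;

    my $doc = NLP::GATE::Document->new();
    $doc->setFeature('pageCount', 12);    # hypothetical feature name

    # No type set yet: the documented default is java.lang.String.
    print $doc->getFeatureType('pageCount'), "\n";

    # Declare the feature as an integer on the Java side.
    $doc->setFeatureType('pageCount', 'java.lang.Integer');
    print $doc->getFeatureType('pageCount'), "\n";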

fromXMLFile($filename)

Read a GATE document from an XML file. All content of the current object, including features, annotations, and text, is discarded.

fromXML($string)

Read a GATE document from a string that contains a GATE document into the current object. All previous content of the object is discarded. For now, the XML string has to be encoded in UTF-8.

toXML()

Create an XML representation of the document, usable by GATE, from the internal representation.
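
As a sketch, toXML and fromXML can be paired to copy a document through its XML form. Whether toXML returns a character string or UTF-8 encoded bytes is not stated in this documentation, so the example sticks to ASCII text, where the distinction does not matter; for non-ASCII text an explicit encoding step might be needed before calling fromXML.

    use strict;
    use warnings;
    use NLP::GATE::Document;

    my $doc = NLP::GATE::Document->new();
    $doc->setText("Plain ASCII text.");
    $doc->setFeature('gate.SourceURL', 'created from String');

    # Serialize, then load the XML into a second, initially empty document.
    my $xml  = $doc->toXML();
    my $copy = NLP::GATE::Document->new();
    $copy->fromXML($xml);

    print $copy->getText(), "\n";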

AUTHOR

Johann Petrak, <firstname.lastname-at-jpetrak-dot-com>

BUGS

Please report any bugs or feature requests to bug-gate-document at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=NLP::GATE. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

SUPPORT

You can find documentation for this module with the perldoc command.

    perldoc NLP::GATE

ACKNOWLEDGEMENTS

COPYRIGHT & LICENSE

Copyright 2007 Johann Petrak, all rights reserved.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.