Module Version: 0.6

NAME

Alvis::NLPPlatform - Perl extension for linguistically annotating XML documents in Alvis

SYNOPSIS

DESCRIPTION

This module is the main part of the Alvis NLP platform. It provides overall methods for the linguistic annotation of web documents. Linguistic annotations depend on the configuration variables and dependencies between linguistic steps.

Input documents are assumed to be in the ALVIS XML format (standalone_main) or to be loaded in a hashtable (client_main). The annotated document is recorded in the given descriptor (standalone_main) or returned as a hashtable (client_main).

The ALVIS format is described here:

http://www.alvis.info/alvis/Architecture_2fFormats?action=show&redirect=architecture%2Fformats#documents

The DTD and XSD are provided in etc/alvis-nlpplatform.

Linguistic annotation: requirements

  1.  Tokenization: this step has no dependency. It is required for
      any following annotation level.
  2.  Named Entity Tagging: this step requires tokenization.
  3.  Word segmentation: this step requires tokenization. The Named
      Entity Tagging step is recommended to improve the segmentation.
  4.  Sentence segmentation: this step requires tokenization. The Named
      Entity Tagging step is recommended to improve the segmentation.
  5.  Part-of-Speech Tagging: this step requires tokenization, and word
      and sentence segmentation.
  6.  Lemmatization: this step requires tokenization, word and sentence
      segmentation, and Part-of-Speech tagging.
  7.  Term Tagging: this step requires tokenization, word and sentence
      segmentation, and Part-of-Speech tagging. Lemmatization is
      recommended to improve the term recognition.
  8.  Parsing: this step requires tokenization, and word and sentence
      segmentation. Term tagging is recommended to improve the parsing
      of noun phrases.
  9.  Semantic feature tagging: to be determined.
  10. Semantic relation tagging: to be determined.
  11. Anaphora resolution: to be determined.
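
The hard requirements listed above can be sketched as a lookup table with a transitive resolver. This is only an illustration, not the module's implementation: the step names, the %REQUIRES table and the resolve_steps function are hypothetical, and the "recommended" steps are deliberately left out.

```perl
use strict;
use warnings;

# Hypothetical step names; hard requirements only, copied from the list above.
my %REQUIRES = (
    tokenization          => [],
    named_entity_tagging  => ['tokenization'],
    word_segmentation     => ['tokenization'],
    sentence_segmentation => ['tokenization'],
    pos_tagging           => [qw(tokenization word_segmentation sentence_segmentation)],
    lemmatization         => [qw(tokenization word_segmentation sentence_segmentation pos_tagging)],
    term_tagging          => [qw(tokenization word_segmentation sentence_segmentation pos_tagging)],
    parsing               => [qw(tokenization word_segmentation sentence_segmentation)],
);

# Return the sorted list of every step needed to run the requested ones.
sub resolve_steps {
    my @todo = @_;
    my %needed;
    while (defined(my $step = shift @todo)) {
        die "unknown step: $step\n" unless exists $REQUIRES{$step};
        next if $needed{$step}++;          # already handled
        push @todo, @{ $REQUIRES{$step} };
    }
    return sort keys %needed;
}

print join(', ', resolve_steps('pos_tagging')), "\n";
```

Asking for pos_tagging pulls in tokenization and both segmentation steps, which is the behaviour the list describes for step 5.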

METHODS

compute_dependencies()

    compute_dependencies($hashtable_config);

This method processes the configuration variables defining the linguistic annotation steps. $hashtable_config is the reference to the hashtable containing the variables defined in the configuration file. The dependencies between the linguistic annotation steps are then applied. For instance, asking for POS annotation implies tokenization, and word and sentence segmentation.

starttimer()

    starttimer();

This method records the current date and time. It is used to compute the time of a processing step.

endtimer()

    endtimer();

This method ends the timer and returns the time of a processing step, according to the time recorded by starttimer().
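
A start/end timer pair of this kind can be sketched with the core Time::HiRes module. The function names, the $timer_start variable and the choice of Time::HiRes are assumptions made for the illustration; the module's own timer may be implemented differently.

```perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

my $timer_start;    # hypothetical variable holding the recorded start time

# Record the current time.
sub starttimer_sketch {
    $timer_start = [gettimeofday];
    return;
}

# Return the number of seconds elapsed since starttimer_sketch() was called.
sub endtimer_sketch {
    return tv_interval($timer_start, [gettimeofday]);
}

starttimer_sketch();
my $total = 0;
$total += $_ for 1 .. 100_000;      # stands in for a processing step
my $elapsed = endtimer_sketch();
printf "step took %.6f s\n", $elapsed;
```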

linguistic_annotation()

    linguistic_annotation($h_config,$doc_hash);

This method carries out the linguistic annotation according to the list of required annotations. Required annotations are defined by the configuration variables ($h_config is the reference to the hashtable containing the variables defined in the configuration file).

The document to annotate is passed as a hash table ($doc_hash). The method adds annotation to this hash table.

standalone()

    standalone($config, $HOSTNAME, $doc);

This method is used to annotate a document in the standalone mode of the platform. The document $doc is given in the ALVIS XML format.

The reference to the hashtable $config contains the configuration variables. The variable $HOSTNAME is the host name.

The method returns the annotated document.

standalone_main()

    standalone_main($hash_config, $doc_xml, \*STDOUT);

This method is used to annotate a document in the standalone mode of the platform. The document ($doc_xml) is given in the ALVIS XML format.

The document is loaded into memory and then annotated according to the steps defined in the configuration variables ($hash_config is the reference to the hashtable containing the variables defined in the configuration file). The annotated document is printed to the file defined by the descriptor given as parameter (in the given example, the standard output). $printCollectionHeaderFooter, which does not appear in the example call above, indicates whether the documentCollection header and footer have to be printed.

The function returns the time of the XML rendering.

client_main()

    client_main($doc_hash, $r_config);

This method is used to annotate a document in the distributed mode of the NLP platform. The document, given in the ALVIS XML format, is already loaded into memory ($doc_hash).

The document is annotated according to the steps defined in the configuration variables. The annotated document is returned to the calling method.

load_config()

    load_config($rcfile);

The method loads the configuration of the NLP Platform by reading the configuration file given as argument.

print_config()

    print_config($config);

The method prints the configuration loaded from a file and contained in the hash reference $config.

client()

    client($rcfile);

This is the main method for the client process. $rcfile is the file name containing the configuration.

sigint_handler()

    sigint_handler($signal);

This method is used to catch the INT signal and send an ABORTING message to the server.

server()

    server($rcfile);

This is the main method for the server process. $rcfile is the file name containing the configuration.

disp_log()

    disp_log($hostname,$message);

This method prints the message ($message) on the standard error output, in a formatted way:

    date: (client=hostname) message
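
The format above can be reproduced with a small helper. The name format_log_line, the optional $epoch parameter and the timestamp layout are assumptions made for the illustration; the source does not say how the date is rendered.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Build one log line; the timestamp layout is an assumption.
sub format_log_line {
    my ($hostname, $message, $epoch) = @_;
    $epoch = time unless defined $epoch;
    my $date = strftime('%Y-%m-%d %H:%M:%S', localtime($epoch));
    return "$date: (client=$hostname) $message";
}

# Write the line to the standard error output, as disp_log() is described to do.
print STDERR format_log_line('node01', 'document received'), "\n";
```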

split_to_docRecs()

    split_to_docRecs($xml_docs);

This method splits a list of documents into a table and returns it. Each element of the table is a two-element table containing the document id and the document.
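
The shape of the returned table can be illustrated as follows. The sketch assumes that each document is a documentRecord element carrying an id attribute and that a regular expression is enough to separate them; both are assumptions for the example, and a real XML parser would be more robust. The name split_records_sketch is hypothetical.

```perl
use strict;
use warnings;

# Return a reference to a list of [id, xml] pairs, one per documentRecord.
sub split_records_sketch {
    my ($xml_docs) = @_;
    my @records;
    while ($xml_docs =~ m{(<documentRecord\b[^>]*\bid="([^"]*)"[^>]*>.*?</documentRecord>)}sg) {
        push @records, [ $2, $1 ];    # [ document id, document text ]
    }
    return \@records;
}

my $xml = <<'XML';
<documentCollection>
<documentRecord id="A1"><acquisition/></documentRecord>
<documentRecord id="B2"><acquisition/></documentRecord>
</documentCollection>
XML

my $recs = split_records_sketch($xml);
print "$_->[0]\n" for @$recs;
```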

sub_dir_from_id()

    sub_dir_from_id($doc_id);

This method returns the subdirectory where the annotated document will be stored. It computes the subdirectory from the first two characters of the document id ($doc_id).
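
Taking the first two characters of an id is a one-line operation in Perl. The function name sub_dir_sketch and the sample id are hypothetical; how the module combines the result with a base directory is not shown here.

```perl
use strict;
use warnings;

# Return the first two characters of the document id.
sub sub_dir_sketch {
    my ($doc_id) = @_;
    return substr($doc_id, 0, 2);
}

print sub_dir_sketch('4F2A9C'), "\n";    # prints "4F"
```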

record_id()

    record_id($doc_id, $r_config);

This method records in the file $ALVISTMP/.proc_id, the id of the document that has been sent to the client.

delete_id()

    delete_id($doc_id,$r_config);

This method deletes the id of the document that has been sent to the client from the file $ALVISTMP/.proc_id.

init_server()

    init_server($r_config);

This method initializes the server. It reads the document ids from the file $ALVISTMP/.proc_id and loads the corresponding documents, i.e. documents which have been annotated but not recorded due to a server crash.
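
The bookkeeping performed by record_id(), delete_id() and init_server() can be sketched with a plain text file. The one-id-per-line layout, the function names ending in _sketch and the temporary directory standing in for $ALVISTMP are all assumptions; the sketch does no file locking.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $tmp_dir   = tempdir(CLEANUP => 1);    # stands in for $ALVISTMP
my $proc_file = "$tmp_dir/.proc_id";

# Append an id to the file (one id per line is an assumption).
sub record_id_sketch {
    my ($doc_id) = @_;
    open my $fh, '>>', $proc_file or die "cannot append to $proc_file: $!";
    print {$fh} "$doc_id\n";
    close $fh or die "cannot close $proc_file: $!";
    return;
}

# Read back all pending ids; a missing file means nothing is pending.
sub pending_ids_sketch {
    return () unless -e $proc_file;
    open my $fh, '<', $proc_file or die "cannot read $proc_file: $!";
    chomp(my @ids = <$fh>);
    close $fh;
    return @ids;
}

# Rewrite the file without the given id.
sub delete_id_sketch {
    my ($doc_id) = @_;
    my @kept = grep { $_ ne $doc_id } pending_ids_sketch();
    open my $fh, '>', $proc_file or die "cannot write $proc_file: $!";
    print {$fh} "$_\n" for @kept;
    close $fh or die "cannot close $proc_file: $!";
    return;
}

record_id_sketch($_) for qw(doc1 doc2 doc3);
delete_id_sketch('doc2');
my @pending = pending_ids_sketch();    # what a restarted server would find
print "@pending\n";
```

Ids still present in the file after a crash correspond to documents that were sent to a client but never recorded, which is what the initialization step looks for.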

token_id_is_in_list_refid_token()

    token_id_is_in_list_refid_token($list_refid_token, $token_to_search);

The method returns 1 if the token $token_to_search is in the list $list_refid_token, 0 otherwise.

token_id_follows_list_refid_token()

    token_id_follows_list_refid_token($list_refid_token, $token_to_search);

The method returns 1 if the token $token_to_search immediately follows the last token of the list $list_refid_token, 0 otherwise.

token_id_just_before_last_of_list_refid_token()

    token_id_just_before_last_of_list_refid_token($list_refid_token, $token_to_search);

The method returns 1 if the token $token_to_search is just before the first token of the list $list_refid_token, 0 otherwise.
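
The membership and "follows" tests can be illustrated as below. The sketch assumes that token ids end in a sequence number (for example "token12") and that the list is an array reference; neither is stated in this documentation, and the function names are hypothetical.

```perl
use strict;
use warnings;

# Extract the numeric suffix of an id such as "token12" (id format assumed).
sub token_number {
    my ($id) = @_;
    return $id =~ /(\d+)\z/ ? $1 : undef;
}

# 1 if $token is one of the ids in the list, 0 otherwise.
sub in_list_sketch {
    my ($list, $token) = @_;
    return (grep { $_ eq $token } @$list) ? 1 : 0;
}

# 1 if $token is numbered right after the last id of the list, 0 otherwise.
sub follows_list_sketch {
    my ($list, $token) = @_;
    return 0 unless @$list;
    my $last = token_number($list->[-1]);
    my $n    = token_number($token);
    return (defined $last && defined $n && $n == $last + 1) ? 1 : 0;
}

my $list = [qw(token3 token4 token5)];
print in_list_sketch($list, 'token4'), "\n";
print follows_list_sketch($list, 'token6'), "\n";
```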

unparseable_id()

    unparseable_id($id);

The method checks whether the id has been parsed or not. If not, it prints a warning.

platform_reset()

    platform_reset();

The method empties or resets the structures and variables attached to a processed document.

PLATFORM CONFIGURATION

The configuration file of the NLP Platform is composed of global variables and divided into several sections:

DEFAULT INTEGRATED/WRAPPED NLP TOOLS

Several NLP tools have been integrated in wrappers. In this section, we summarize how to download and install the NLP tools used by default in the Alvis::NLPPlatform::NLPWrappers module. We also give additional information about the tools.

Named Entity Tagger

We integrated TagEn as the default named entity tagger.

Word and sentence segmenter

The word and sentence segmenter we use by default is an awk script sent by Gregory Grefenstette on the Corpora mailing list. We modified it to segment French texts.

Part-of-Speech Tagger

The default wrapper calls TreeTagger. This tool is a Part-of-Speech tagger and lemmatizer.

Term Tagger

We have integrated a tool developed specifically for the Alvis project. It is required when installing the platform.

Part-of-Speech tagger specialized for biological texts

GeniaTagger (POS and lemma tagger):

Parser

Link Grammar Parser:

Parser specialized for biological texts

BioLG:

TUNING THE NLP PLATFORM

The main characteristic of the NLP platform is its tunability according to the domain (language specificity of the documents to be annotated) and the user requirements. The tuning can be done at two levels:

PROTOCOL

SEE ALSO

Alvis web site: http://www.alvis.info

Description of the input/output format: http://www.alvis.info/alvis/Architecture_2fFormats?action=show&redirect=architecture%2Fformats#documents

AUTHORS

Thierry Hamon <thierry.hamon@lipn.univ-paris13.fr> and Julien Deriviere <julien.deriviere@lipn.univ-paris13.fr>

LICENSE

Copyright (C) 2005 by Thierry Hamon and Julien Deriviere

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.6 or, at your option, any later version of Perl 5 you may have available.