Alvis::NLPPlatform - Perl extension for linguistically annotating XML documents in Alvis
Standalone mode:
use Alvis::NLPPlatform;

Alvis::NLPPlatform::standalone_main(\%config, $doc_xml, \*STDOUT);
Distributed mode:
# Server process
use Alvis::NLPPlatform;
Alvis::NLPPlatform::server($rcfile);

# Client process
use Alvis::NLPPlatform;
Alvis::NLPPlatform::client($rcfile);
This module is the main part of the Alvis NLP platform. It provides overall methods for the linguistic annotation of web documents. Linguistic annotations depend on the configuration variables and dependencies between linguistic steps.
Input documents are assumed to be in the ALVIS XML format (standalone_main) or to be loaded into a hashtable (client_main). The annotated document is written to the given descriptor (standalone_main) or returned as a hashtable (client_main).
The ALVIS format is described here:
http://www.alvis.info/alvis/Architecture_2fFormats?action=show&redirect=architecture%2Fformats#documents
The DTD and XSD are provided in etc/alvis-nlpplatform.
Tokenization: this step has no dependency. It is required for any following annotation level.
Named Entity Tagging: this step requires tokenization.
Word segmentation: this step requires tokenization. The Named Entity Tagging step is recommended to improve the segmentation.
Sentence segmentation: this step requires tokenization. The Named Entity Tagging step is recommended to improve the segmentation.
Part-Of-Speech Tagging: this step requires tokenization, and word and sentence segmentation.
Lemmatization: this step requires tokenization, word and sentence segmentation, and Part-of-Speech tagging.
Term Tagging: this step requires tokenization, word and sentence segmentation, and Part-of-Speech tagging. Lemmatization is recommended to improve the term recognition.
Parsing: this step requires tokenization, word and sentence segmentation. Term tagging is recommended to improve the parsing of noun phrases.
Semantic feature tagging: To be determined
Semantic relation tagging: To be determined
Anaphora resolution: To be determined
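The dependency chain above can be sketched as a simple propagation over boolean configuration flags. This is an illustrative sketch, not the module's actual compute_dependencies code; the ENABLE_* names follow the linguistic_annotation section of the configuration file.

```perl
# Sketch: propagating annotation-step dependencies over config flags.
use strict;
use warnings;

my %step = (
    ENABLE_POS      => 1,   # the only step the user asked for
    ENABLE_TOKEN    => 0,
    ENABLE_WORD     => 0,
    ENABLE_SENTENCE => 0,
);

# POS tagging requires word and sentence segmentation...
if ($step{ENABLE_POS}) {
    $step{ENABLE_WORD}     = 1;
    $step{ENABLE_SENTENCE} = 1;
}
# ...and any segmentation step requires tokenization.
if ($step{ENABLE_WORD} || $step{ENABLE_SENTENCE}) {
    $step{ENABLE_TOKEN} = 1;
}

print join(",", map { "$_=$step{$_}" } sort keys %step), "\n";
```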
compute_dependencies($hash_config);
This method processes the configuration variables defining the linguistic annotation steps. $hash_config is the reference to the hashtable containing the variables defined in the configuration file. The dependencies between the linguistic annotations are then encoded: for instance, asking for POS annotation implies tokenization and word and sentence segmentation.
starttimer()
This method records the current date and time. It is used to compute the time of a processing step.
endtimer();
This method ends the timer and returns the time of a processing step, according to the time recorded by starttimer().
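A minimal sketch of the starttimer()/endtimer() pair, assuming a Time::HiRes-based implementation (the platform's actual code may differ):

```perl
# Sketch of the timer pair used to measure a processing step.
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

my $t0;

sub starttimer { $t0 = [gettimeofday] }       # record the current date and time
sub endtimer   { return tv_interval($t0) }    # elapsed seconds since starttimer()

starttimer();
select(undef, undef, undef, 0.01);            # simulate a 10 ms processing step
my $elapsed = endtimer();
printf "step took %.3f s\n", $elapsed;
```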
linguistic_annotation($h_config,$doc_hash);
This method carries out the linguistic annotation according to the list of required annotations. Required annotations are defined by the configuration variables ($h_config is the reference to the hashtable containing the variables defined in the configuration file).
The document to annotate is passed as a hash table ($doc_hash). The method adds annotations to this hash table.
standalone($config, $HOSTNAME, $doc);
This method is used to annotate a document in the standalone mode of the platform. The document $doc is given in the ALVIS XML format.
The reference $config points to the hashtable containing the configuration variables. The variable $HOSTNAME is the host name.
The method returns the annotated document.
standalone_main($hash_config, $doc_xml, \*STDOUT);
This method is used to annotate a document in the standalone mode of the platform. The document ($doc_xml) is given in the ALVIS XML format.
The document is loaded into memory and then annotated according to the steps defined in the configuration variables ($hash_config is the reference to the hashtable containing the variables defined in the configuration file). The annotated document is printed to the file defined by the descriptor given as a parameter (in the example above, the standard output). $printCollectionHeaderFooter indicates whether the documentCollection header and footer have to be printed.
The function returns the time of the XML rendering.
client_main($doc_hash, $r_config);
This method is used to annotate a document in the distributed mode of the NLP platform. The document, given in the ALVIS XML format, is already loaded into memory ($doc_hash).
The document is annotated according to the steps defined in the configuration variables. The annotated document is returned to the calling method.
load_config($rcfile);
The method loads the configuration of the NLP Platform by reading the configuration file given as an argument.
print_config($config);
The method prints the configuration loaded from a file and contained in the hash reference $config.
client($rcfile)
This is the main method for the client process. $rcfile is the file name containing the configuration.
sigint_handler($signal);
This method is used to catch the INT signal and send an ABORTING message to the server.
server($rcfile)
This is the main method for the server process. $rcfile is the file name containing the configuration.
disp_log($hostname,$message);
This method prints the message ($message) on the standard error output, in a formatted way:
date: (client=hostname) message
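The formatting can be sketched as follows (the exact date format is an assumption):

```perl
# Sketch of the "date: (client=hostname) message" log line.
use strict;
use warnings;
use POSIX qw(strftime);

sub format_log {
    my ($hostname, $message) = @_;
    my $date = strftime("%Y-%m-%d %H:%M:%S", localtime);
    return "$date: (client=$hostname) $message";
}

print format_log("node1.example.org", "document annotated"), "\n";
```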
split_to_docRecs($xml_docs);
This method splits a list of documents into a table and returns it. Each element of the table is a two-element table containing the document id and the document.
sub_dir_from_id($doc_id)
This method returns the subdirectory where the annotated document will be stored. It computes the subdirectory from the first two characters of the document id ($doc_id).
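A sketch of the id-to-subdirectory computation described above:

```perl
# Sketch: derive the storage subdirectory from the first two
# characters of the document id.
use strict;
use warnings;

sub sub_dir_from_id {
    my ($doc_id) = @_;
    return substr($doc_id, 0, 2);
}

# e.g. a document with id "1a2b3c" is stored under subdirectory "1a"
print sub_dir_from_id("1a2b3c"), "\n";
```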
record_id($doc_id, $r_config);
This method records, in the file $ALVISTMP/.proc_id, the id of the document that has been sent to the client.
delete_id($doc_id,$r_config);
This method deletes the id of the document that has been sent to the client from the file $ALVISTMP/.proc_id.
init_server($r_config);
This method initializes the server. It reads the document ids from the file $ALVISTMP/.proc_id and loads the corresponding documents, i.e. documents which have been annotated but not recorded due to a server crash.
token_id_is_in_list_refid_token($list_refid_token, $token_to_search);
The method returns 1 if the token $token_to_search is in the list $list_refid_token, and 0 otherwise.
token_id_follows_list_refid_token($list_refid_token, $token_to_search);
The method returns 1 if the token $token_to_search immediately follows the last token of the list $list_refid_token, and 0 otherwise.
token_id_just_before_last_of_list_refid_token($list_refid_token, $token_to_search);
The method returns 1 if the token $token_to_search is just before the first token of the list $list_refid_token, and 0 otherwise.
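A minimal sketch of these three tests. The shortened function names, the "tokenN" id scheme, and the rank arithmetic are assumptions made for illustration:

```perl
# Sketch of the token-list membership and adjacency tests.
use strict;
use warnings;

# Extract the numeric rank from an id such as "token12".
sub token_rank { my ($id) = @_; return $id =~ /(\d+)$/ ? $1 : -1 }

sub token_id_is_in_list {
    my ($list, $token) = @_;
    return (grep { $_ eq $token } @$list) ? 1 : 0;
}

sub token_id_follows_list {
    my ($list, $token) = @_;
    return token_rank($token) == token_rank($list->[-1]) + 1 ? 1 : 0;
}

sub token_id_just_before_list {
    my ($list, $token) = @_;
    return token_rank($token) == token_rank($list->[0]) - 1 ? 1 : 0;
}

my @list = ("token3", "token4", "token5");
print token_id_is_in_list(\@list, "token4"), "\n";        # in the list
print token_id_follows_list(\@list, "token6"), "\n";      # follows token5
print token_id_just_before_list(\@list, "token2"), "\n";  # just before token3
```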
unparseable_id($id)
The method checks whether the id has been parsed or not. If not, it prints a warning.
platform_reset()
The method empties or resets the structures and variables attached to a processed document.
The configuration file of the NLP Platform is composed of global variables and divided into several sections:
Global variables.
The two mandatory variables are ALVISTMP and PRESERVEWHITESPACE (in the XML_INPUT section).
ALVISTMP: it defines the temporary directory used during the annotation process. The files produced during the annotation step (XML files and input/output of the NLP tools) are recorded there. It must be writable by the user the process is running as.
DEBUG: this variable indicates whether the NLP platform runs in debug mode. The values are 1 (debug mode) or 0 (no debug mode). Default value is 0. The main consequence of the debug mode is to keep the temporary files.
Additional variables and environment variables can be used if they are interpolated in the configuration file. For instance, in the default configuration file, we add:
PLATFORM_ROOT: the directory where the NLP tools and resources are installed.
NLP_tools_root: the root directory where the NLP tools are installed.
AWK: the path to awk.
SEMTAG_EN_DIR: the directory where the semantic tagger is installed.
ONTOLOGY: the path to the ontology for the semanticTypeTagger (trish2 format -- see the documentation of the semanticTypeTagger).
CANONICAL_DICT: the path to the dictionary with the canonical forms of the semantic units (trish2 format -- see the documentation of the semanticTypeTagger).
PARENT_DICT: the path to the dictionary with the parent nodes of the semantic units (trish2 format -- see the documentation of the semanticTypeTagger).
Section alvis_connection
HARVESTER_PORT: the port of the harvester/crawler (combine) that the platform will read from to get the documents to annotate.
NEXTSTEP: indicates whether there is a next step in the pipeline (for instance, the indexer IdZebra). The value is 0 or 1.
NEXTSTEP_HOST: the host name of the component that the platform will send the annotated documents to.
NEXTSTEP_PORT: the port of the component that the platform will send the annotated documents to.
SPOOLDIR: the directory where the documents coming from the harvester are stored. It must be writable by the user the process is running as.
OUTDIR: the directory where the annotated documents are stored if SAVE_IN_OUTDIR (in Section NLP_misc) is set.
Section NLP_connection
SERVER: the host name where the NLP server is running, for the connections with the NLP clients.
PORT: the listening port of the NLP server, for the connections with the NLP clients.
RETRY_CONNECTION: the number of times that a client attempts to connect to the server.
XML_INPUT
PRESERVEWHITESPACE is a boolean indicating whether the linguistic annotation is carried out preserving white space or not, i.e. XML blank nodes, white space at the beginning and the end of any line, and the indentation of the text in the canonicalDocument.
Default value is 0, or false (blank nodes and indentation characters are removed).
LINGUISTIC_ANNOTATION_LOADING: indicates whether linguistic annotations already existing in the input documents are loaded. Default value is 1, or true (linguistic annotations are loaded).
XML_OUTPUT (Not available yet)
NO_STD_XML_OUTPUT: the standard XML output is not printed. Default value is false.
FORM
ID
Section linguistic_annotation
This section defines the NLP steps that will be used for annotating documents. The values are 0 or 1.
ENABLE_TOKEN: toggles the tokenization step.
ENABLE_NER: toggles the named entity recognition step.
ENABLE_WORD: toggles the word segmentation step.
ENABLE_SENTENCE: toggles the sentence segmentation step.
ENABLE_POS: toggles the Part-of-Speech tagging step.
ENABLE_LEMMA: toggles the lemmatization step.
ENABLE_TERM_TAG: toggles the term tagging step.
ENABLE_SYNTAX: toggles the parsing step.
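For illustration, a hypothetical fragment of such a configuration file. The Apache-like section syntax shown here is an assumption; check the configuration file shipped with the distribution for the exact syntax:

```
ALVISTMP = /tmp/alvis

<linguistic_annotation>
    ENABLE_TOKEN    = 1
    ENABLE_NER      = 1
    ENABLE_WORD     = 1
    ENABLE_SENTENCE = 1
    ENABLE_POS      = 1
    ENABLE_LEMMA    = 1
    ENABLE_TERM_TAG = 0
    ENABLE_SYNTAX   = 0
</linguistic_annotation>
```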
Section NLP_misc
This section defines miscellaneous variables for the NLP annotation steps.
NLP_resources: the root directory where NLP resources can be found.
SAVE_IN_OUTDIR: enables saving the annotated documents in the OUTDIR directory.
TERM_LIST_EN: the path of the term list for English.
TERM_LIST_FR: the path of the term list for French.
Section NLP_tools
This section defines the command lines for the NLP tools integrated in the platform.
Additional variables and environment variables can be used for interpolation.
NETAG_EN: command line for the Named Entity Recognizer for English.
NETAG_FR: command line for the Named Entity Recognizer for French.
WORDSEG_EN: command line for the word segmenter for English.
WORDSEG_FR: command line for the word segmenter for French.
POSTAG_EN: command line for the Part-of-Speech tagger for English.
POSTAG_FR: command line for the Part-of-Speech tagger for French.
SYNTACTIC_ANALYSIS_EN: command line for the parser for English.
SYNTACTIC_ANALYSIS_FR: command line for the parser for French.
TERM_TAG_EN: command line for the term tagger for English.
TERM_TAG_FR: command line for the term tagger for French.
SEMTAG_EN: command line for the semantic tagger for English.
SEMTAG_FR: command line for the semantic tagger for French.
Section CONVERTERS
This section defines the converters for the MIME types and additional information (see the following subsections).
Each line of this section indicates the command line for the corresponding MIME type.
Section STYLESHEET
This section defines the command lines (the program and the stylesheet) to apply according to the namespace. Each line defines a variable (the name is the namespace); the value is the command line.
A default command line is defined by the variable default, i.e. the command line used for an unknown namespace.
SupplMagicFile: this variable indicates the file defining the additional MIME types.
StoreInputFiles: this internal variable indicates whether the converted input files are stored in a directory.
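As an illustration only (the paths, the namespace, and the use of xsltproc are invented), a STYLESHEET section might associate a command line with a namespace and define a fallback via the default variable:

```
<STYLESHEET>
    default                  = "/usr/bin/xsltproc /path/to/default.xsl"
    http://purl.org/rss/1.0/ = "/usr/bin/xsltproc /path/to/rss.xsl"
</STYLESHEET>
```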
Several NLP tools have been integrated through wrappers. In this section, we summarize how to download and install the NLP tools used by default in the Alvis::NLPPlatform::NLPWrappers.pm module. We also give additional information about the tools.
We integrated TagEn as the default named entity tagger.
Form:
sources, binaries and Perl scripts
Obtain:
http://www-lipn.univ-paris13.fr/~hamon/ALVIS/Tools/TagEN.tar.gz
Install:
untar TagEN.tar.gz in a directory
go to the src directory
run the compile script
Licence:
GPL
Version number required:
any
Additional information:
This named entity tagger can be run in various modes. A mode is defined by Unitex (http://www-igm.univ-mlv.fr/~unitex/) graphs. The tagger can be used for English and French texts.
The word and sentence segmenter we use by default is an awk script posted by Gregory Grefenstette on the Corpora mailing list. We modified it to segment French texts.
AWK script
http://www-lipn.univ-paris13.fr/~hamon/ALVIS/Tools/WordSeg.tar.gz
untar WordSeg.tar.gz in a directory
any (modifications for French by Paris 13)
The default wrapper calls the TreeTagger. This tool is a Part-of-Speech tagger and lemmatizer.
binary+resources
links and instructions at http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html
Information is given on the web site. To summarize, you need to:
make a directory named, for instance, TreeTagger
download the archives into tools/TreeTagger
go into the directory tools/TreeTagger
run install-tagger.sh
free for research only
(by date) >= 09.04.1996
We have integrated a tool developed specifically for the Alvis project. It is required when installing the platform.
Perl module
On CPAN, http://search.cpan.org/~thhamon/Alvis-TermTagger-0.3/
perl Makefile.PL
make
make install
GeniaTagger (POS and lemma tagger):
source+resources
links and instructions at http://www-tsujii.is.s.u-tokyo.ac.jp/~genia/postagger/geniatagger-2.0.1.tar.gz
untar geniatagger-2.0.1.tar.gz in a directory
cd tools/geniatagger-2.0.1
run make
free for research only (and Wordnet licence for the dictionary)
2.0.1
Link Grammar Parser:
sources + resources
http://www.link.cs.cmu.edu/link/ftp-site/link-grammar/link-4.1b/unix/link-4.1b.tar.gz
untar link-4.1b.tar.gz
see the Makefile for configuration
run make
apply the additional patch for the Link Grammar parser (lib/Alvis/NLPPlatform/patches):
cd link-4.1b
patch -p0 < lib/Alvis/NLPPlatform/patches/link-4.1b-WithWhiteSpace.diff
A similar patch exists for version 4.1a of the Link Grammar parser.
Compatible with GPL
4.1a or 4.1b
BioLG:
http://www.it.utu.fi/biolg/
untar the archive
see the Makefile for configuration
run make
1.1.11
additional programs
The main characteristic of the NLP platform is its tunability to the domain (the language specificity of the documents to be annotated) and to the user requirements. The tuning can be done at two levels:
either resources adapted to, or describing more precisely, the domain can be exploited.
In that respect, tuning concerns the integration of these resources in the NLP tools used in the platform. The command line in the configuration file can be modified.
An example of resource switching can be found at the named entity recognition step: the default Named Entity tagger can use either bio-medical or more general resources, according to the value of the parameter -t.
either other NLP tools can be integrated in the NLP platform.
In that case, new wrappers should be written. To ease the integration of a new NLP tool, we use polymorphism to override the default wrappers. The NLP platform package is defined as a three-level hierarchy: the Alvis::NLPPlatform package is at the top, the Alvis::NLPPlatform::NLPWrappers package is at the bottom, and the Alvis::NLPPlatform::UserNLPWrappers package sits between the two. Integrating a new NLP tool, and hence writing a new wrapper, therefore amounts to modifying methods in Alvis::NLPPlatform::UserNLPWrappers, calling the default methods or not.
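The override mechanism can be illustrated with stand-in package names (the My::* packages stand in for the Alvis::NLPPlatform::* packages, and the tokenize method is purely illustrative):

```perl
# Sketch of the wrapper hierarchy: a user package overrides a default
# wrapper method and may still call the default implementation.
use strict;
use warnings;

package My::NLPWrappers;          # stands in for Alvis::NLPPlatform::NLPWrappers
sub tokenize { return "default tokenization" }

package My::UserNLPWrappers;      # stands in for Alvis::NLPPlatform::UserNLPWrappers
our @ISA = ('My::NLPWrappers');
sub tokenize {
    my ($class) = @_;
    # call the default wrapper, then post-process its result
    my $out = $class->SUPER::tokenize();
    return "$out + custom step";
}

package main;
print My::UserNLPWrappers->tokenize(), "\n";
```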
NB: If the package Alvis::NLPPlatform::UserNLPWrappers is not writable by the user, the tuning can be done by copying Alvis::NLPPlatform::UserNLPWrappers into a local directory and adding this local directory to the PERL5LIB variable (before the path of Alvis::NLPPlatform).
NB: A template for the package Alvis::NLPPlatform::UserNLPWrappers can be found in Alvis::NLPPlatform::UserNLPWrappers-template.
An example of such tuning can be found at the parsing level: we integrated a parser designed for biological documents in Alvis::NLPPlatform::UserNLPWrappers.
Requesting a document:
REQUEST
SENDING
SIZE
DONE
ACK
Returning a document:
GIVEBACK
Aborting the annotation process:
ABORTING
Exiting:
the server understands the following messages: QUIT, LOGOUT and EXIT. However, these have not been implemented in the client yet.
Alvis web site: http://www.alvis.info
Description of the input/output format: http://www.alvis.info/alvis/Architecture_2fFormats?action=show&redirect=architecture%2Fformats#documents
Thierry Hamon <thierry.hamon@lipn.univ-paris13.fr> and Julien Deriviere <julien.deriviere@lipn.univ-paris13.fr>
Copyright (C) 2005 by Thierry Hamon and Julien Deriviere
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.6 or, at your option, any later version of Perl 5 you may have available.