Lingua::YaTeA - Perl extension for extracting terms from a corpus and providing a syntactic analysis in a head-modifier format.
use Lingua::YaTeA;
my %config = Lingua::YaTeA::load_config($rcfile);
my $yatea = Lingua::YaTeA->new($config{"OPTIONS"}, \%config);
my $corpus = Lingua::YaTeA::Corpus->new($corpus_path, $yatea->getOptionSet, $yatea->getMessageSet);
$yatea->termExtraction($corpus);
This module is the main module of the YaTeA software. It extracts noun phrases that look like terms from a corpus and provides their syntactic analysis in a head-modifier representation. As input, the term extractor requires a corpus that has been segmented into words and sentences, lemmatized, and tagged with part-of-speech (POS) information. The input file is encoded in UTF-8. The implementation of this term extractor makes it possible to process large corpora. The data provided with YaTeA support the extraction of terms from English and French texts, but new linguistic features can be integrated to extract terms from another language. Moreover, linguistic features can be modified or created for a sub-language or tagset.
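The expected input can be pictured with a small reader. This is a minimal sketch, assuming TreeTagger-style lines (word, POS tag and lemma separated by tabs) and the SENT tag as sentence boundary; the helper read_tagged_sentences is hypothetical and is not part of Lingua::YaTeA.

```perl
use strict;
use warnings;

# Toy reader for TreeTagger-style input: one token per line,
# "word<TAB>POS<TAB>lemma". The SENT tag closes a sentence.
# Hypothetical helper, not part of Lingua::YaTeA.
sub read_tagged_sentences {
    my ($text) = @_;
    my (@sentences, @current);
    for my $line (split /\n/, $text) {
        next if $line =~ /^\s*$/;                 # skip blank lines
        my ($form, $pos, $lemma) = split /\t/, $line;
        next unless defined $lemma;               # skip malformed lines
        push @current, { form => $form, pos => $pos, lemma => $lemma };
        if ($pos eq 'SENT') {                     # sentence boundary reached
            push @sentences, [@current];
            @current = ();
        }
    }
    push @sentences, [@current] if @current;      # trailing tokens without SENT
    return \@sentences;
}

# Sample built from the term candidate discussed in this documentation.
my $sample = join "\n",
    "Northern\tNN\tnorthern",
    "blot\tNN\tblot",
    "analysis\tNN\tanalysis",
    "of\tIN\tof",
    "cwlH\tNN\tcwlH",
    ".\tSENT\t.";

my $sentences = read_tagged_sentences($sample);
printf "%d sentence(s) read\n", scalar @$sentences;
```

Each sentence is returned as a list of hashes with the keys form, pos and lemma, which are names chosen for this illustration only.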
For the use of YaTeA, see the documentation of the yatea script.
The main strategy for analysing term candidates relies on simple parsing patterns and endogenous disambiguation. Exogenous disambiguation is also possible for the identification and analysis of term candidates, through the use of external resources, i.e. lists of testified terms.
Endogenous disambiguation consists in exploiting intermediate chunking and parsing results for the parsing of a given Maximal Noun Phrase (MNP). This feature allows complex noun phrases to be parsed using a limited number of simple parsing patterns (80 patterns containing a maximum of 3 content words in the experiments described below). All the MNPs corresponding to parsing patterns are parsed first. In a second step, the remaining unparsed MNPs are processed using the results of the first step as islands of reliability.
An island of reliability is a subsequence (contiguous or not) of an MNP that corresponds to a shorter term candidate that was parsed during the first step of the parsing process. This subsequence, along with its internal analysis, is used as an anchor in the parsing of the MNP. Islands are used to simplify the POS sequence of an MNP for which no parsing pattern was found: the subsequence covered by the island is reduced to its syntactic head. In addition, islands increase the degree of reliability of the parse.
When no resource is provided, and since there is no parsing pattern defined for the complete POS sequence "NN NN NN of NN" corresponding to the term candidate "Northern blot analysis of cwlH", the progressive method is applied. In that case, the term candidate (TC) is bracketed from right to left, which results in a poor-quality analysis. When the island of reliability "northern blot analysis" is considered, the correct bracketing is found.
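The reduction of an island to its syntactic head can be sketched as follows. This is a toy illustration only, not YaTeA's implementation: the function reduce_with_island and the island description (start, end and head token indices) are hypothetical, the island is assumed to be contiguous, and "analysis" is assumed to be the head of "northern blot analysis".

```perl
use strict;
use warnings;

# Toy illustration: tokens covered by the island are dropped,
# except for the token that is the island's syntactic head.
# Contiguous islands only, for simplicity.
sub reduce_with_island {
    my ($tokens, $island) = @_;   # $island: { start, end, head } as token indices
    my @reduced;
    for my $i (0 .. $#$tokens) {
        my $inside = $i >= $island->{start} && $i <= $island->{end};
        next if $inside && $i != $island->{head};
        push @reduced, $tokens->[$i];
    }
    return \@reduced;
}

# The MNP "Northern blot analysis of cwlH" with POS sequence "NN NN NN of NN".
my @mnp = (
    { form => 'Northern', pos => 'NN' },
    { form => 'blot',     pos => 'NN' },
    { form => 'analysis', pos => 'NN' },
    { form => 'of',       pos => 'of' },
    { form => 'cwlH',     pos => 'NN' },
);

# Island "northern blot analysis" (tokens 0..2), assumed head: "analysis".
my $island  = { start => 0, end => 2, head => 2 };
my $reduced = reduce_with_island(\@mnp, $island);

print join(' ', map { $_->{pos} }  @$reduced), "\n";   # NN of NN
print join(' ', map { $_->{form} } @$reduced), "\n";   # analysis of cwlH
```

The shortened sequence is more likely to correspond to one of the simple parsing patterns than the original five-token sequence, which is the point of the simplification.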
load_config($rcfile);
This method loads the configuration of the NLP Platform by reading the configuration file given as argument. It returns a hash table containing the configuration.
new($command_line_options_h,$system_config_h);
This method creates a new term extractor and sets options from the command line ($command_line_options_h) and options defined in the hash table ($system_config_h), given by reference. The method returns the created object.
termExtraction($corpus);
This method applies an extraction process to the corpus $corpus given as parameter, and stores the results in the directories specified in the configuration files.
setOptions($command_line_options_h);
This method creates an option set. It sets the options defined in the hashtable $command_line_options_h (given by reference) and checks if the language parameter is defined in the configuration.
setConfigFiles($this,$system_config_h);
setLocaleFiles($this,$system_config_h);
addOptionsFromFile($this);
setMessageSet($this,$system_config_h);
setTagSet($this);
setParsingPatterns($this);
setChunkingDataSet($this);
setForbiddenStructureSet($this);
loadTestifiedTerms($this,$process_counter_r,$corpus,$sentence_boundary,$document_boundary,$match_type,$message_set,$display_language);
setTestifiedTermSet($this,$filtering_lexicon_h,$sentence_boundary,$match_type);
getTestifiedTermSet($this);
getFSSet($this);
getConfigFileSet($this);
getLocaleFileSet($this);
getResultFileSet($this);
getOptionSet($this);
This method returns the field OPTION_SET.
getTagSet($this);
getChunkingDataSet($this);
getParsingPatternSet($this);
getMessageSet($this);
getTestifiedSet($this);
addMessageSetFile($this);
displayExtractionResults($this,$phrase_set,$corpus,$message_set,$display_language,$default_output);
The configuration file of YaTeA is divided into two sections:
Section DefaultConfig
CONFIG_DIR : directory containing the configuration files according to the language
LOCALE_DIR : directory containing the environment files according to the language
RESULT_DIR : directory where the results are stored (probably not useful)
Section OPTIONS
language language : Definition of the language of the corpus. Values are FR (French - TreeTagger output - TagSet <http://www.ims.uni-stuttgart.de/~schmid/french-tagset.html>), FR-Flemm (French - output of the Flemm analyser) or EN (English - TreeTagger or GeniaTagger output - PennTreeBank Tagset)
suffix suffix : Specification of a name for the current version of the analysis. Results are gathered in a specific directory of this name and result files also carry this suffix
output-path : set the path to the directory that will contain the results for the current corpus (default: working directory)
termino File : Name of a file containing a list of testified terms. The testified terms have to be provided in the TreeTagger output format.
monolexical-all : all occurrences of monolexical phrases are considered as term candidates. The value is 0 or 1.
monolexical-included : occurrences of monolexical term candidates that appear in complex term candidates are also displayed. The value is 0 or 1.
match-type [loose or strict] :
loose : testified terms match either inflected or lemmatized forms of each word
strict : testified terms match the combination of inflected form and POS tag of each word
unspecified option: testified terms match inflected forms of words
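The three matching behaviours can be compared on a single word. This is a minimal sketch under one possible reading of the descriptions above (for loose, either the inflected forms or the lemmas coincide); the function word_matches and the form, pos and lemma keys are hypothetical and do not reflect how YaTeA stores or compares testified terms internally.

```perl
use strict;
use warnings;

# Toy matcher for one testified-term word against one corpus token.
# $match_type is 'loose', 'strict', or undef (option not specified).
sub word_matches {
    my ($term_word, $token, $match_type) = @_;
    $match_type = '' unless defined $match_type;

    my $same_form  = $term_word->{form}  eq $token->{form};
    my $same_lemma = $term_word->{lemma} eq $token->{lemma};
    my $same_pos   = $term_word->{pos}   eq $token->{pos};

    # loose: inflected form or lemmatized form may match
    return ($same_form || $same_lemma) ? 1 : 0 if $match_type eq 'loose';
    # strict: inflected form and POS tag must both match
    return ($same_form && $same_pos) ? 1 : 0 if $match_type eq 'strict';
    # unspecified: inflected forms only
    return $same_form ? 1 : 0;
}

my $term_word = { form => 'analysis', pos => 'NN',  lemma => 'analysis' };
my $token     = { form => 'analyses', pos => 'NNS', lemma => 'analysis' };

for my $type ('loose', 'strict', undef) {
    my $label = defined $type ? $type : 'unspecified';
    printf "%-11s => %d\n", $label, word_matches($term_word, $token, $type);
}
```

In this example only the loose setting accepts the plural "analyses" as an occurrence of the testified word "analysis", because the two share a lemma but differ in inflected form and POS tag.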
xmlout : display of the parsed term candidates in XML format. The value is 0 or 1.
termList : display of a list of terms and sub-terms along with their frequency. To display only term candidates containing more than one word (multi-word term candidates), specify the value multi. With the value all, or if any value is specified, all term candidates are displayed, both monolexical and multi-word.
printChunking : display of the corpus marked with phrases in an HTML file, along with an indication of whether or not they are term candidates. The value is 0 or 1.
TC-for-BioLG : annotation of the corpus with term candidates in an XML format compatible with the BioLG software. The value is 0 or 1.
TT-for-BioLG : annotation of the corpus with testified terms in an XML format compatible with the BioLG software (http://www.it.utu.fi/biolg/, a version of the Link Grammar Parser tuned for biology). The value is 0 or 1.
XML-corpus-for-BioLG : creation of a BioLG-compatible XML version of the corpus with PoS tags marked for each word. The value is 0 or 1.
debug : displays information on parsed phrases (i.e. term candidates) in a text format. The value is 0 or 1.
annotate-only : only annotate testified terms (no acquisition). The value is 0 or 1.
TTG-style-term-candidates : term candidates are displayed in TreeTagger output format. The term separator is the sentence boundary tag SENT. To extract only term candidates containing more than one word (multi-word term candidates), specify the value multi. With the value all, or if any value is specified, all term candidates are displayed, both monolexical and multi-word.
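The two sections can be pictured as a nested hash. This is only a sketch: the top-level key OPTIONS appears in the synopsis, but the exact structure returned by load_config, the DefaultConfig key layout, all paths and values, and the check_options helper are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical values; only the section and option names come from
# the documentation. The real structure returned by load_config may differ.
my %config = (
    DefaultConfig => {
        CONFIG_DIR => '/path/to/YaTeA/config',
        LOCALE_DIR => '/path/to/YaTeA/locale',
        RESULT_DIR => '/path/to/results',
    },
    OPTIONS => {
        language     => 'EN',
        suffix       => 'demo',
        'match-type' => 'loose',
        xmlout       => 1,
        termList     => 'multi',
    },
);

# Hypothetical sanity check mirroring the documented value ranges.
sub check_options {
    my ($options) = @_;
    my @errors;

    my %languages = map { $_ => 1 } qw(FR FR-Flemm EN);
    push @errors, "language must be FR, FR-Flemm or EN"
        unless defined $options->{language} && $languages{ $options->{language} };

    for my $flag (qw(monolexical-all monolexical-included xmlout
                     printChunking debug annotate-only)) {
        next unless exists $options->{$flag};
        push @errors, "$flag must be 0 or 1"
            unless $options->{$flag} =~ /^[01]$/;
    }

    if (exists $options->{'match-type'}) {
        push @errors, "match-type must be loose or strict"
            unless $options->{'match-type'} =~ /^(?:loose|strict)$/;
    }
    return @errors;
}

my @problems = check_options($config{OPTIONS});
print @problems ? join("\n", @problems) . "\n" : "options look consistent\n";
```

A check of this kind is not required by YaTeA; it simply restates the documented constraints (language values, 0 or 1 flags, match-type values) in executable form.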
Charlotte Roze defined the configuration files to process a corpus tagged with Flemm.
Wiktoria Golik, Robert Bossy and Claire Nédellec (MIG/INRA) corrected bugs and improved the mapping of testified terms.
Sophie Aubin and Thierry Hamon. Improving Term Extraction with Terminological Resources. In Advances in Natural Language Processing (5th International Conference on NLP, FinTAL 2006). pages 380-387. Tapio Salakoski, Filip Ginter, Sampo Pyysalo, Tapio Pahikkala (Eds). August 2006. LNAI 4139.
Thierry Hamon <thierry.hamon@univ-paris13.fr> and Sophie Aubin <sophie.aubin@lipn.univ-paris13.fr>
Copyright (C) 2005 by Thierry Hamon and Sophie Aubin
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.6 or, at your option, any later version of Perl 5 you may have available.