NAME

yatea - Perl script for extracting terms from a corpus and providing a syntactic analysis in a head-modifier format.

SYNOPSIS

yatea [options]

OPTIONS

--help brief help message
--man full documentation
--rcfile=file read the given configuration file

DESCRIPTION

YaTeA aims to extract noun phrases that look like terms from a corpus and provides their syntactic analysis in a head-modifier format. As input, the term extractor requires a corpus that has been segmented into words and sentences, lemmatized, and tagged with part-of-speech (POS) information. The implementation makes it possible to process large corpora. The data provided with YaTeA support term extraction from English and French texts, but new linguistic features can be integrated to extract terms from other languages. Linguistic features can also be modified or created for a sub-language or tagset.

The main strategy for analyzing term candidates relies on simple parsing patterns and endogenous disambiguation. Exogenous disambiguation is also available for identifying and analyzing term candidates; it uses external resources, i.e. lists of attested terms.

ENDOGENOUS AND EXOGENOUS DISAMBIGUATION

Endogenous disambiguation consists of exploiting intermediate chunking and parsing results to parse a given Maximal Noun Phrase (MNP). This feature allows complex noun phrases to be parsed using a limited number of simple parsing patterns (80 patterns containing at most 3 content words in the experiments reported in the paper listed under SEE ALSO). All the MNPs that match a parsing pattern are parsed first. In a second step, the remaining unparsed MNPs are processed using the results of the first step as islands of reliability. An island of reliability is a subsequence (contiguous or not) of an MNP that corresponds to a shorter term candidate parsed during the first step. This subsequence, along with its internal analysis, is used as an anchor in the parsing of the MNP. Islands simplify the POS sequence of an MNP for which no parsing pattern was found: the subsequence covered by the island is reduced to its syntactic head. In addition, islands increase the reliability of the parse.

For example, no parsing pattern is defined for the complete POS sequence "NN NN NN of NN", which corresponds to the term candidate (TC) "Northern blot analysis of cwlH". When no resource is provided, the progressive method is applied: the TC is bracketed from right to left, which results in a poor-quality analysis. When the island of reliability "northern blot analysis" is taken into account, the correct bracketing is found.
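
The reduction of an island of reliability to its syntactic head can be sketched in Python. This is a minimal illustration, not YaTeA's actual Perl code. The pattern set, the function names (find_island, reduce_with_island, pos_sequence), and the choice of "analysis" as the head of the island are assumptions made for the example, and only contiguous islands are handled.

```python
# Illustrative sketch only; this is NOT YaTeA's implementation.
# All names and the pattern set below are hypothetical.
from typing import List, Optional, Tuple

Token = Tuple[str, str]  # (word form, POS tag)

# Hypothetical set of POS sequences for which a parsing pattern exists.
# Following the notation of the text, the preposition is written "of".
PARSING_PATTERNS = {
    ("NN", "NN"),
    ("NN", "NN", "NN"),
    ("NN", "of", "NN"),
}


def pos_sequence(mnp: List[Token]) -> Tuple[str, ...]:
    """Return the POS sequence of a noun phrase."""
    return tuple(tag for _, tag in mnp)


def find_island(mnp: List[Token], island: List[str]) -> Optional[int]:
    """Return the start index of a contiguous island in the MNP, or None."""
    words = [word.lower() for word, _ in mnp]
    target = [word.lower() for word in island]
    for start in range(len(words) - len(target) + 1):
        if words[start:start + len(target)] == target:
            return start
    return None


def reduce_with_island(mnp: List[Token], island: List[str], head: str) -> List[Token]:
    """Replace the span covered by the island with its head token."""
    start = find_island(mnp, island)
    if start is None:
        return list(mnp)  # island not present: leave the MNP unchanged
    end = start + len(island)
    for token in mnp[start:end]:
        if token[0].lower() == head.lower():
            return mnp[:start] + [token] + mnp[end:]
    raise ValueError("head is not part of the island")


mnp = [("Northern", "NN"), ("blot", "NN"), ("analysis", "NN"),
       ("of", "of"), ("cwlH", "NN")]

# The full POS sequence matches no pattern in the hypothetical set.
full_match = pos_sequence(mnp) in PARSING_PATTERNS

# Assumption: "analysis" is the syntactic head of the island.
reduced = reduce_with_island(mnp, ["northern", "blot", "analysis"], "analysis")
reduced_match = pos_sequence(reduced) in PARSING_PATTERNS

print(pos_sequence(reduced))  # ('NN', 'of', 'NN')
```

In this sketch the reduced sequence "NN of NN" matches a pattern, so the MNP can be parsed with the island's internal analysis kept as a unit, which mirrors the bracketing argument made in the text.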

SEE ALSO

Sophie Aubin and Thierry Hamon. Improving Term Extraction with Terminological Resources. In Advances in Natural Language Processing (5th International Conference on NLP, FinTAL 2006). pages 380-387. Tapio Salakoski, Filip Ginter, Sampo Pyysalo, Tapio Pahikkala (Eds). August 2006. LNAI 4139.

AUTHORS

Thierry Hamon <thierry.hamon@lipn.univ-paris13.fr> and Sophie Aubin <sophie.aubin@lipn.univ-paris13.fr>

LICENSE

Copyright (C) 2005 by Thierry Hamon and Sophie Aubin

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.6 or, at your option, any later version of Perl 5 you may have available.