Treex::Tutorial::FirstSteps - First steps after installing Treex
The elementary unit of code in Treex is called a block. Each block should solve a well-defined and usually linguistically motivated task, e.g. tokenization, tagging or parsing.
A sequence of blocks is called a scenario, and it can describe an end-to-end NLP application, e.g. machine translation or preprocessing of a parallel treebank.
Treex applications can be executed from Perl. However, usually they are executed using the command line interface treex.
We will start traditionally with the "Hello, world!" example :-).
echo 'Hello, world!' | treex Read::Text language=en Write::Text language=en
The desired output was printed to STDOUT, but it is surrounded by info messages printed to STDERR. To filter out these messages, you can either use the --quiet option (-q) or the standard redirection of STDERR.
echo 'Hello, world!' | treex -q Read::Text language=en Write::Text language=en
echo 'Hello, world!' | treex Read::Text language=en Write::Text language=en 2>/dev/null
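The effect of the STDERR redirection can be tried even without Treex. The following is only a minimal sketch: noisy_tool is a hypothetical stand-in (it is not part of Treex) that writes an info message to STDERR and copies its input to STDOUT.

```shell
# noisy_tool is a hypothetical stand-in for any program that logs to STDERR
# and prints its real output to STDOUT (it is not part of Treex).
noisy_tool() {
    echo "INFO: processing document" >&2   # info message goes to STDERR
    cat                                     # the result goes to STDOUT
}

# Discard the info messages and keep only the result.
echo 'Hello, world!' | noisy_tool 2>/dev/null

# Do the opposite: keep only the info messages.
# The order matters: STDERR is first pointed to where STDOUT currently goes,
# and only then is STDOUT sent to /dev/null.
echo 'Hello, world!' | noisy_tool 2>&1 >/dev/null
```

The first command prints only the text, the second one prints only the info message.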
Read::Text language=en Write::Text language=en is a scenario definition. The scenario consists of two blocks: Read::Text and Write::Text. Each block has one parameter set: the name of the parameter is language and its value is en (which is the ISO 639-1 code for English).
One Treex document can contain sentences in several languages (which is useful for tasks like word alignment or machine translation), so it is necessary to tell each block which language it should operate on.
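To make the structure of a scenario definition more tangible, here is a small sketch that lists the blocks and their parameters. The function describe_scenario is hypothetical and much simpler than the real Treex scenario parser: it only assumes that any token containing = is a parameter of the preceding block.

```shell
# describe_scenario is a hypothetical helper, not a Treex command.
# Assumption: a token with "=" is a parameter, anything else is a block name.
describe_scenario() {
    for token in "$@"; do
        case "$token" in
            *=*) printf '  parameter: %s\n' "$token" ;;
            *)   printf 'block: %s\n' "$token" ;;
        esac
    done
}

describe_scenario Read::Text language=en Write::Text language=en
```

For the scenario above, it prints each of the two blocks followed by its language=en parameter.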
It is not necessary to repeat the same parameter specification for every block. You can use the special block Util::SetGlobal:
echo 'Hello, world!' | treex -q Util::SetGlobal language=en Read::Text Write::Text
Yes. (And I know the previous example was not actually shorter.) There is an option --language (-L), which is just a shortcut for the Util::SetGlobal block with the language parameter:
echo 'Hello, world!' | treex -q --language=en Read::Text Write::Text
echo 'Hello, world!' | treex -q -Len Read::Text Write::Text
The "Hello, world!" example is silly. The first block (the so-called reader) reads the plain text input and converts it to the Treex in-memory document representation. This document is then passed to the second block (the so-called writer), which converts it back to plain text and prints it to STDOUT. No (linguistic) processing was done.
There are readers and writers for various formats other than plain text (e.g. HTML, CoNLL, PennTB MRG, PDT PML), so you can use Treex for format conversions (see Treex::Tutorial::ReadersAndWriters). You can also create your own readers and writers for new formats (see Treex::Tutorial::WritingNewReaders).
For simplicity, we'll continue to use plain text format in this tutorial chapter, but we'll try to do something slightly more interesting.
To segment a text into sentences, we can use the block W2A::Segment and the writer Write::Sentences, which prints each sentence on a separate line.
echo "Hello! Mr. Brown, how are you?" \
 | treex -Len Read::Text W2A::Segment Write::Sentences
You can see that the text was segmented into three sentences: "Hello!", "Mr.", and "Brown, how are you?". The block W2A::Segment is language independent (at least for languages using the Latin alphabet) and finds sentence boundaries using only regex rules that detect sentence-final punctuation ([.?!]) followed by a capital letter. To get the correct segmentation, we must use W2A::EN::Segment, which has a list of English words (tokens) that usually do not end a sentence even if they are followed by a full stop and a capital letter. By the way, Treex is object-oriented: blocks are classes, and W2A::EN::Segment is a descendant of the W2A::Segment base class.
echo "Hello! Mr. Brown, how are you?" \
 | treex -Len Read::Text W2A::EN::Segment Write::Sentences
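The two segmentation strategies can be imitated with plain sed. This is only a rough sketch of the idea described above, not the actual code of W2A::Segment or W2A::EN::Segment: the function names and the tiny abbreviation list (Mr, Mrs, Dr, Prof) are assumptions made for illustration.

```shell
# Naive rule: break after [.?!] when a space and a capital letter follow.
naive_segment() {
    printf '%s\n' "$1" | sed 's/\([.?!]\) \([A-Z]\)/\1\
\2/g'
}

# Same rule, but a (hypothetical, tiny) list of abbreviations is protected first.
# "@@" serves as a temporary marker, so the input must not contain it.
abbrev_segment() {
    printf '%s\n' "$1" | sed -E \
        -e 's/(^|[^A-Za-z])(Mr|Mrs|Dr|Prof)\. /\1\2.@@/g' \
        -e 's/([.?!]) ([A-Z])/\1\
\2/g' \
        -e 's/@@/ /g'
}

naive_segment  "Hello! Mr. Brown, how are you?"
abbrev_segment "Hello! Mr. Brown, how are you?"
```

The naive version prints three lines (it breaks after "Mr."), while the version with the abbreviation list prints two.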
The blocks are actually Perl modules, and if you followed Tutorial::Install, you can find them in ~/perl5/lib/perl5/Treex/Block/. Generally, you can find the real location of a Perl module with perldoc -l. The full name of the W2A::EN::Segment module is actually Treex::Block::W2A::EN::Segment, but since the prefix Treex::Block:: is common to all blocks, it is not written in the scenario description. So the location of W2A::EN::Segment can be found with
perldoc -l Treex::Block::W2A::EN::Segment
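The mapping from a block name in a scenario to a file follows the usual Perl convention (:: becomes / and .pm is appended). A small sketch, with hypothetical helper names, shows the relative path you should expect under the library directory:

```shell
# Hypothetical helpers illustrating the naming convention; not Treex commands.

# Add the prefix that is omitted in scenario descriptions.
block_to_module() {
    printf 'Treex::Block::%s\n' "$1"
}

# Standard Perl convention: Foo::Bar lives in Foo/Bar.pm.
module_to_path() {
    printf '%s.pm\n' "$1" | sed 's|::|/|g'
}

module_to_path "$(block_to_module W2A::EN::Segment)"
```

This prints Treex/Block/W2A/EN/Segment.pm, which is relative to a library directory such as ~/perl5/lib/perl5/.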
All Treex blocks that do shallow linguistic analysis (segmentation, tokenization, lemmatization, PoS tagging, dependency parsing) are grouped in the directory W2A (W and A are the names of the two layers of language description). Language-specific blocks are stored in a subdirectory named with the uppercase ISO code of the given language (EN in this case).
If you have a file sample.txt with one sentence per line, you can load it into Treex using
cat sample.txt | treex -Len Read::Sentences ...
You have an input plain text file (e.g. data/news.txt) where each paragraph (including headlines) is on a separate line. Load this file into Treex and print one sentence per line. Note that headlines do not end with a full stop, but they should be treated as separate sentences.
HINT: See documentation of Treex::Block::W2A::Segment.
Try these scenarios and check the differences:
echo "Mr. Brown, we'll start tagging." |\
 treex -Len Read::Sentences W2A::TokenizeOnWhitespace Write::CoNLLX

echo "Mr. Brown, we'll start tagging." |\
 treex -Len Read::Sentences W2A::Tokenize Write::CoNLLX

echo "Mr. Brown, we'll start tagging." |\
 treex -Len Read::Sentences W2A::EN::Tokenize Write::CoNLLX

echo "Mr. Brown, we'll start tagging." |\
 treex -Len Read::Sentences W2A::EN::TagLinguaEn Write::CoNLLX
Now, the fourth column in CoNLLX format contains PoS (part-of-speech) tags, but the tokenization is different than with W2A::EN::Tokenize. The reason is that W2A::EN::TagLinguaEn is actually a thin wrapper for a popular Perl module Lingua::EN::Tagger, which does tokenization and tagging in one step.
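To see why tokenizers can disagree on the same sentence, compare two deliberately simple strategies. Both functions are hypothetical simplifications written for this illustration and do not reproduce the behavior of any of the Treex blocks above.

```shell
# Split on whitespace only: punctuation stays attached to the words.
whitespace_tokenize() {
    printf '%s\n' "$1" | tr -s ' ' '\n'
}

# Additionally separate the punctuation marks , . ! ? from the words.
# (A full stop after an abbreviation such as "Mr." is separated too.)
punct_tokenize() {
    printf '%s\n' "$1" | sed 's/[,.!?]/ & /g' | tr -s ' ' '\n'
}

whitespace_tokenize "Mr. Brown, we'll start tagging."
punct_tokenize      "Mr. Brown, we'll start tagging."
```

The first function yields 5 tokens, the second one 8, so any tagger running on top of them sees different input.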
We can try another tagger which is better suited for the modularity idea. Note that W2A::TagTreeTagger has not been released on CPAN yet (it is a wrapper for a binary of TreeTagger), and to use the Featurama tagger you must first install it as described in the Installation Guide.
echo "Mr. Brown, we'll start tagging." |\
 treex -Len Read::Sentences\
 W2A::EN::Tokenize\
 W2A::TagTreeTagger\
 W2A::EN::Lemmatize\
 Write::CoNLLX
Now, the third column contains lemmas, but the tags are not from the standard PennTB tagset. As a result, the lemmas for proper nouns are also lowercased (because W2A::EN::Lemmatize expects the NNP tag for proper nouns). For English and Czech, Treex offers a pre-trained model for the high-quality Featurama tagger. For many other languages, there are pre-trained TreeTagger models.
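When comparing taggers, it helps to look only at the relevant columns. In the CoNLL-X format the columns are tab-separated, with the word form in column 2, the lemma in column 3 and the coarse-grained PoS tag in column 4. The sketch below uses a hand-written sample (illustrative values, not real Treex output) and a hypothetical helper form_lemma_tag.

```shell
# Hand-written sample in CoNLL-X layout (illustrative values, not Treex output).
sample_conllx() {
    printf '1\tJohn\tJohn\tNNP\tNNP\t_\t2\tSBJ\t_\t_\n'
    printf '2\tloves\tlove\tVBZ\tVBZ\t_\t0\tROOT\t_\t_\n'
    printf '3\tMary\tMary\tNNP\tNNP\t_\t2\tOBJ\t_\t_\n'
}

# Print form/lemma/tag for every token line; blank lines between sentences
# have fewer than four fields and are skipped.
form_lemma_tag() {
    awk -F'\t' 'NF >= 4 { print $2 "/" $3 "/" $4 }'
}

sample_conllx | form_lemma_tag
```

The same filter can be appended to any of the pipelines above that end with Write::CoNLLX.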
echo "Mr. Brown, we'll start tagging." |\
 treex -Len Read::Sentences\
 W2A::EN::Tokenize\
 W2A::TagFeaturama\
 W2A::EN::Lemmatize\
 Write::CoNLLX

echo "Es tut mir leid." |\
 treex -Lde Read::Sentences W2A::Tokenize W2A::TagTreeTagger Write::CoNLLX

echo "Lo siento" |\
 treex -Les Read::Sentences W2A::Tokenize W2A::TagTreeTagger Write::CoNLLX

echo "Mi dispiace" |\
 treex -Lit Read::Sentences W2A::Tokenize W2A::TagTreeTagger Write::CoNLLX

echo "Je suis desolée" |\
 treex -Lfr Read::Sentences W2A::Tokenize W2A::TagTreeTagger Write::CoNLLX

echo "Bohužel jsem tento tutorial nedokončil." |\
 treex -Lcs Read::Sentences W2A::CS::Tokenize W2A::TagFeaturama Write::CoNLLX
If you succeeded in installing the Morče tagger, you can substitute
Implement a rule-based (or statistical if you dare) tokenizer for a language of your choice. We will be happy to give you write permission to Treex SVN, so you can commit the block.
HINT: Find the source code of W2A::EN::Tokenize, copy it to W2A/XY/Tokenize.pm (substitute XY for the uppercased ISO code of your language), and change the implementation of
echo "John loves Mary" |\
 treex -Len Read::Sentences\
 W2A::EN::TagLinguaEn\
 W2A::EN::Lemmatize\
 W2A::EN::ParseMSTperl\
 Write::CoNLLX
Try to use different taggers and find sentences where different tagging leads to different parsing. You can also try to use different parsers: W2A::EN::ParseMST (R. McDonald's original implementation), or W2A::EN::ParseMalt (use the parameter memory=1g), but those blocks (and wrappers for the Java implementation) have not been released on CPAN yet, so if you are not using the preinstalled Treex in SU2, you may need to install the parsers first.
The Treex native format *.treex.gz is actually gzipped XML. During the following section on readers and writers, look inside the files (zless my.treex.gz). Check what happens when lemmatization is added. Try to continue the analysis and add tagging and parsing. Visualize the individual steps using TrEd with
So far, we have printed all the results to STDOUT in CoNLLX format. You can easily forward the output to a file using standard redirection, but you can also specify the output file with the writer's parameter to:
echo "John loves Mary" | treex -Len Read::Sentences W2A::EN::TagLinguaEn\
 Write::CoNLLX to=my.conll
There is a safety check against accidentally overwriting files, so if you want to overwrite my.conll, you must use Write::CoNLLX to=my.conll clobber=1.
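The idea behind such a safety check can be sketched in a few lines of shell. The function write_to is a hypothetical helper written for this illustration, and nothing here is taken from the Treex writers themselves.

```shell
# write_to FILE [CLOBBER]: copy STDIN to FILE, refusing to overwrite an
# existing file unless CLOBBER is 1. Hypothetical helper, not Treex code.
write_to() {
    if [ -e "$1" ] && [ "${2:-0}" != 1 ]; then
        echo "write_to: $1 exists, pass 1 as the second argument to overwrite" >&2
        return 1
    fi
    cat > "$1"
}

demo_dir=$(mktemp -d)
echo first  | write_to "$demo_dir/my.conll"      # file is created
echo second | write_to "$demo_dir/my.conll"      # refused, file is kept
echo third  | write_to "$demo_dir/my.conll" 1    # overwritten on request
cat "$demo_dir/my.conll"
rm -r "$demo_dir"
```

The shell itself offers a similar protection for plain > redirections via set -o noclobber.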
Similarly, you can specify the input files with the reader's parameter from:
treex -Len Read::CoNLLX from=my.conll Write::Sentences
The from parameter can contain a list of comma- or space-separated files. If a file name starts with the @ character, it is interpreted as a file list with one file name per line. So you can do e.g.:
ls data/pcedt*.treex.gz > my.list
treex -Len Read::Treex from=@my.list Write::Sentences to=out.txt
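The rules for the from parameter can be mimicked by a short shell function. This is a sketch under simplifying assumptions (file names contain no spaces or wildcard characters); expand_from is a hypothetical name and the code is not taken from the Treex readers.

```shell
# expand_from VALUE: print one file name per line for a comma- or
# space-separated VALUE; items starting with @ are read as file lists.
expand_from() {
    # Turn commas into spaces and let the shell split the result into items.
    for item in $(printf '%s' "$1" | tr ',' ' '); do
        case "$item" in
            @*) cat "${item#@}" ;;          # @list: print the names it contains
            *)  printf '%s\n' "$item" ;;    # plain file name
        esac
    done
}

list=$(mktemp)
printf 'a.treex.gz\nb.treex.gz\n' > "$list"
expand_from "c.treex.gz,@$list"
rm -f "$list"
```

The call prints c.treex.gz followed by the two names stored in the list file.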
For the Treex file format (*.treex or *.treex.gz) there is a shortcut that automatically adds the reader to the beginning of the scenario.
treex -Len Write::Sentences -- data/pcedt*.treex.gz
You can use the treex CLI as a format converter.
treex -Len Read::CoNLLX from=my.conll Write::Treex to=my.treex.gz
Finally, there is another shortcut that allows you to modify treex files in situ.
treex -s -Len W2A::EN::Lemmatize -- my.treex.gz
# check that lemmas were added
treex -Len -q Write::CoNLLX -- my.treex.gz
Now you can try some programming tasks. Both templates (containing specifications) and solutions are provided in Treex::Block::Tutorial. In the SU2 lab, please update your local copy:
cp -r ~popem3am/treex_tutorial/Treex ~/treex_tutorial/
Martin Popel <email@example.com>
Copyright © 2011-2012 by Institute of Formal and Applied Linguistics, Charles University in Prague
This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.