Joerg Tiedemann > Lingua-Align > doc/index.pod

Download:
Lingua-Align-0.04.tar.gz

Annotate this POD

CPAN RT

New  1
Open  0
View/Report Bugs
Source  

Lingua::Align - a toolbox for Tree Alignment ^

Lingua::Align is a collection of command-line tools for automatic tree and word alignment of parallel corpora. The main purpose is to provide an experimental toolbox for experiments with various feature sets and alignment strategies. Alignment is based on local classification and alignment inference. The local classifier is typically trained on available aligned training data. We use a log-linear model for discriminative binary classification using the maximum entropy learning package megam (Hal Daume III).

Download

Lingua::Align is available from here: https://bitbucket.org/tiedemann/lingua-align

Installation

You can either install the perl modules and binaries as usual:

   perl Makefile.PL
   make
   make install

Or you can simply run the treealign script (and the other tools) in the bin/ directory without changing anything. The only requirement is a recent version of Perl and XML::Parser installed on your system (the Perl wrapper for the Expat XML parser).

The Tree Aligner calls an external tool (megam) which is provided as a pre-compiled binary in the bin/ directory. The default version is a i686 binary for Linux-based systems. The package also includes a binary for Intel-based Mac OS X. If you want to use this version, please change the link bin/megam to point to bin/megam.osx. For all other platforms please download the source from http://www.cs.utah.edu/~hal/megam/ and compile it on you platform. Make sure that the binary works and link it to bin/megam.

For some features you will need word alignment information. To produce these features you need to run tools such as Giza++ http://code.google.com/p/giza-pp/ and Moses http://statmt.org/moses/.

Quickstart Tutorial ^

The easiest way to use the Tree Aligner is to run the frontend script treealign in the bin directory. There are many options and command-line arguments that can be used to adjust the behaviour of the alignment tools. Have a look at Lingua::treealign for more information.

Run tests on existing data sets

For a simple test: go to the directory europarl and run make test.

  cd europarl
  make test

This will run a simple test with only a few training sentences from the Europarl corpus and simple features for classification. The test consists of two calls to tree aligner scripts: treealign is used to train a classifier and to align unseen sentences from the given data set. treealigneval is used to compute scores of the alignment performed with that model and the alignment strategy that is chosen.

The example training data is stored in europarl/nl-en-weak_125.xml which has been produced by manual alignment (thanks to Gideon Kotzé) using the Stockholm Tree Aligner http://kitt.cl.uzh.ch/kitt/treealigner. The format looks like this:

 <?xml version="1.0" encoding="UTF-8"?>
 <treealign subversion="3" version="2">
 <head>
 ...
   <treebanks>
     <treebank id="en" filename="ep-00-12-15.125.en.tiger"/>
     <treebank id="nl" filename="ep-00-12-15.125.nl.tiger"/>
   </treebanks>
 ...
 </head>
 <alignments>
   <align type="good" last_change="2010-03-29" author="Gideon">
     <node treebank_id="en" node_id="s5_501"/>
     <node treebank_id="nl" node_id="s10_0"/>
   </align>
   <align type="good" last_change="2010-03-29" author="Gideon">
     <node treebank_id="en" node_id="s5_502"/>
     <node treebank_id="nl" node_id="s10_1"/>
   </align>
 ...

The actual treebank data is stored in TigerXML (in this case) and links are pointers to these documents using the unique node IDs. This should be quite straightforward (looking at the example above). Other formats are also supported, for example, Penn Treebank format and AlpinoXML for storing treebanks. There is also support for other alignment formats like the tree alignment format used by the Dublin Subtree Aligner and word alignment formats used by Giza++, Moses and shared tasks on word alignment (WPT2003).

Run with your own settings

Basically you can call the tree-aligner frontend with your own data and settings like this:

  treealign -a <ALIGNFILE> -f <FEATURES> -n <NR_TRAIN_SENT> -e <NR_TEST_SENT>

The alignment file ALIGNFILE has to contain the tree alignments that will be used for training the classifier. The default format is the one explained above (similar to the one used by the Stockholm Tree Aligner). FEATURES is a string specifying the features to be used in classification. NR_TRAIN_SENT is the number of sentences to be used for training and NR_TEST_SENT is the number of test sentences. There are many more options that can be set on the command line. Please look at Lingua::treealign for more information.

Of course it is also possible to align treebanks using an existing alignment model. The only thing you need are the treebank files in both languages which have to be sentence aligned. Assuming that the alignment model is stored in the default file 'treealign.megam' and the two treebanks (ep-00-12-15.125.en.penn, ep-00-12-15.125.nl.penn from the sample files in europarl/) are stored in bracketed Penn Treebank format you can call the aligner like this:

  treealign -s ep-00-12-15.125.en.penn -S penn \
            -t ep-00-12-15.125.nl.penn -T penn \
            -m treealign.megam > alignments

This will assume that trees from both treebanks are aligned with each other in the same order as they appear in the given files (corresponding lines in this case). Features to be used for classification have to be stored in treealign.megam.feat (they should be if treealign.megam has been produced by Lingua-Align). Tree alignments will be stored in alignments in STA format.

Here are some more details about the things you need for running your own experiments:

Training data

Your own tree-aligned training data. The easiest way is to use the Stockholm Tree Aligner. The format produced by this tool can directly be used by Lingua::Align. You need at least 100 pairs of parse trees in order to obtain reasonable results. More is better of course. The corpus has to be parsed on both sides. You need to use TigerXML for the Stockholm Tree Aligner and this is also most convenient for the tree aligner later on (also for visualizing automatic alignment).

There is a tool to convert treebanks using the formats supported by Lingua::Align: bin/convert_treebank. For example, if your parse trees are stored in Penn Treebank format (treebank) you might try to use the following command:

  convert_treebank treebank penn tiger > treebank.tiger

Hopefully this will work to create a corpus that can be loaded into the Stockholm Tree Aligner. You might have to check the specifications in the XML header and adjust some (meta) information. You can, for example, validate your Tiger-XML against the schema:

  xmllint --schema http://www.cl.uzh.ch/kitt/treealigner/data/schema/TigerXML.xsd --noout <your-tiger-file>

You can also use another format which is similar to the one used by the Dublin Tree aligner which applies a bracketed format for tree structures and links in terms of references to the nodes in these trees. Here is an example (from europarl/nl-en_125.dublin):

 (ROOT-1 (S-2 (VP-3 (VBP-4 Are)(RB-5 there)(NP-6 (DT-7 any)(NNS-8 comments)))(.-9 ?)))
 (top-1 (np-2 (det-3 Geen)(noun-4 bezwaren)(punct-5 ?)))
 1 1 6 2 7 3 8 4 9 5

The first row is the source language tree, the second one is the target language row and the third one contains the links between source and target nodes. This format is not entirely compatible with the Dublin Tree Aligner format as it does not support conflated unary productions.

If you like to use this format for your training data you can call the tree-aligner script with an extra parameter (-A) specifying the alignment format, for example:

   treealign -a nl-en_125.dublin -A dublin -f catpos -n 10 -e 10 > align

Features & parameters

First of all, you need to decide what kind of features should be used in the classifier model. Quite a lot of features are supported by Lingua::Align and you can easily add new ones. Look at Lingua::Align::Features for more information on classification features. Features are given as a list of features type names separated by ':' using the command-line flag '-f'. Here are some simple exampels assuming that the training corpus is stored in Stockholm Tree Aligner format in a file called nl-en_125.xml (from europarl/):

  treealign -a nl-en_125.xml -f catpos -n 10 -e 10 > aligned.xml

This will train a model on the first 10 tree pairs using the catpos feature (pairs of category or POS labels) and then align the following 10 tree pairs. The classification model is stored in the default file treealign.megam. Have a look at the parameters if you like (it's just a plain text file). For other training options default settings are used (check the Lingua::treealign script for more details). Also for alignment standard settings are applied. The tree aligner will perform a two-step strategy using local classification and a greedy alignment inference.

The result is printed to STDOUT and piped to aligned.xml in the example above. Use the following command to evaluate the alignment just done:

  treealigneval nl-en_125.xml aligned.xml

This should give you very low precision and recall values (around 25%; Note that individual recall values for terminal nodes and non-terminal nodes are not correct because node type information is not available in the gold standard and IDs do not follow the standard to make a clear distinction).

Another simple example using two feature types ("tree level similarity" and "tree span similarity") is the following:

  treealign -a nl-en_125.xml -f treelevelsim:treespansim -n 10 -e 10 > aligned.xml
  treealigneval nl-en_125.xml aligned.xml

The classification model will look something like this:

  **BIAS** -8.23925781250000000000
  treespansim 3.54736995697021484375
  treelevelsim 3.76471185684204101562

Still, these features are not very informative and the scores will be still very low. Try now a combination of the three feature types mentioned above:

  treealign -a nl-en_125.xml -f treelevelsim:treespansim:catpos \
       -n 10 -e 10 > aligned.xml
  treealigneval nl-en_125.xml aligned.xml

This will give you much better results already (around 50% F-scores).

Now you can start to experiment with contextual features, for example, catpos features of parent and children nodes:

  treealign -a nl-en_125.xml \
     -f treelevelsim:treespansim:catpos:parent_catpos:children_catpos \
     -n 10 -e 10 > aligned.xml
  treealigneval nl-en_125.xml aligned.xml

And, surprise, this gives you another improvement (around 60% F-score). You can also get features from neighboring nodes using 'sister_' and 'neighborXY_' as prefix. With 'sister_' features will be extracted from ALL sister nodes, i.e. nodes that have the same parent. In case of real-valued features the average (arithmetic mean) of the feature values of these sister nodes will be used. For binary feature templates (for example 'catpos') all of them will be included. (This is exactly the same behaviour for 'children_').

The 'neighborXY_' prefix is more flexible. You can specify neighbors using X as the distance in the source language tree and Y as the distance in the target language. Negative values will be interpreted as left neighbors and positive values (don't use '+'!) for neighbors to the right. For terminal nodes: All surface words will be considered for retrieving neighbors. For nonterminals: only neighboring nodes with the same parent as the current node will be considered! Observe that the distances have to be less than 10 because the pattern only allows single digits! Here is an example for the use of neighbor feature:

  treealign -a nl-en_125.xml \
     -f catpos:neighbor-10_catpos:neighbor-11_catpos \
     -n 10 -e 10 > aligned.xml
  treealigneval nl-en_125.xml aligned.xml

This retrieves the 'catpos' feature from the current node pair, from the left source tree neighbor together with the current target node, and from the left source tree neighbor together with the right target node neighbor.

Note that these models so far do not use any other information than the features directly extracted from the parse trees and the alignment information available in the training data. There are also features that need external resources. For example, you may include word alignment information for the tree alignment. For this you need to run automatic word alignment first (on the treebank sentences you're using in your experiments) and you need to store the information in the format supported by Lingua::Align. You may use the Viterbi alignment produced by Giza++ (moses/giza.src-trg/src-trg.A3.final.gz and moses/giza.trg-src/trg-src.A3.final.gz):

  treealign -a nl-en_125.xml \
     -g moses/giza.src-trg/src-trg.A3.final.gz \
     -G moses/giza.trg-src/trg-src.A3.final.gz \
     -f gizae2f:gizaf2e \
     -n 10 -e 10 > aligned.xml
  treealigneval nl-en_125.xml aligned.xml

You can see how effective these features are for tree alignment (well, at least they give you already around 55% F-scores with the tiny training data we are using in our examples). Of course, you can use word alignment features from context nodes as well (giving you around 65% F-scores):

  treealign -a nl-en_125.xml \
     -g moses/giza.src-trg/src-trg.A3.final.gz \
     -G moses/giza.trg-src/trg-src.A3.final.gz \
     -f gizae2f:gizaf2e:parent_giza:children_giza \
     -n 10 -e 10 > aligned.xml
  treealigneval nl-en_125.xml aligned.xml

Note that we use a combination of gizae2f and gizaf2e for the context nodes. Now try a combination of all features we mentioned so far. You should get a decent score of around 74% F-score. Nice, isn't it?

Another word alignment feature is based on the symmetrized alignments produced by Moses. Use them in the following way:

  treealign -a nl-en_125.xml \
     -y moses/model/aligned.intersect \
     -f moses:parent_moses:children_moses \
     -n 10 -e 10 > aligned.xml
  treealigneval nl-en_125.xml aligned.xml

Don't ask me why the parameter is '-y' for the Moses alignment file. (It's basically because treealign only uses short command-line options and I was running out of letters ....)

Actually, you could leave out the file specifications in the examples above because we were just using the default names and paths. You can use the flag (-M moses-dir) if the file-names and sub-directories are the same but the main Moses work-directory is different (for example my-moses-dir):

  treealign -a nl-en_125.xml \
     -M my-moses-dir \
     -f moses:parent_moses:gizaf2e \
     -n 10 -e 10 > aligned.xml
  treealigneval nl-en_125.xml aligned.xml

Finally, we should introduce history features. For now we just did local classification without considering alignment decisions on other nodes. The classifier can also be trained with so-called history features -- features based on previous decisions. Using such features will force the tree aligner to use a sequential classification procedure, either bottom-up or top-down. In top-down classification will start with the root-nodes and the classifier uses alignment decisions on the parent nodes as additional features. You can use these so-called parent features like this (adding the flag -P to the command-line):

  treealign -a nl-en_125.xml -f moses:gizae2f:gizaf2e \
            -n 10 -e 10 -P > aligned.xml

Compare this to the alignment without the -P flag and you will see the difference when running evaluation. In bottom-up classification, two types of history features are supported: proportion of links between immediate children nodes (-C) and proportion of links between all children nodes in the entire subtrees (-U).

  treealign -a nl-en_125.xml -f moses:gizae2f:gizaf2e \
            -n 10 -e 10 -C -U > aligned.xml

Note that history features coming from parent links and coming from children cannot be combined (for obvious reasons). And don't expect improvements in all cases. Especially for rich feature sets no big improvements can be expected. Note that alignment will also be (even) slower.

Alignment strategies

In the default settings a two-step procedure is used: First all node pairs are classified using the local classifier, possibly including history features. The second step comprises the actual alignment step (inference) in which nodes are linked to each other according to the link likelihoods assigned by the local classifier in the first step. The default strategy is a "greedy" alignment procedure, starting with the node pair with the highest link likelihood and running greedily through the set of candidates. A necessary constraint is that all nodes are aligned at most once (on both sides).

You can use other strategies for example using an additional well-formedness constraint:

  treealign -a nl-en_125.xml \
            -y moses/model/aligned.intersect \
            -f moses:parent_moses:children_moses \
            -l GreedyWellformed \
            -n 10 -e 10 -C -U > aligned.xml

Compare this to the result obtained with the standard strategy (-l greedy). Another common technique is to use graph-theoretic algorithms modeling tree alignment as a maximum weighted bipartite matching problem. Lingua::Align includes a free implementation of the Hungarian algorithm (Kuhn-Munkres) that solves this problem in polynomial time.

  treealign -a nl-en_125.xml \
            -y moses/model/aligned.intersect \
            -f moses:parent_moses:children_moses \
            -l munkres \
            -n 10 -e 10 -C -U > aligned.xml

Several other inference strategies can be used. The documentation of the ones included in Lingua::Align is still rather unexisting. Look at the code in the module Lingua::Align::LinkSearch for more information.

You can also do without simply using the decisions of the local classifier (default: scores above 0.5 indicate a link):

   treealign -a nl-en_125.xml -f moses:gizae2f:gizaf2e \
             -n 10 -e 10 -P -l threshold > aligned.xml

For simple feature sets these scores will be much lower. Alignment constraints such as the one-to-one link constraint and well-formedness of links are important in those cases. For richer feature sets this difference fades away.

One problem with the greedy strategies is that alignment is slow because all node pairs have to be considered as candidates for classification (and alignment). This is because "feature extraction" is actually the bottle neck in the entire alignment procedure (not classification nor alignment inference). There is a way to speed this up by combining local classification with a greedy alignment strategy. This is (again) called 'bottom-up' alignment but this time using classifier scores immediately for establishing links between nodes. Alignment starts at the leaf nodes and each node pair that receives a score above 0.5 will be aligned immediately (and not considered aferwards anymore). After this greedy bottom-up procedure the chosen alignment inference strategy will be used for the remaining unlinked nodes. Use the option -b bottom-up):

   treealign -a nl-en_125.xml -f moses:gizae2f:gizaf2e -n 10 -e 10 -C -U -v \
             -l GreedyWellformed -b bottom-up > aligned.xml

Observe that we can use history features again (but not -P). This should speed-up the alignment process a bit (not that much as you might have expected ...). You can get information about the runtime by including the verbose output flag (see above -v).

Library structure

There are several options that can be set. For further information have a look at the manpages linked below or just look at the source code. Extending the code is quite straightforward even though the documentation is not perfect and the code is partially awful (well, it's Perl ....). Here is a (hopefully up-to-date) list of modules (many of them are under-developed / experimental / non-functioning possible projects for the future):

top-level modules:
  Lingua/Align.pm
  Lingua/Align/Trees.pm
modules for feature extraction
  Lingua/Align/Features.pm
  Lingua/Align/Features/Cooccurrence.pm
  Lingua/Align/Features/Lexical.pm
  Lingua/Align/Features/Alignment.pm
  Lingua/Align/Features/Tree.pm
  Lingua/Align/Features/Orthography.pm
  Lingua/Align/Features/History.pm
modules for classification
  Lingua/Align/Classifier.pm
  Lingua/Align/Classifier/Megam.pm
  Lingua/Align/Classifier/Clues.pm
  Lingua/Align/Classifier/LibSVM.pm
modules for alignment inference
  Lingua/Align/LinkSearch.pm
  Lingua/Align/LinkSearch/Assignment.pm
  Lingua/Align/LinkSearch/AssignmentWellFormed.pm
  Algorithm/Munkres.pm
  Lingua/Align/LinkSearch/Cascaded.pm
  Lingua/Align/LinkSearch/GreedyFinalAnd.pm
  Lingua/Align/LinkSearch/GreedyFinal.pm
  Lingua/Align/LinkSearch/Greedy.pm
  Lingua/Align/LinkSearch/GreedyWellFormed.pm
  Lingua/Align/LinkSearch/Intersection.pm
  Lingua/Align/LinkSearch/NTFirst.pm
  Lingua/Align/LinkSearch/NTonly.pm
  Lingua/Align/LinkSearch/PaCoMT.pm
  Lingua/Align/LinkSearch/Src2Trg.pm
  Lingua/Align/LinkSearch/Src2TrgWellFormed.pm
  Lingua/Align/LinkSearch/Threshold.pm
  Lingua/Align/LinkSearch/Tonly.pm
  Lingua/Align/LinkSearch/Trg2Src.pm
  Lingua/Align/LinkSearch/Viterbi.pm
modules for data manipulation
  Lingua/Align/Corpus.pm
  Lingua/Align/Corpus/Treebank.pm
  Lingua/Align/Corpus/Treebank/AlpinoXML.pm
  Lingua/Align/Corpus/Treebank/Penn.pm
  Lingua/Align/Corpus/Treebank/Stanford.pm
  Lingua/Align/Corpus/Treebank/TigerXML.pm
  Lingua/Align/Corpus/Parallel/Bitext.pm
  Lingua/Align/Corpus/Parallel/Dublin.pm
  Lingua/Align/Corpus/Parallel/Giza.pm
  Lingua/Align/Corpus/Parallel/Moses.pm
  Lingua/Align/Corpus/Parallel/OPUS.pm
  Lingua/Align/Corpus/Parallel/OrderedIds.pm
  Lingua/Align/Corpus/Parallel.pm
  Lingua/Align/Corpus/Parallel/STA.pm
  Lingua/Align/Corpus/Parallel/WPT.pm
  Lingua/Align/Corpus/Factored.pm

How to do word alignment ^

Lingua::Align can, of course, also be used for word alignment. It is straightforward if you have parse trees available. Then you can just specify the flag -L (leafs only) to only consider terminal nodes during training and alignment. (Note that you can also align non-terminal nodes only using the flag -N and if you use both flags -N -L only nodes of the same type will be aligned).

Furthermore, you can also use the software to do word alignment on plain text files (this is still quite experimental). Look at the example in europarl/wpt03 to see how to run the aligner. Again, you need some training data and you have to specify some features to be used for classification. Training data can be in the format of the shared task on word alignment WPT 2003/2005 (http://www.cse.unt.edu/~rada/wpt/):

  0008 4 2 S
  0008 1 1 P
  0008 2 1 P
  0008 3 1 P

As features you may use, for example, string similarity measures such as LCSR score (longest common sub-sequence ratio), Dice scores based on co-occurrence frequencies, Moses/Giza++ alignments, binary features such as the occurrence of suffix pairs etc. Run make in the europarl/wpt03 to see an example alignment experiment.

To run your own experiments you can specify your own setup. Here is a simple example:

  treealign -a test.wa.nullalign -A wpt \
            -s test.e -S text \
            -t test.f -T text \
            -f lcsr=3:suffix=4:treespansim \
            -n 20 -e 20 -L > aligned.xml

This uses the file test.wa.nullalign for training and testing which is in WPT format (-A) and aligns source languages texts (test.e) to the target language texts (test.f), both in plain text format. The features are string similarity (LCSR) between tokens that are at least 3 characters long, pairs of 4-character suffixes and "tree span similarity", which is in case of word alignment the relative position difference between the token witin the sentences.

For evaluation you can use the standard evaluation script just specifying that the gold standard is in WPT format:

  treealigneval -g wpt test.wa.nullalign aligned.xml

If you want to use Moses/Giza++ alignments as features: Just use the same parameters as for tree alignment.

  treealign -a test.wa.nullalign -A wpt \
            -s test.e -S text \
            -t test.f -T text \
            -g moses/giza.e-f/A3.final.447.gz \
            -G moses/giza.f-e/A3.final.447.gz \
            -y moses/model/aligned.grow-diag-final.447 \
            -f moses:gizae2f:gizaf2e
            -n 20 -e 20 -L > aligned.xml

Another common feature is co-occurrence which can be measured in various ways. You can use the script bin/coocfreq to generate co-occurrence frequencies from arbitrary parallel corpora that can be plugged into the aligner as a feature. An example computing co-occurrence frequencies from tokens in the test corpus (which is much too small to compute reliable scores) is the following:

  coocfreq -s test.e -t test.f \
           -x word -y word \
           -f word.src -e word.trg -c word.cooc

This uses the parallel corpus test.e and test.f (in Moses/Giza++ plain text format -- corresponding lines are aligned to each other) to count frequencies that will be stored in word.src (source language tokens) word.trg (target language tokens) and word.cooc (co-occurrence frequencies). These scores can then be used in the aligner as a feature:

  treealign -a test.wa.nullalign -A wpt \
            -s test.e -S text \
            -t test.f -T text \
            -f dice=word.cooc \
            -n 20 -e 20 -L > aligned.xml

Don't expect too much as these Dice scores are not reliable from such a small corpus! Of course, you can combine these scores with any other feature as described above.

Co-occurrence frequencies can be computed for various kinds of features and feature combinations. For example, you can compute frequencies of word suffixes with the following command:

  coocfreq -s test.e -t test.f \
           -x suffix=4 -y suffix=4 \
           -f suffix.src -e suffix.trg -c suffix.cooc

In order to use several Dice scores in alignment you can give these feature types different names (they have to start with 'dice'):

  treealign -a test.wa.nullalign -A wpt \
            -s test.e -S text \
            -t test.f -T text \
            -f diceword=word.cooc:dicesuffix=suffix.cooc \
            -n 20 -e 20 -L > aligned.xml

It is maybe worth mentioning that these feature types (Dice, LCSR, suffix-pairs, etc) also can be used for tree alignment as explained earlier. Especially Dice scores can also be calculated for any feature connected to arbitrary nodes in a tree. Examples of such co-occurrence measures can be seen in the smultron/ directory. Here is an example for computing co-occurrence frequencies for POS labels and parent category labels from parse tree pairs:

  coocfreq -a sophie.xml -A sta \
           -x pos:parent_cat -y pos:parent_cat \
           -f pospcat.src -e pospcat.trg -c pospcat.cooc

In tree alignment it would also make sense to use the contextual co-occurrence features, for example, -f parent_dicecat=cat.cooc (if cat.cooc includes the co-occurrence frequencies of category labels).

Finally, you can also visualize word alignment using a little tool in the bin directory of Lingua::Align:

  compare_wordalign.pl -A corpus \
                -b wpt -B corpus -S test.e -T test.f \
                aligned.xml test.wa.nullalign

This will print link matrices comparing the proposed links with the gold standard links. It also computes cumulative evaluation measures (precision, recall, AER). It looks like this:

  compare word alignments for: 0157 -- 0157
  -----------------------------------------------|--
                                   · · · · · ·   | he
                                   · · · · · ·   | said
                                   · · · · · ·   | that
   S                                             | if
     S                                           | we
       S                                         | use
         · S                                     | unemployment
             z ·                                 | as
             · · · ·                 *           | the
             · · · ·               *             | solution
             · z · ·                             | to
               · · S                             | inflation
                     S                           | ,
                       S · · · ·                 | we
                       · z · · ·                 | will
                       · · z · ·                 | get
                       · · · · P                 | recovery
                                               S | .
  -----------------------------------------------|--
   s n u l c p c l i , n p r l é , v l p d l g .
   i o t e h o o e n   o o e e c   o a o e e o  
     u i   ô u m   f   u u l   o   i   s     u  
     s l   m r b   l   s r a   n   l   i     v  
       i   a   a   a     r n   o   à   t     e  
       s   g   t   t     o c   m       i     r  
       o   e   t   i     n e   i       o     n  
       n       r   o     s r   e       n     e  
       s       e   n                         m  
                                             e  
                                             n  
                                             t  
     8 x (S) .... proposed = gold = S
     1 x (P) .... proposed = gold = P
     4 x (z) .... proposed = P, gold = S (ok!)
     0 x (d) .... proposed = S, gold = P (ok!)
  !  2 x (*) .... proposed = S, gold = not aligned (wrong!)
     0 x (+) .... proposed = P, gold = not aligned (wrong!)
     0 x (-) .... proposed = not aligned, gold = S (missing!)
    49 x (·) .... proposed = not aligned, gold = P (missing!)
  ----------------------------------------------------------------  
  total: 13 correct, 49 missing, 2 wrong
  this sentence: precision = 0.8667, recall = 1.0000, AER = 0.0741
        average: precision = 0.8338, recall = 0.9613, AER = 0.1150
          total: precision = 0.8526, recall = 0.9695, AER = 0.0941

Documentation & References ^

There are several man-pages generated from the "pod" information in the Perl modules and scripts included in Lingua::Align. Look at the following files:

Lingua::treealign

Lingua::treealigneval

Lingua::Align

Lingua::Align::Trees

Lingua::Align::Features

Lingua::Align::LinkSearch

Lingua::Align::Corpus

Lingua::coocfreq

Lingua::convert_treebank

Lingua::sta2moses

Lingua::sta2phrases

Here are some publications (please cite if you use the software):

Tiedemann, J. (2010)

Lingua-Align: An Experimental Toolbox for Automatic Tree-to-Tree Alignment. In Proceedings of the 7th International Conference on Language Resources and Evaluation (LREC'2010), 2010. http://stp.lingfil.uu.se/~joerg/published/lrec2010.pdf

  @InProceedings{Tiedemann:LREC10,
    author =     {Jörg Tiedemann},
    title =      {Lingua-Align: An Experimental Toolbox for Automatic
                  Tree-to-Tree Alignment},
    booktitle =  {Proceedings of the 7th International Conference on
                  Language Resources and Evaluation (LREC'2010)},
    year =       2010,
    address =    {Valetta, Malta},
  }
Tiedemann, J. and Kotzé, G. (2009)

Building a Large Machine-Aligned Parallel Treebank. In Proceedings of the 8th International Workshop on Treebanks and Linguistic Theories (TLT'08), pages 197-208, EDUCatt, Milano/Italy, 2009. http://stp.lingfil.uu.se/~joerg/published/tlt09.pdf

 @InProceedings{TiedemannKotze:TLT09,
   author =      {Jörg Tiedemann and Gideon Kotzé},
   title =       {Building a Large Machine-Aligned Parallel Treebank},
   booktitle =   {Proceedings of the 8th International Workshop on
                  Treebanks and Linguistic Theories (TLT'08)},
   year =        2009,
   pages =        {197--208},
   isbn =        {978-88-8311-712-1},
   editor =       {Marco Passarotti and Adam Przepiórkowski and 
                  Savina Raynaud and Frank Van Eynde},
   publisher =    {EDUCatt, Milano/Italy}
 }
Tiedemann, J. and Kotzé, G. (2009)

A Discriminative Approach to Tree Alignment. In Proceedings of the International Workshop on Natural Language Processing Methods and Corpora in Translation, Lexicography and Language Learning (in connection with RANLP'09), pages 33 - 39, 2009. http://stp.lingfil.uu.se/~joerg/published/ranlp09_tree.pdf

 @InProceedings{TiedemannKotze:RANLP09,
   author =      {Jörg Tiedemann and Gideon Kotzé},
   title =       {A Discriminative Approach to Tree Alignment},
   booktitle =   {Proceedings of the International Workshop on Natural
                  Language Processing Methods and Corpora in
                  Translation, Lexicography and Language Learning (in
                  connection with RANLP'09)},
   pages =       {33 -- 39},
   year =        2009,
   editor =      {Iustina Ilisei and Viktor Pekar and Silvia
                  Bernardini},
   isbn =        {978-954-452-010-6}
 }

Author ^

Joerg Tiedemann, <jorg.tiedemann@lingfil.uu.se>

Copyright and License ^

Copyright (C) 2009, 2010 by Joerg Tiedemann, Gideon Kotzé

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.8 or, at your option, any later version of Perl 5 you may have available.

Copyright for MegaM by Hal Daume III see http://www.cs.utah.edu/~hal/megam/ for more information Paper: Notes on CG and LM-BFGS Optimization of Logistic Regression, 2004 http://www.cs.utah.edu/~hal/docs/daume04cg-bfgs.pdf

syntax highlighting: