NAME

processAlignment_alSet-version.pl - apply a function to each alignment of the Alignment Set

SYNOPSIS

perl processAlignment_alSet-version.pl [options] required_arguments

Required arguments:

    -ist FILENAME    Input source-to-target links file
    -if BLINKER|GIZA|NAACL    Input file(s) format (required if not TALP)
    -ost FILENAME    Output source-to-target links file
    -of BLINKER|GIZA|NAACL    Output file(s) format (required if not TALP)
    -sub SUBROUTINE    Subroutine name (package::subroutine)
    (such as Lingua::Alignment::forceGroupConsistency, swapSourceTarget, intersect, getUnion)
    If the subroutine needs arguments: -sub SUBROUTINE -sub ARG_1 -sub ARG_2 etc.
    See the manual for more subroutines (-man option)
  

Options:

        -is FILENAME    Input source words file
        -it FILENAME    Input target words file
        -its FILENAME    Input target-to-source links file
        -os FILENAME    Output source words file
        -ot FILENAME    Output target words file
        -ots FILENAME    Output target-to-source links file
        -range BEGIN-END    Input Alignment Set range
        -alignMode as-is|null-align|no-null-align    Alignment mode
        -help|?    Prints the help and exits
        -man    Prints the manual and exits

ARGUMENTS

--ist,--i_st,--i_sourceToTarget FILENAME

Input source-to-target (i.e. links) file name (or directory, in case of BLINKER format)

--if,--i_format BLINKER|GIZA|NAACL

Input Alignment Set format (required if different from default, TALP).

--ost,--o_st,--o_sourceToTarget FILENAME

Output (new format) source-to-target (i.e. links) file name (or directory, in case of BLINKER format)

--of,--o_format BLINKER|GIZA|NAACL

Output (new) Alignment Set format (required if different from default, TALP)

--sub,--alignmentSub SUBROUTINE --sub,--alignmentSub ARG_1 etc.

Name of the subroutine to be applied to each alignment of the Alignment Set. If the subroutine takes arguments, repeat this option once per argument (omitting the reference to the Alignment object), in the order the subroutine expects them. For instance, a call to MySub with two arguments arg1 and arg2 would look like:

--sub MySub --sub arg1 --sub arg2

The Lingua::Alignment module (Alignment.pm) provides the following functions, among others:

<Lingua::Alignment::forceGroupConsistency>

Prohibits situations of the type {if linked(e,f) and linked(e',f) and linked(e',f') but not linked(e,f')} by linking e and f'
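The consistency rule above can be read as a closure over the set of links. The following is a minimal Python sketch of that reading, not the module's Perl implementation; the function name force_group_consistency and the representation of links as (source, target) position pairs are assumptions made for illustration.

```python
def force_group_consistency(links):
    """Add links until no (e, f), (e', f), (e', f') triple lacks (e, f').

    `links` is an iterable of (source_position, target_position) pairs
    (an assumed representation, chosen for this sketch).
    """
    closed = set(links)
    changed = True
    while changed:
        changed = False
        for (e, f) in list(closed):
            for (e2, f2) in list(closed):
                # e2 is linked to both f and f2, but e is not linked to f2
                if (e2, f) in closed and (e, f2) not in closed:
                    closed.add((e, f2))
                    changed = True
    return closed


if __name__ == "__main__":
    print(sorted(force_group_consistency({(1, 1), (2, 1), (2, 2)})))
```

With the links (1 1), (2 1) and (2 2), the sketch adds the missing link (1 2); a set that is already consistent is returned unchanged.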

<Lingua::Alignment::swapSourceTarget>

Swaps source and target in the alignments: a link (6 3) becomes (3 6)
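As an illustration of the effect on the links only, here is a small Python sketch; it assumes links are held as (source, target) pairs and uses a hypothetical function name. The real subroutine operates on the Alignment object, which this sketch does not model.

```python
def swap_source_target(links):
    """Return the links with source and target positions exchanged."""
    return [(target, source) for (source, target) in links]


if __name__ == "__main__":
    print(swap_source_target([(6, 3), (1, 2)]))
```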

<Lingua::Alignment::regexpReplace>

Replaces, on one side of the corpus, a string (matched by a regular expression) with another and updates the links accordingly. It takes 3 arguments: the regular expression pattern, the replacement, and the side (source or target); see the examples in the manual. Notes:

  • When several words are deleted, all added words are linked to all positions to which the deleted words were linked. The $al->{sourceLinks} information can be lost for replaced words.

  • The regexp is applied to one side of the corpus, and the smallest set of additions and deletions needed to turn the original word sequence into the modified one is computed using Algorithm::Diff. In practice this set is not always minimal; in those cases several words are replaced by several others, so links may change. To avoid this problem, use the replaceWords subroutine.

  • It is more efficient on the "source" side than on the "target" side.
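The link update described in the notes can be sketched as follows. This is an illustrative Python approximation, not the module's code: it uses Python's difflib in place of Algorithm::Diff, handles only the source side, and assumes 1-based (source, target) link pairs. The function name regexp_replace_source is hypothetical.

```python
import re
from difflib import SequenceMatcher


def regexp_replace_source(source_words, links, pattern, replacement):
    """Apply a regexp to the source sentence and re-map the links.

    Assumptions of this sketch: words are whitespace-separated, and
    links are 1-based (source_position, target_position) pairs.
    """
    old = list(source_words)
    new = re.sub(pattern, replacement, " ".join(old)).split()
    new_links = set()
    matcher = SequenceMatcher(a=old, b=new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Unchanged words keep their links, shifted to the new position.
            for offset in range(i2 - i1):
                for (s, t) in links:
                    if s == i1 + offset + 1:
                        new_links.add((j1 + offset + 1, t))
        elif tag == "replace":
            # Every added word is linked to every target position
            # that any of the deleted words was linked to.
            targets = {t for (s, t) in links if i1 < s <= i2}
            for j in range(j1, j2):
                for t in targets:
                    new_links.add((j + 1, t))
        # "delete": the removed words' links are dropped.
        # "insert": newly inserted words start without links.
    return new, new_links


if __name__ == "__main__":
    words = ["is", "it", "?", "yes", "."]
    links = {(1, 1), (2, 2), (3, 3), (4, 4), (5, 5)}
    print(regexp_replace_source(words, links, r"\?|\.", ""))
```

In this example the punctuation tokens disappear, and the word "yes" moves from source position 4 to position 3 while keeping its link to target position 4.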

<Lingua::Alignment::replaceWords>

Replaces, on one side of the corpus, a string (of words separated by white space) with another and updates the links accordingly. It takes 3 arguments: the string of words to be replaced, the string of replacement words, and the side (source or target); see the examples in the manual. Notes:

  • When several words are deleted, all added words are linked to all positions to which the deleted words were linked. The $al->{sourceLinks} information can be lost for replaced words.

  • It is more efficient on the "source" side than on the "target" side.

<Lingua::Alignment::intersect>

Takes the intersection between source-to-target and target-to-source alignments

<Lingua::Alignment::getUnion>

Takes the union between source-to-target and target-to-source alignments
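Intersection and union of the two link directions amount to ordinary set operations once both directions are expressed in the same orientation. The Python sketch below assumes both inputs have already been normalised to (source, target) pairs; the function names are hypothetical and the module's own data structures are not modelled.

```python
def intersect_links(source_to_target, target_to_source):
    """Keep only the links present in both directions."""
    return set(source_to_target) & set(target_to_source)


def union_links(source_to_target, target_to_source):
    """Keep the links present in at least one direction."""
    return set(source_to_target) | set(target_to_source)


if __name__ == "__main__":
    st = {(1, 1), (2, 2), (3, 2)}
    ts = {(1, 1), (2, 2), (3, 3)}
    print(sorted(intersect_links(st, ts)))
    print(sorted(union_links(st, ts)))
```

The intersection typically favours precision (fewer, more reliable links), while the union favours recall.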

etc. See the AlignmentSet.pm module documentation for more functions

OPTIONS

--is,--i_s,--i_source FILENAME

Input source (words) file name. Not applicable in GIZA Format.

--it,--i_t,--i_target FILENAME

Input target (words) file name. Not applicable in GIZA Format.

--its,--i_ts,--i_targetToSource FILENAME

Input target-to-source (i.e. links) file name (or directory, in case of BLINKER format)

--range BEGIN-END

Range of the input source-to-target file (BEGIN and END are the sentence pair numbers)

--os,--o_s,--o_source FILENAME

Output (new format) source (words) file name. Not applicable in GIZA Format.

--ot,--o_t,--o_target FILENAME

Output (new format) target (words) file name. Not applicable in GIZA Format.

--ots,--o_ts,--o_targetToSource FILENAME

Output (new format) target-to-source (i.e. links) file name (or directory, in case of BLINKER format)

--alignMode as-is|null-align|no-null-align

Take alignment "as-is" or force NULL alignment or NO-NULL alignment (see AlignmentSet.pm documentation).

--help, --?

Prints a help message and exits.

--man

Prints the manual page and exits.

DESCRIPTION

Processes an AlignmentSet by applying a function to the alignment of each sentence pair in the set. The Alignment.pm module contains such functions. This command-line utility is provided for convenience. For full details, see the documentation of the AlignmentSet.pm module.

EXAMPLES

Swapping source and target in source-to-target links file:

perl processAlignment_alSet-version.pl -ist test-giza.eng2spa.naacl -ost test-giza.swapped -sub Lingua::Alignment::swapSourceTarget

Remove '?' and '.' from the source side of the corpus:

perl processAlignment_alSet-version.pl -ist data/spanish-english.naacl -is data/spanish.naacl -ost data/spanish-english-without.naacl -os data/spanish-without.naacl -sub Lingua::Alignment::regexpReplace -sub '\?|\.' -sub '' -sub source

AUTHOR

Patrik Lambert <lambert@talp.upc.es>

COPYRIGHT AND LICENSE

Copyright 2004 by Patrik Lambert

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License (version 2 or any later version).
