
--watch option

SYNOPSIS:

 touch timestamp.file
 treex --watch=timestamp.file my.scen & # or without & and open another terminal
 # after all documents are processed, treex is still running, watching timestamp.file
 # you can modify any modules/blocks and then touch timestamp.file
 # All modified modules will be reloaded (the number of reloaded modules is printed).
 # The document reader is restarted, so it starts reading the first file again.
 # To exit this "watching loop" either rm timestamp.file or press Ctrl^C.

BENEFITS:

* much faster development cycles (e.g. most of the time in en-cs translation is spent on loading)
* Now I have some non-deterministic problems with loading NER::Stanford - using --watch I get it loaded in all jobs once and then I don't have to reload it.

TODO:

* modules are just reloaded, no constructors are called yet

NAME

Treex::Core::Run + treex - applying Treex blocks and/or scenarios on data

VERSION

version 2.20150928

SYNOPSIS

In bash:

 > treex myscenario.scen -- data/*.treex
 > treex My::Block1 My::Block2 -- data/*.treex

In Perl:

 use Treex::Core::Run q(treex);
 treex([qw(myscenario.scen -- data/*.treex)]);
 treex([qw(My::Block1 My::Block2 -- data/*.treex)]);

DESCRIPTION

Treex::Core::Run allows you to apply a block, a scenario, or a mixture of both to a set of data files. It is designed to be used primarily from the bash command line, via a thin front-end script called treex. However, the same list of arguments can be passed as an array reference to the function treex() imported from Treex::Core::Run.
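For example, blocks and scenario files can be mixed freely in one invocation (a minimal sketch; preprocess.scen is a hypothetical scenario file, while Util::SetGlobal is the block mentioned under the -L option below):

 # set the language globally, then run a saved scenario on the data
 treex Util::SetGlobal language=en preprocess.scen -- data/*.treex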

Note that this module supports distributed processing (Linux-only!), simply by adding the switch -p. The treex method then creates a Treex::Core::Parallel::Head object, which extends Treex::Core::Run by providing parallel processing functionality.

There are then two ways to process the data in parallel. By default, the SGE cluster's qsub is expected to be available. If you have no cluster but want to parallelize the computation at least on a multicore machine, add the --local switch.
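For instance (a minimal sketch; the scenario name, job counts, and file names are illustrative):

 # on an SGE cluster: split the work into 20 jobs submitted via qsub
 treex -p --jobs 20 my.scen -- data/*.treex

 # without a cluster: run 4 jobs locally on a multicore machine
 treex -p --local --jobs 4 my.scen -- data/*.treex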

SUBROUTINES

treex

Creates a new runner and runs the scenario given in the parameters.

USAGE

 usage: treex [-?dEehjLmpqSstv] [long options...] scenario [-- treex_files]
 scenario is a sequence of blocks or *.scen files
 options:
        -h -? --usage --help         Prints this usage information.
        -s --save                    save all documents
        -q --quiet                   Warning, info and debug messages are
                                     suppressed. Only fatal errors are
                                     reported.
        --cleanup                    Delete all temporary files.
        -e --error_level             Possible values: ALL, DEBUG, INFO, WARN,
                                     FATAL
        -L --language --lang         shortcut for adding "Util::SetGlobal
                                     language=xy" at the beginning of the
                                     scenario
        -S --selector                shortcut for adding "Util::SetGlobal
                                     selector=xy" at the beginning of the
                                     scenario
        -t --tokenize                shortcut for adding "Read::Sentences
                                     W2A::Tokenize" at the beginning of the
                                     scenario (or W2A::XY::Tokenize if used
                                     with --lang=xy)
        --watch                      re-run when the given file is changed
                                     (see the "--watch option" section above)
        -d --dump_scenario           Just dump (print to STDOUT) the given
                                     scenario and exit.
        --dump_required_files        Just dump (print to STDOUT) files
                                     required by the given scenario and exit.
        --cache                      Use the cache. The required memory is
                                     specified in the format
                                     memcached,loading. Numbers are in GB.
        -v --version                 Print treex and perl version
        -E --forward_error_level     messages with this level or higher will
                                     be forwarded from the distributed jobs
                                     to the main STDERR
        -p --parallel                Parallelize the task on SGE cluster
                                     (using qsub).
        -j --jobs                    Number of jobs for parallelization,
                                     default 10. Requires -p.
        --local                      Run jobs locally (might help with
                                     multi-core machines). Requires -p.
        --priority                   Priority for qsub, an integer in the
                                     range -1023 to 0 (or 1024 for admins),
                                     default=-100. Requires -p.
        --memory -m --mem            How much memory should be allocated for
                                     cluster jobs, default=2G. Requires -p.
                                     Translates to "qsub -hard -l
                                     mem_free=$mem -l h_vmem=2*$mem -l
                                     act_mem_free=$mem". Use --mem=0 and
                                     --qsub to set your own SGE settings
                                     (e.g. if act_mem_free is not available).
        --name                       Prefix of submitted jobs. Requires -p.
                                     Translates to "qsub -N $name-jobname".
        --qsub                       Additional parameters passed to qsub.
                                     Requires -p. See --priority and --mem.
                                     You can use e.g. --qsub="-q *@p*,*@s*"
                                     to use just machines p* and s*. Or e.g.
                                     --qsub="-q *@!(twi*|pan*)" to skip twi*
                                     and pan* machines.
        --workdir                    Working directory for temporary files in
                                     parallelized processing. Directories can
                                     be created automatically using patterns:
                                     {NNN} is replaced by an ordinal number
                                     zero-padded to the number of Ns, {XXXX}
                                     is replaced by a random string whose
                                     length equals the number of Xs (min. 4).
                                     If not specified, directories such as
                                     001-cluster-run, 002-cluster-run etc.
                                     are created. See the example after this
                                     list.
        --survive                    Continue collecting jobs' outputs even
                                     if some of them crashed (risky, use with
                                     care!).
        --jobindex                   Not to be used manually. If the number
                                     of jobs is set to J and the modulo to M,
                                     only the I-th files fulfilling
                                     I mod J == M are processed.
        --outdir                     Not to be used manually. Directory for
                                     collecting standard and error outputs in
                                     parallelized processing.
        --server                     Not to be used manually. Used to point
                                     parallel jobs to the head.
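For example, the parallelization options above can be combined as follows (a minimal sketch; the job count, memory limit, and workdir pattern are illustrative):

 # 50 cluster jobs with 4G of memory each, writing temporary files
 # into automatically numbered directories run-001, run-002, ...
 treex -p --jobs 50 --mem 4G --workdir 'run-{NNN}' my.scen -- data/*.treex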

AUTHORS

Zdeněk Žabokrtský <zabokrtsky@ufal.mff.cuni.cz>

Martin Popel <popel@ufal.mff.cuni.cz>

Martin Majliš

Ondřej Dušek <odusek@ufal.mff.cuni.cz>

COPYRIGHT AND LICENSE

Copyright © 2011-2014 by Institute of Formal and Applied Linguistics, Charles University in Prague

This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.