SYNOPSIS:

 touch timestamp.file
 treex --watch=timestamp.file my.scen &  # or without & and open another terminal
 # After all documents are processed, treex is still running, watching timestamp.file.
 # You can modify any modules/blocks and then touch timestamp.file;
 # all modified modules will be reloaded (the number of reloaded modules is printed).
 # The document reader is restarted, so it starts reading the first file again.
 # To exit this "watching loop", either rm timestamp.file or press Ctrl+C.
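The watch loop above can be sketched generically in shell. This is a minimal illustration of the mtime-polling idea, not treex's actual implementation; the echo messages merely stand in for reloading modules and restarting the reader, and `date -r` (GNU coreutils) is used to read a file's modification time:

```shell
#!/bin/sh
# Sketch of a --watch-style loop: re-run when the watched file's mtime
# advances, stop when the file disappears.
watched=timestamp.file
events=""                          # record what happened, for inspection

touch "$watched"
last=$(date -r "$watched" +%s)     # mtime in epoch seconds
sleep 1
touch "$watched"                   # the developer signals "re-run" by touching the file
now=$(date -r "$watched" +%s)

if [ "$now" -gt "$last" ]; then
    events="${events}changed "
    echo "timestamp advanced: reload modified modules, restart the reader"
fi

rm -f "$watched"                   # removing the file is the documented way out
if [ ! -e "$watched" ]; then
    events="${events}stopped"
    echo "watched file removed: leaving the watch loop"
fi
```

A real watcher would wrap the mtime check in a loop with a short sleep; the single pass here just shows both exit conditions.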
BENEFITS:

* Much faster development cycles (e.g. most of the time of en-cs translation is spent on loading).
* Now I have some non-deterministic problems with loading NER::Stanford - using --watch I get it loaded on all jobs once and then I don't have to reload it.
TODO:

* Modules are just reloaded; no constructors are called yet.
Treex::Core::Run + treex - applying Treex blocks and/or scenarios on data
version 1.20150902
In bash:
 treex myscenario.scen -- data/*.treex
 treex My::Block1 My::Block2 -- data/*.treex
In Perl:
 use Treex::Core::Run q(treex);
 treex([qw(myscenario.scen -- data/*.treex)]);
 treex([qw(My::Block1 My::Block2 -- data/*.treex)]);
Treex::Core::Run allows you to apply a block, a scenario, or a mixture of both to a set of data files. It is designed to be used primarily from the bash command line, via a thin front-end script called treex. However, the same list of arguments can be passed as an array reference to the treex() function imported from Treex::Core::Run.
Note that this module supports distributed processing (Linux-only!): simply add the switch -p. The treex method then creates a Treex::Core::Parallel::Head object, which extends Treex::Core::Run with parallel processing functionality.
There are then two ways to process the data in parallel. By default, an SGE cluster's qsub is expected to be available. If you have no cluster but want to parallelize the computation at least across the cores of one machine, add the --local switch.
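Putting these switches together, typical parallel invocations might look like the following. This is illustrative only: my.scen, the data paths, and the job counts are placeholders; the flags themselves (-p, -j, --mem, --local) are those documented below.

```shell
# On an SGE cluster: 20 jobs submitted via qsub, 4G of memory each
treex -p -j 20 --mem=4G my.scen -- data/*.treex

# No cluster available: run 4 jobs locally on a multicore machine
treex -p --local -j 4 my.scen -- data/*.treex
```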
Creates a new runner and runs the scenario given in the parameters.
 usage: treex [-?dEehjLmpqSstv] [long options...] scenario [-- treex_files]

 scenario is a sequence of blocks or *.scen files

 options:
   -h -? --usage --help      Prints this usage information.
   -s --save                 Save all documents.
   -q --quiet                Warning, info and debug messages are suppressed.
                             Only fatal errors are reported.
   --cleanup                 Delete all temporary files.
   -e --error_level          Possible values: ALL, DEBUG, INFO, WARN, FATAL.
   -L --language --lang      Shortcut for adding "Util::SetGlobal language=xy"
                             at the beginning of the scenario.
   -S --selector             Shortcut for adding "Util::SetGlobal selector=xy"
                             at the beginning of the scenario.
   -t --tokenize             Shortcut for adding "Read::Sentences W2A::Tokenize"
                             at the beginning of the scenario
                             (or W2A::XY::Tokenize if used with --lang=xy).
   --watch                   Re-run when the given file is changed.
                             TODO: better doc.
   -d --dump_scenario        Just dump (print to STDOUT) the given scenario and exit.
   --dump_required_files     Just dump (print to STDOUT) files required by the
                             given scenario and exit.
   --cache                   Use cache. Required memory is specified in the format
                             memcached,loading. Numbers are in GB.
   -v --version              Print treex and perl version.
   -E --forward_error_level  Messages with this level or higher will be forwarded
                             from the distributed jobs to the main STDERR.
   -p --parallel             Parallelize the task on an SGE cluster (using qsub).
   -j --jobs                 Number of jobs for parallelization, default 10.
                             Requires -p.
   --local                   Run jobs locally (might help with multi-core
                             machines). Requires -p.
   --priority                Priority for qsub, an integer in the range -1023 to 0
                             (or 1024 for admins), default=-100. Requires -p.
   --memory -m --mem         How much memory should be allocated for cluster jobs,
                             default=2G. Requires -p. Translates to "qsub -hard
                             -l mem_free=$mem -l h_vmem=2*$mem -l act_mem_free=$mem".
                             Use --mem=0 and --qsub to set your own SGE settings
                             (e.g. if act_mem_free is not available).
   --name                    Prefix of submitted jobs. Requires -p.
                             Translates to "qsub -N $name-jobname".
   --qsub                    Additional parameters passed to qsub. Requires -p.
                             See --priority and --mem. You can use e.g.
                             --qsub="-q *@p*,*@s*" to use just machines p* and s*,
                             or e.g. --qsub="-q *@!(twi*|pan*)" to skip twi* and
                             pan* machines.
   --workdir                 Working directory for temporary files in parallelized
                             processing. Directories can be named automatically
                             using patterns: {NNN} is replaced by an ordinal number
                             with as many leading zeros as needed to match the
                             number of Ns; {XXXX} is replaced by a random string
                             whose length is the same as the number of Xs (min. 4).
                             If not specified, directories such as 001-cluster-run,
                             002-cluster-run etc. are created.
   --survive                 Continue collecting jobs' outputs even if some of them
                             crashed (risky, use with care!).
   --jobindex                Not to be used manually. If the number of jobs is set
                             to J and modulo set to M, only the I-th files
                             fulfilling I mod J == M are processed.
   --outdir                  Not to be used manually. Directory for collecting
                             standard and error outputs in parallelized processing.
   --server                  Not to be used manually. Used to point parallel jobs
                             to the head.
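The {NNN} pattern for --workdir can be illustrated with a small shell sketch. This is an assumption-laden mock-up, not treex's code: the pad width is taken to be the number of Ns, and the ordinal is zero-padded to that width:

```shell
#!/bin/sh
# Expand a {NNN}-style pattern: {NNN} with ordinal 1 -> 001.
pattern='{NNN}-cluster-run'
ordinal=1

width=$(printf '%s' "$pattern" | tr -cd 'N' | wc -c)   # how many Ns -> pad width
padded=$(printf "%0${width}d" "$ordinal")              # zero-pad the ordinal
workdir=$(printf '%s' "$pattern" | sed "s/{N*}/$padded/")
echo "$workdir"
```

With the default-style pattern above, the first run would get the directory 001-cluster-run, matching the documented behavior when --workdir is not given.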
Zdeněk Žabokrtský <zabokrtsky@ufal.mff.cuni.cz>
Martin Popel <popel@ufal.mff.cuni.cz>
Martin Majliš
Ondřej Dušek <odusek@ufal.mff.cuni.cz>
Copyright © 2011-2014 by Institute of Formal and Applied Linguistics, Charles University in Prague
This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
To install Treex::Core, copy and paste the appropriate command into your terminal.
cpanm
cpanm Treex::Core
CPAN shell
 perl -MCPAN -e shell
 install Treex::Core
For more information on module installation, please visit the detailed CPAN module installation guide.