tmx2cwb - encodes a pair of languages from a TMX file as a CWB corpus
tmx2cwb <-from=PT> <-to=EN> <-corpora=/corpora> <-registry=/usr/share/cwb/registry> <-toksource> <-toktarget> [file.tmx]
This program encodes a pair of languages extracted from a TMX file as a CWB corpus. Optionally it can tokenize the text (using basic tokenizing rules).
Accepted options are:
The -from and -to options are useful when the TMX file contains more than two languages: they let the user choose which languages are encoded. When only two languages are present, they are used by default. To force a particular order on those two languages, it is enough to specify one of the two options.
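As an illustration, the following shell sketch selects Portuguese as the source and English as the target language. The option syntax follows the synopsis above; the wrapper name encode_pt_en and the file name multi.tmx are hypothetical placeholders.

```shell
# Hypothetical wrapper: encode PT as source and EN as target.
# The -from=XX / -to=XX syntax is taken from the synopsis.
encode_pt_en() {
    tmx2cwb -from=PT -to=EN "$1"
}

# Usage (multi.tmx is a placeholder file name):
#   encode_pt_en multi.tmx
```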
Path to the directory where the corpus should be encoded. Defaults to the
Path to the CWB registry folder. The tool tries to guess it using the cwb-config command or the CORPUS_REGISTRY environment variable. If neither yields a registry path, you will need to specify it explicitly.
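A lookup of this kind could be sketched in shell as follows. This is not the tool's own code: the function name guess_registry is hypothetical, the order of precedence (environment variable first, then cwb-config) is an assumption, and the sketch assumes the installed cwb-config accepts --registry.

```shell
# Sketch of a registry lookup. Assumption: the environment variable
# is checked first; the tool's real precedence is not documented here.
guess_registry() {
    if [ -n "${CORPUS_REGISTRY:-}" ]; then
        # Explicit setting in the environment.
        printf '%s\n' "$CORPUS_REGISTRY"
    elif command -v cwb-config >/dev/null 2>&1; then
        # Ask the CWB installation for its registry directory.
        cwb-config --registry
    else
        echo "cannot guess registry; pass -registry=DIR" >&2
        return 1
    fi
}

# Usage (file.tmx is a placeholder file name):
#   tmx2cwb -registry="$(guess_registry)" file.tmx
```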
The -toksource and -toktarget options make the tool tokenize the source and/or the target language. Note that the tokenizing rules work well for Portuguese, are acceptable for Spanish, English, French and Italian, and are likely to perform poorly for other languages.
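For instance, when the target side of the TMX file is assumed to be tokenized already, only the source side needs tokenizing. The wrapper name encode_tokenized_source and the file name are hypothetical placeholders; the flag comes from the synopsis.

```shell
# Hypothetical wrapper: tokenize only the source language.
# Assumes the target side of the TMX file is already tokenized.
encode_tokenized_source() {
    tmx2cwb -toksource "$1"
}

# Usage (file.tmx is a placeholder file name):
#   encode_tokenized_source file.tmx
```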
Prints basic help information.
Alberto Manuel Brandão Simões, <firstname.lastname@example.org>
Copyright (C) 2010-2011 by Alberto Manuel Brandão Simões