Welcome to OTEDAMA, a system for extracting preordering rules for machine translation. Otedama currently works for English and Chinese (using the Stanford Parser) as source languages and any target language for which word alignments can be extracted. External source-language parsers using the CoNLL output format are also supported.
## Installation

OTEDAMA comes pre-packaged with the Stanford Parser. You just need a working installation of Java (version 8 or higher, including a compiler).
- Download Otedama by running:
git clone https://github.com/StatNLP/otedama
- Download the Stanford Parser models by running:
cd otedama/lib/parser/
wget http://www.cl.uni-heidelberg.de/statnlpgroup/otedama/stanford-parser-3.5.2-models.jar
cd ../..
- Running ./build.sh in the installation directory should create the files parser.jar, zhparser.jar, conll2tree.jar, learner.jar and reorder.jar in the bin subdirectory.
- If all went well, you should be able to run the example script (see example/README.txt for instructions).
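In summary, a complete installation might look like this (a recap of the steps above, assuming git, wget and a Java 8 JDK are available on your PATH):

```
git clone https://github.com/StatNLP/otedama
cd otedama/lib/parser/
wget http://www.cl.uni-heidelberg.de/statnlpgroup/otedama/stanford-parser-3.5.2-models.jar
cd ../..
./build.sh
```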
Note that because output sentences are reconstructed from parse trees, the parser's tokenization of the input carries over into the output (if your parser performs automatic tokenization). For meaningful comparison, it is recommended to use tokenized text in your experiments or to de-tokenize after reordering.
In order to use the Stanford parser, you need to download the models, which are not included in this repository:
- Download the parser models (stanford-parser-3.5.2-models.jar, 282 MB) from the URL given in the installation instructions above.
- Copy stanford-parser-3.5.2-models.jar to the lib/parser/ directory in this repo.
You can also try a different version of the Stanford parser and parser models (which will probably work just fine), but this has not been tested.
## Parsing

java -jar bin/parser.jar <english training corpus> <alignments english-foreign> <output file> <verbose_mode: v(erbose) or q(uiet)>
Example:
java -jar bin/parser.jar train.en train.en-fr.align train.en-fr.trees v
Specifying "no_align" instead of an alignment file will create an unaligned treebank (such as for reordering test data).
java -jar bin/parser.jar <english_training_corpus> no_align <output_file> <verbose_mode: v(erbose) or q(uiet)>
Example:
java -jar bin/parser.jar test.en no_align test.en.trees q
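The alignment file is expected to contain one line of word alignments per sentence pair. A common convention (assumed here, as produced by tools such as GIZA++ or fast_align) is whitespace-separated source-target index pairs, e.g. for a four-word sentence:

```
0-0 1-2 2-1 3-3
```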
- For Chinese, use bin/zhparser.jar instead of bin/parser.jar.
- For external dependency parsers that produce output in CoNLL format (http://ilk.uvt.nl/conll/#dataformat), run
java -jar bin/conll2tree.jar <source input file in CoNLL format> <source-target alignment file> <output file>
to convert labelled dependencies to trees in Otedama's own format (see the format sketch below).
- Support for other output formats from external parsers is planned.
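For reference, input in the CoNLL-X format consists of one token per line with tab-separated columns (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL) and a blank line between sentences. A minimal illustrative sentence (the exact tag sets depend on your parser):

```
1	The	the	DET	DT	_	2	det	_	_
2	dog	dog	NOUN	NN	_	3	nsubj	_	_
3	barked	bark	VERB	VBD	_	0	root	_	_
```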
## Rule Learning

java -jar bin/learner.jar <config file>
Example:
java -jar bin/learner.jar example/example.config
Parameters for the learner are specified in a config file. See example/example.config for a template (a hypothetical sketch is also shown after the parameter list below). The parameters are as follows:
- TRAINING_TREEBANK_FILE: Parsed training data, created by parser.jar or conll2tree.jar
- INITIAL_SUBSAMPLE_SIZE (e.g. 10): A new random subset of the specified size is chosen as a learning treebank for extracting rule candidates in each iteration. Bigger training subsets lead to longer training times, but increase the probability of finding a near-optimal new rule in each iteration. Should be at least 4N, where N is the number of CPUs available, for adequate load balancing.
- MAX_RULE_CROSSING_SCORE: Bound on the effect on alignment monotonicity of the whole training corpus for candidate rules. Should be 0 or a negative number.
- MIN_MATCHING_FEATURES: For window size 4: 12 for exact matching, <12 for fuzzy matching. For window size 3: 10 for exact matching, <10 for fuzzy matching. For window size 2: 8 for exact matching, <8 for fuzzy matching. The given value is adjusted automatically for nodes that have too few children to fulfill the given matching criterion. Small values yield very poor results.
- PARALLEL_THREADS: Number of parallel threads for learning.
- RULE_OUTPUT_FILE: Output model file where learned rules are written.
- MIN_REDUCTION_FACTOR: Variance constraint for rules: each new rule must reduce the crossing score on at least n times as many sentences as the number of sentences on which it increases it. For noisy data sets, a minimum reduction factor of 2.0 is recommended.
- WINDOW_SIZE: Size of the sliding window during rule extraction. Values between 2 and 4 are supported.
- USE_FEATURE_SUBSETS: If 'y', subsets of the matching context are also extracted as feature sets for candidate rules.
- LOGGING: v[erbose] or q[uiet]
- MAX_WAITING_TIME_MINS: Maximum time in minutes that may elapse without any new rules being learned before training is stopped. Longer time limits will result in more rules being learned. For large datasets, a minimum of 10 minutes waiting time is recommended.
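For illustration, a complete config file might look like the following sketch. The parameter names are those listed above, but the values are hypothetical and should be tuned for your data; treat example/example.config as the authoritative template:

```
TRAINING_TREEBANK_FILE=train.en-fr.trees
INITIAL_SUBSAMPLE_SIZE=10
MAX_RULE_CROSSING_SCORE=0
MIN_MATCHING_FEATURES=12
PARALLEL_THREADS=8
RULE_OUTPUT_FILE=en-fr.rules
MIN_REDUCTION_FACTOR=2.0
WINDOW_SIZE=4
USE_FEATURE_SUBSETS=y
LOGGING=q
MAX_WAITING_TIME_MINS=10
```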
## Reordering

java -jar bin/reorder.jar <english treebank> <rule file> <config file> <english output treebank> <english surface output> <batch-size> <number of parallel threads>
Example:
java -jar bin/reorder.jar test.en.trees en-fr.rules en-fr.config test.en.trees.reordered test.en.reordered 1000 10
Greater batch sizes can lead to higher memory consumption, but also to higher throughput. It is highly recommended to use the same config file as in the learning step.
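Putting the steps together, a complete pipeline run might look like this (a sketch with hypothetical file names; adjust paths, config and thread counts to your setup):

```
# Parse the training data and attach word alignments
java -jar bin/parser.jar train.en train.en-fr.align train.en-fr.trees q

# Learn reordering rules
java -jar bin/learner.jar en-fr.config

# Parse the test data without alignments
java -jar bin/parser.jar test.en no_align test.en.trees q

# Apply the learned rules to the test data
java -jar bin/reorder.jar test.en.trees en-fr.rules en-fr.config test.en.trees.reordered test.en.reordered 1000 8
```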
Good luck and have fun!
## Example Script
For instructions on how to run an example script of the entire Otedama pipeline, see example/README.txt
## Computing Alignment Monotonicity
A python script for computing the monotonicity of a given word alignment can be found at scripts/crossing_score.py. For example, you can run
python crossing_score.py <alignment_file> none
to obtain the total number of alignment crossings in <alignment_file>. Optionally, a file with alignment transformations can be specified as the second parameter.
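A crossing here is a pair of alignment links (i, j) and (k, l) with i < k but j > l; a fully monotone alignment has zero crossings. The following minimal Python sketch illustrates the computation (assuming "i-j" alignment pairs, one sentence per line; the bundled script may differ in details and additionally supports transformations):

```python
import sys
from itertools import combinations

def crossings(links):
    """Count link pairs (i, j), (k, l) with i < k and j > l."""
    return sum(1 for (i, j), (k, l) in combinations(sorted(links), 2) if j > l)

total = 0
with open(sys.argv[1]) as f:
    for line in f:
        # Parse pairs like "0-0 1-2 2-1" into (source, target) index tuples.
        links = [tuple(map(int, pair.split("-"))) for pair in line.split()]
        total += crossings(links)

print(total)
```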
## References

Julian Hitschler, Laura Jehl, Sariya Karimova, Mayumi Ohta, Benjamin Körner, Stefan Riezler. Otedama: Fast Rule-Based Pre-Ordering for Machine Translation. The Prague Bulletin of Mathematical Linguistics, vol. 106, pp. 159-168. Charles University, Prague, Czech Republic, 2016: http://ufal.mff.cuni.cz/pbml/106/art-hitschler-et-al.pdf
D. Genzel. Automatically Learning Source-side Reordering Rules for Large Scale Machine Translation. Proceedings of the 23rd International Conference on Computational Linguistics (COLING 2010), pp. 376-384, 2010.
M. Collins, P. Koehn, I. Kučerová. Clause Restructuring for Statistical Machine Translation. Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005), pp. 531-540, 2005.
## Contact

For questions and comments, please contact Julian Hitschler: [email protected]. We are happy to help with any issue you may encounter while using Otedama, no matter how big or small.
## About

OTEDAMA was created 2013-2016 by Julian Hitschler, Laura Jehl, Sariya Karimova, Mayumi Ohta, Benjamin Körner and Stefan Riezler at the Institute for Computational Linguistics, University of Heidelberg, Germany (www.cl.uni-heidelberg.de).

OTEDAMA is licensed under the BSD 2-Clause License: https://opensource.org/licenses/bsd-license.php

OTEDAMA is named after a Japanese juggling game for children: https://en.wikipedia.org/wiki/Otedama
## Citation

If you use Otedama, please cite the following publication:
Julian Hitschler, Laura Jehl, Sariya Karimova, Mayumi Ohta, Benjamin Körner, Stefan Riezler. Otedama: Fast Rule-Based Pre-Ordering for Machine Translation. The Prague Bulletin of Mathematical Linguistics, vol. 106, pp. 159-168. Charles University, Prague, Czech Republic, 2016: http://ufal.mff.cuni.cz/pbml/106/art-hitschler-et-al.pdf