# HOW TO RUN
python rerank_refactored.py --language <language> --subdirectory withAffixFeatsNoAffixCands/ --config_file rerankll.m3ps.evalonly.config --l1 <l1> --l2 <l2> [--nn <neural_network_layers>]
e.g.
python rerank_refactored.py --language az --subdirectory withAffixFeatsNoAffixCands/ --config_file rerankll.m3ps.evalonly.config --l1 0.01 --l2 0.01 --nn 10
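If several regularization settings need to be compared, the same command can be scripted. Below is a minimal sketch that sweeps a small --l1/--l2 grid via subprocess; the grid values are arbitrary and only for illustration.

```python
# Sketch: run rerank_refactored.py over a small grid of --l1/--l2 values.
# The command-line flags mirror the example above; the grid is illustrative.
import itertools
import subprocess

language = "az"
for l1, l2 in itertools.product([0.001, 0.01, 0.1], repeat=2):
    cmd = [
        "python", "rerank_refactored.py",
        "--language", language,
        "--subdirectory", "withAffixFeatsNoAffixCands/",
        "--config_file", "rerankll.m3ps.evalonly.config",
        "--l1", str(l1),
        "--l2", str(l2),
    ]
    print("running:", " ".join(cmd))
    subprocess.run(cmd, check=True)
```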
For this to work, your directory should have:
- A language directory, named after the --language option, with crawled data, e.g. az. This language directory should contain the input data for the language, which can be wiki or crawled data processed by Babel, e.g. az/crawls-minf10
- A burstinessMeasures directory with burst.<$language$>.en and burst.<$language$>.<$language$> files
- A config file, which specifies various configuration options. An example would be:

    useprefix=true
    useprefixcands=false
    inputDataDirectory=.
    burstinessMeasuresDirectory=burstinessMeasures
    wikipredir=wiki-minf10-Prefix/output
    crawlspredir=crawls-minf10-Prefix/output
    usesuffix=true
    usesuffixcands=false
    wikisufdir=wiki-minf10-Suffix/output
    crawlssufdir=crawls-minf10-Suffix/output
    byfreq=false
    crawlsdir=crawls-minf10/output
    wikidir=wiki-minf10/output
    frwikiwords=wiki-minf10/output/srcinduct.list
    frcrawlswords=crawls-minf10/output/srcinduct.list
    dictfile=/nlp/users/shreejit/MTurkDicts/mturk.$LANG$
    useLog=true
    readData=true
    writeData=true
    doClassification=true
- An MTurk dictionary file, as specified by the dictfile option in the config file, which contains individual word translations gathered via MTurk HITs
The wikipredir, crawlspredir, wikisufdir, wikidir, crawlsdir, frwikiwords, and frcrawlswords options are resolved relative to the language directory, i.e. inputDataDirectory/<language>/<$wikipredir$> and so on.
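The config format is plain key=value settings. Below is a minimal sketch of reading such a file and resolving the per-language paths described above; the parsing and path joining follow this document's description rather than the script's actual code, so treat the helper as illustrative only.

```python
# Sketch: read a key=value config file and resolve per-language input paths
# as described above (inputDataDirectory/<language>/<option>). Illustrative only.
import os

def read_config(path):
    config = {}
    with open(path) as fh:
        for token in fh.read().split():
            key, _, value = token.partition("=")
            config[key] = value
    return config

config = read_config("rerankll.m3ps.evalonly.config")
language = "az"

lang_dir = os.path.join(config["inputDataDirectory"], language)
crawls_output = os.path.join(lang_dir, config["crawlsdir"])    # e.g. ./az/crawls-minf10/output
wiki_output = os.path.join(lang_dir, config["wikidir"])        # e.g. ./az/wiki-minf10/output
dict_file = config["dictfile"].replace("$LANG$", language)     # e.g. .../mturk.az
print(crawls_output, wiki_output, dict_file)
```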
The script will then create the directory given by --subdirectory, with a language directory inside it. If writeData is set to true, it will recreate the training, test, and blind data (training and test data are used for learning the parameters; blind data is used for evaluation).
So in subdirectory/language you should have 3 sets of data files (a small check sketch follows this list):
- train.*
- test.*
- blind.*
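After a run with writeData=true, a quick check like the following can confirm that all three splits were written. This is a sketch only; the withAffixFeatsNoAffixCands/az output path matches the example run later in this document.

```python
# Sketch: verify that train/test/blind files exist in the output directory.
# Only the train./test./blind. prefixes are taken from this document.
import os

out_dir = os.path.join("withAffixFeatsNoAffixCands", "az")
for prefix in ("train.", "test.", "blind."):
    matches = sorted(f for f in os.listdir(out_dir) if f.startswith(prefix))
    print(prefix, "OK" if matches else "MISSING", matches)
```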
The learning is done with the Vowpal Wabbit tool, so that needs to be installed as well.
The script will run the evaluation and print out the scores for the rankings. The Cand1, Cand10, and Cand100 numbers are to be interpreted as the number of correct translations found in the top n candidates.
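For reference, these metrics can be computed from the (1-based) rank of the correct translation for each evaluated source word. The sketch below shows that computation; the input ranks are made up for illustration and the function names are not taken from the script.

```python
# Sketch: mean reciprocal rank and top-n accuracy from ranks of correct translations.
def mrr(ranks):
    return sum(1.0 / r for r in ranks) / len(ranks)

def top_n_accuracy(ranks, n):
    return sum(1 for r in ranks if r <= n) / len(ranks)

ranks = [1, 4, 12, 2, 57, 200, 3]   # illustrative ranks, one per source word
print("MRR:", mrr(ranks))
for n in (1, 10, 100):
    print(f"Accuracy in top-{n}:", top_n_accuracy(ranks, n))
```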
- This is an example of a full run with the Azeri language. First, make sure you have mturk.az in /nlp/users/shreejit/MTurkDicts (as specified in the example config file).
- Make sure you have the az/crawls-minf10/output directory, where az is inside the inputDataDirectory. For now the script looks for this exact directory name, which means that within the az language directory there is a crawls directory containing data only for words that appear more than 10 times in the corpus. The output subdirectory holds the processed data. It should have the following files (a pre-flight check sketch that verifies these inputs appears at the end of this document):
- aggmrr.eval
- aggmrr.scored
- context.eval
- context.scored
- edit.eval
- edit.scored
- srcinduct.list
- time.eval
- time.scored
- Make sure you have the burstinessMeasures directory in the inputDataDirectory (the cwd). It should have:
- burst.az.az
- burst.az.en
- Make sure you have the config file. This is used by the inductor and evaluator. It should look something like this:
    useprefix=true
    useprefixcands=false
    inputDataDirectory=.
    burstinessMeasuresDirectory=burstinessMeasures
    wikipredir=wiki-minf10-Prefix/output
    crawlspredir=crawls-minf10-Prefix/output
    usesuffix=true
    usesuffixcands=false
    wikisufdir=wiki-minf10-Suffix/output
    crawlssufdir=crawls-minf10-Suffix/output
    byfreq=false
    crawlsdir=crawls-minf10/output
    wikidir=wiki-minf10/output
    frwikiwords=wiki-minf10/output/srcinduct.list
    frcrawlswords=crawls-minf10/output/srcinduct.list
    dictfile=/nlp/users/shreejit/MTurkDicts/mturk.$LANG$
    useLog=true
    readData=true
    writeData=true
    doClassification=true
- Run the following command with the language, output subdirectory, config file and regularization parameters, e.g.:

    python rerank_refactored.py --language az --subdirectory withAffixFeatsNoAffixCands/ --config_file rerankll.m3ps.ref.config --l1 0.01 --l2 0.01
- You should now have a withAffixFeatsNoAffixCands directory in the current working directory, with az inside it (./withAffixFeatsNoAffixCands/az). It should contain the following files:
- az.NoEnRanked
- blind.data
- blind.data.index
- test.data
- test.data.index
- test.predictions
- test.rankcompare
- test.reranked.scored
- test.scores
- train.data
- train.data.cache
- train.data.index
- train.model
- train.model.readable
- The program should also print the accuracy of the run to stdout in a format like this:
Mean reciprocal rank of AGG-MRR ranks: 0.0
Mean reciprocal rank of ML-learned weighted ranks: 0.132917895797
Accuracy in top-1: MRR: 1.0 Cand: 0.0771670190275
Accuracy in top-10: MRR: 1.0 Cand: 0.233615221987
Accuracy in top-100: MRR: 1.0 Cand: 0.492600422833
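When comparing many runs, these scores can be scraped from stdout. The sketch below pulls the Cand numbers out of output in the format shown above; the regular expression is written against this example output only.

```python
# Sketch: parse "Accuracy in top-N: MRR: ... Cand: ..." lines from the script's stdout.
import re

sample_output = """
Accuracy in top-1: MRR: 1.0 Cand: 0.0771670190275
Accuracy in top-10: MRR: 1.0 Cand: 0.233615221987
Accuracy in top-100: MRR: 1.0 Cand: 0.492600422833
"""

scores = {
    int(n): float(cand)
    for n, cand in re.findall(r"Accuracy in top-(\d+): MRR: \S+ Cand: (\S+)", sample_output)
}
print(scores)   # {1: 0.0771670190275, 10: 0.233615221987, 100: 0.492600422833}
```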
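Finally, here is a pre-flight check covering the inputs listed in this walkthrough (the crawls-minf10/output files, the burstiness measures, and the MTurk dictionary). It is a sketch only and assumes inputDataDirectory=. and language az, as in the example above.

```python
# Sketch: verify the input files this walkthrough expects before running.
import os

language = "az"
required = [
    os.path.join(language, "crawls-minf10", "output", name)
    for name in (
        "aggmrr.eval", "aggmrr.scored", "context.eval", "context.scored",
        "edit.eval", "edit.scored", "srcinduct.list", "time.eval", "time.scored",
    )
] + [
    os.path.join("burstinessMeasures", f"burst.{language}.{language}"),
    os.path.join("burstinessMeasures", f"burst.{language}.en"),
    os.path.join("/nlp/users/shreejit/MTurkDicts", f"mturk.{language}"),
]

missing = [path for path in required if not os.path.exists(path)]
print("all inputs present" if not missing else "missing: " + ", ".join(missing))
```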