Exploiting Wild Diacritics Evaluation

The scripts provided here assume a Unix environment (Linux, macOS, etc.) and have been tested with Python 3.10 on macOS 13.6.7 on a 2020 Intel MacBook Pro.

Prerequisites

You'll need Python 3.8-3.10 installed. We suggest using pyenv to manage Python installations. You'll also need coreutils, CMake, and Boost installed.

On Debian/Ubuntu you can install these by running:

sudo apt-get update
sudo apt-get install coreutils cmake libboost-all-dev

On macOS using Homebrew you can run:

brew install coreutils cmake boost
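
Before running the setup scripts, it can help to confirm that the active interpreter falls within the supported 3.8-3.10 range. The following is a minimal sketch; the helper name `is_supported_python` is an illustration and not part of this repository.

```python
import sys

# Supported range stated in this README: Python 3.8 through 3.10 (inclusive).
MIN_VERSION = (3, 8)
MAX_VERSION = (3, 10)


def is_supported_python(version=None):
    """Return True if the (major, minor) pair lies within the supported range."""
    if version is None:
        version = sys.version_info
    major_minor = tuple(version[:2])
    return MIN_VERSION <= major_minor <= MAX_VERSION


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    status = "supported" if is_supported_python() else "outside the supported range"
    print(f"Python {current} is {status}")
```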

Initial Setup

We need to prepare the evaluation environment. All steps in this section only need to be run once.

First we need to set up the Python virtual environments used by all the scripts in this directory.

Note: If you have already run this command and later move the parent directory, you will have to rerun it.

./create_python_envs.sh

Then we need to prepare the files in the data directory for evaluation. This generates a local data directory containing the prepared files. To do so, run:

./prepare_data.sh

Finally, we need to unlock the morphological analyzer databases that are built using a dataset that isn't freely available. First purchase a copy of SAMA 3.1 from the Linguistic Data Consortium and then download it from the download page. You should have a file called LDC2010L01.tgz.

We can now unlock the database files by running:

./unmuddle_dbs.sh /path/to/LDC2010L01.tgz

Evaluating Wild2Max

To generate and evaluate predictions for the Wild2Max dev set, we run:

./run_wild2max_dev.sh

To generate and evaluate predictions for the Wild2Max test set, run:

./run_wild2max_test.sh

The above commands generate diacritization predictions in output/predictions/wild2max and final evaluation statistics in output/eval/wild2max.

Files in output/predictions of the form *.original.tsv contain predictions for individual genres using the original implementation of CAMeL Tools and an unmodified calima-s31 morphological database. Those of the form *.extended.tsv are created using our modified version of CAMeL Tools as well as the extended version of calima-s31. This includes predictions using our CT++ ranking algorithm.
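
To work with the two kinds of prediction files separately, the suffix convention above can be used to group them. This is a rough sketch that assumes only the `*.original.tsv` and `*.extended.tsv` suffixes described here; the function name `group_by_variant` is hypothetical.

```python
from pathlib import Path


def group_by_variant(predictions_dir):
    """Split TSV files under predictions_dir into 'original' and 'extended' lists."""
    groups = {"original": [], "extended": []}
    root = Path(predictions_dir)
    if not root.is_dir():
        # Nothing to group if the run scripts have not produced output yet.
        return groups
    for path in sorted(root.rglob("*.tsv")):
        for variant in groups:
            if path.name.endswith(f".{variant}.tsv"):
                groups[variant].append(path)
    return groups


if __name__ == "__main__":
    # Assumes one of the Wild2Max run scripts has already populated this directory.
    for variant, paths in group_by_variant("output/predictions/wild2max").items():
        print(variant, [p.name for p in paths])
```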

Files in output/eval contain the computed statistics that we report in our paper and follow the *.original.tsv and *.extended.tsv conventions mentioned above.

See the TSV Output Column Reference below for more information on the contents of these files.

Evaluating WikiNewsMax

To generate and evaluate predictions for WikiNewsMax, run:

./run_wikinewsmax.sh

The above command generates diacritization predictions in output/predictions/wikinewsmax and final evaluation statistics in output/eval/wikinews.

We use the same file suffix naming conventions as above for the generated prediction and evaluation files.

For this task we produce two sets of results. Both sets produce predictions using dediacritized WikiNews text as input. One uses the original WikiNews gold diacritizations to evaluate against and uses dediac_orig_gold as a file prefix. The other uses WikiNewsMax gold and alternative gold diacritizations to evaluate against and uses dediac_max_gold as a file prefix.
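
When post-processing these results, the file prefix indicates which gold standard a file was evaluated against. A small sketch of that lookup follows; only the two prefixes come from this README, while the function name and the example file name are made up for illustration.

```python
from pathlib import Path

# Prefixes documented above, mapped to the gold standard they are evaluated against.
GOLD_BY_PREFIX = {
    "dediac_orig_gold": "original WikiNews gold",
    "dediac_max_gold": "WikiNewsMax gold and alternative gold",
}


def gold_reference(path):
    """Return a description of the gold standard for a result file, or None."""
    name = Path(path).name
    for prefix, description in GOLD_BY_PREFIX.items():
        if name.startswith(prefix):
            return description
    return None


if __name__ == "__main__":
    # Hypothetical file name; the actual names produced by the scripts may differ.
    print(gold_reference("dediac_max_gold.extended.tsv"))
```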

See the TSV Output Column Reference below for more information on the contents of these files.

TSV Output Column Reference

Below are reference tables for the column names used in the produced output files in output/predictions and output/eval.

Prediction Files

| Field Name | Description |
| --- | --- |
| `word` | the original word |
| `gold_diac` | the gold (full) diacritization of the word |
| `gold_diac_alt` | optional alternative gold (full) diacritization of the word |
| `is_oov` | word is out-of-vocabulary (i.e., no analyses were produced or all produced analyses are backoffs) |
| `ct_noctx` | predicted diacritization using original CAMeL Tools ranking and no contextual fixes |
| `ct_soloctx` | predicted diacritization using original CAMeL Tools ranking and solo word contextual fixes |
| `ct_fullctx` | predicted diacritization using original CAMeL Tools ranking and full sentence contextual fixes |
| `ctpp_soloctx` | predicted diacritization produced using CT++ ranking and solo word contextual fixes |
| `ctpp_fullctx` | predicted diacritization produced using CT++ ranking and full sentence contextual fixes |
| `oracle_noctx` | oracle (best possible) diacritization provided no contextual fixes |
| `oracle_soloctx` | oracle (best possible) diacritization provided solo word contextual fixes |
| `oracle_fullctx` | oracle (best possible) diacritization provided full sentence contextual fixes |
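
As an illustration of how these columns relate, the sketch below computes a per-column accuracy from a prediction file. It rests on two assumptions that are not confirmed by this README: that the files are tab-separated with a header row, and that a prediction counts as correct when it equals `gold_diac` or a non-empty `gold_diac_alt`. The scoring in `evaluate.py` may differ. The sample rows use placeholder tokens, not real data.

```python
import csv
import io


def read_predictions(handle):
    """Parse a tab-separated prediction file with a header row."""
    return list(csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE))


def accuracy(rows, pred_column):
    """Percentage of rows whose prediction matches gold_diac or gold_diac_alt."""
    total = correct = 0
    for row in rows:
        total += 1
        golds = {row["gold_diac"]}
        alt = row.get("gold_diac_alt") or ""
        if alt:
            # Assumption: the alternative gold is an equally acceptable answer.
            golds.add(alt)
        if row[pred_column] in golds:
            correct += 1
    return 100.0 * correct / total if total else 0.0


# Tiny made-up sample with placeholder tokens instead of Arabic words.
SAMPLE = (
    "word\tgold_diac\tgold_diac_alt\tct_fullctx\tctpp_fullctx\n"
    "w1\tA\t\tA\tA\n"
    "w2\tB\tB2\tX\tB2\n"
)

if __name__ == "__main__":
    rows = read_predictions(io.StringIO(SAMPLE))
    for column in ("ct_fullctx", "ctpp_fullctx"):
        print(column, accuracy(rows, column))
```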

Evaluation Files

| Field Name | Description |
| --- | --- |
| `genre` | the genre being evaluated |
| `num_words` | total number of words in the given genre |
| `oov` | percentage of words that are out-of-vocabulary |
| `ct_noctx_accuracy` | percentage of words that have a correct `ct_noctx` prediction |
| `ct_soloctx_accuracy` | percentage of words that have a correct `ct_soloctx` prediction |
| `ct_fullctx_accuracy` | percentage of words that have a correct `ct_fullctx` prediction |
| `ctpp_soloctx_accuracy` | percentage of words that have a correct `ctpp_soloctx` prediction |
| `ctpp_fullctx_accuracy` | percentage of words that have a correct `ctpp_fullctx` prediction |
| `oracle_noctx_accuracy` | percentage of words that have a correct `oracle_noctx` prediction |
| `oracle_soloctx_accuracy` | percentage of words that have a correct `oracle_soloctx` prediction |
| `oracle_fullctx_accuracy` | percentage of words that have a correct `oracle_fullctx` prediction |
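
One way to read an evaluation file is to compare the CT++ ranking against the original CAMeL Tools ranking per genre. The sketch below assumes the files are tab-separated with a header row and that accuracy values are plain numbers without a percent sign; neither detail is stated in this README. The function name `ctpp_gain` and the sample values are invented for illustration.

```python
import csv
import io


def ctpp_gain(rows, context="fullctx"):
    """Map each genre to CT++ accuracy minus CT accuracy for the given context.

    Only 'soloctx' and 'fullctx' have CT++ columns, so 'noctx' is not valid here.
    """
    gains = {}
    for row in rows:
        ct = float(row[f"ct_{context}_accuracy"])
        ctpp = float(row[f"ctpp_{context}_accuracy"])
        gains[row["genre"]] = round(ctpp - ct, 2)
    return gains


# Made-up sample values; these are not results from the paper.
SAMPLE = (
    "genre\tnum_words\tct_fullctx_accuracy\tctpp_fullctx_accuracy\n"
    "genre_a\t100\t80.0\t85.5\n"
    "genre_b\t200\t90.0\t89.0\n"
)

if __name__ == "__main__":
    rows = list(csv.DictReader(io.StringIO(SAMPLE), delimiter="\t"))
    for genre, gain in ctpp_gain(rows).items():
        print(f"{genre}: {gain:+.2f} points")
```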
