The scripts provided here assume you have a Unix environment (Linux, macOS, etc.) and have been tested using Python 3.10 on macOS 13.6.7 running on a 2020 Intel MacBook Pro.
You'll need to have Python 3.8-3.10 installed. We suggest using pyenv to manage Python installations. You'll also need to have coreutils, CMake, and Boost installed.
On Debian/Ubuntu you can install these by running:
sudo apt-get update
sudo apt-get install coreutils cmake libboost-all-dev
On macOS using Homebrew you can run:
brew install coreutils cmake boost
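Before continuing, it can help to confirm that the prerequisites are visible from your shell. The following is a minimal sketch and not part of the provided scripts: the helper names `python_version_supported` and `missing_tools` are hypothetical, and the check only covers the interpreter version and whether `cmake` is on the `PATH`. It does not verify that Boost or coreutils are installed.

```python
import shutil
import sys

# Supported range taken from the requirements above (Python 3.8-3.10).
MIN_VERSION = (3, 8)
MAX_VERSION = (3, 10)


def python_version_supported(version=None):
    """Return True if the (major, minor) version falls within 3.8-3.10."""
    if version is None:
        version = sys.version_info
    major_minor = tuple(version[:2])
    return MIN_VERSION <= major_minor <= MAX_VERSION


def missing_tools(tools=("cmake",)):
    """Return the executables from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    if not python_version_supported():
        print("Unsupported Python version:", sys.version.split()[0])
    for tool in missing_tools():
        print("Not found on PATH:", tool)
```

If the script prints nothing, the interpreter is in the supported range and `cmake` was found.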
We need to prepare the evaluation environment. All steps in this section only need to be run once.
First we need to set up the Python virtual environments used by all the scripts in this directory.
Note: You will have to rerun this command if you move the parent directory after having run it previously.
./create_envs.sh
Then we need to prepare the file in the `data` directory for evaluation. This will generate a local `data` directory with the prepared files. To do so, we run:
./prepare_data.sh
Finally, we need to unlock the morphological analyzer databases that are built using a dataset that isn't freely available.
First purchase a copy of SAMA 3.1 from the Linguistic Data Consortium and then download it from the download page.
You should have a file called `LDC2010L01.tgz`.
We can now unlock the database files by running:
./unmuddle_dbs.sh /path/to/LDC2010L01.tgz
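An incomplete download is a common reason for an archive to fail to unpack. As a rough sanity check before running the unlock script, you could confirm that the file opens as a tar archive. This is only a sketch: the helper name `looks_like_tarball` is hypothetical, and it says nothing about whether the archive contents are what `unmuddle_dbs.sh` expects.

```python
import tarfile
from pathlib import Path


def looks_like_tarball(path):
    """Return True if `path` is an existing file that tarfile can open."""
    path = Path(path)
    # tarfile.is_tarfile handles compressed archives such as .tgz as well.
    return path.is_file() and tarfile.is_tarfile(str(path))
```

For example, `looks_like_tarball("/path/to/LDC2010L01.tgz")` should return `True` for an intact download.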
To generate and evaluate predictions for the Wild2Max dev set, we run:
./run_wild2max_dev.sh
To generate and evaluate predictions for the Wild2Max test set, run:
./run_wild2max_test.sh
The above commands generate diacritization predictions in `output/predictions/wild2max` and final evaluation statistics in `output/eval/wild2max`.
Files in `output/predictions` of the form `*.original.tsv` contain predictions for individual genres using the original implementation of CAMeL Tools and an unmodified calima-s31 morphological database.
Those of the form `*.extended.tsv` are created using our modified version of CAMeL Tools as well as the extended version of calima-s31.
This includes predictions using our CT++ ranking algorithm.
Files in `output/eval` contain the computed statistics that we report in our paper and follow the `*.original.tsv` and `*.extended.tsv` conventions mentioned above.
See the TSV Output Column Reference below for more information on the contents of these files.
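To see at a glance which files a run produced, the suffix conventions above can be used to group them. The sketch below is illustrative only: `group_by_variant` is a hypothetical helper, and it assumes the TSV files sit directly inside the directory you pass in rather than in nested subdirectories.

```python
from pathlib import Path


def group_by_variant(directory):
    """Map 'original' and 'extended' to sorted lists of matching TSV paths."""
    directory = Path(directory)
    return {
        variant: sorted(directory.glob(f"*.{variant}.tsv"))
        for variant in ("original", "extended")
    }


if __name__ == "__main__":
    groups = group_by_variant("output/predictions/wild2max")
    for variant, paths in groups.items():
        print(variant, [path.name for path in paths])
```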
To generate and evaluate predictions for WikiNewsMax, run:
./run_wikinewsmax.sh
The above command generates diacritization predictions in `output/predictions/wikinewsmax` and final evaluation statistics in `output/eval/wikinews`.
We use the same file suffix naming conventions as above for the generated prediction and evaluation files.
For this task we produce two sets of results.
Both sets produce predictions using dediacritized WikiNews text as input.
One uses the original WikiNews gold diacritizations to evaluate against and uses `dediac_orig_gold` as a file prefix.
The other uses WikiNewsMax gold and alternative gold diacritizations to evaluate against and uses `dediac_max_gold` as a file prefix.
See the TSV Output Column Reference below for more information on the contents of these files.
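Since each WikiNewsMax result file combines a gold-standard prefix with a variant suffix, a small helper can label a file from its name alone. This is a sketch under the assumption that the prefix appears at the very start of the file name and the suffix at the very end; `describe_result_file` and the example file name in the usage note are hypothetical.

```python
from pathlib import Path

# Prefixes and their meanings as described above.
GOLD_PREFIXES = {
    "dediac_orig_gold": "original WikiNews gold",
    "dediac_max_gold": "WikiNewsMax gold and alternative gold",
}
VARIANT_SUFFIXES = (".original.tsv", ".extended.tsv")


def describe_result_file(path):
    """Return (gold description, variant) for a file name, or None for unknown parts."""
    name = Path(path).name
    gold = next(
        (label for prefix, label in GOLD_PREFIXES.items() if name.startswith(prefix)),
        None,
    )
    variant = next(
        (suffix.split(".")[1] for suffix in VARIANT_SUFFIXES if name.endswith(suffix)),
        None,
    )
    return gold, variant
```

For a file named `dediac_max_gold.extended.tsv`, the helper returns the WikiNewsMax gold description together with `"extended"`.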
Below are reference tables for the column names used in the produced output files in `output/predictions` and `output/eval`.
Field Name | Description |
---|---|
word | the original word |
gold_diac | the gold (full) diacritization of the word |
gold_diac_alt | optional alternative gold (full) diacritization of the word |
is_oov | word is out-of-vocabulary (i.e., no analyses were produced or all produced analyses are backoffs) |
ct_noctx | predicted diacritization using original CAMeL Tools ranking and no contextual fixes |
ct_soloctx | predicted diacritization using original CAMeL Tools ranking and solo word contextual fixes |
ct_fullctx | predicted diacritization using original CAMeL Tools ranking and full sentence contextual fixes |
ctpp_soloctx | predicted diacritization produced using CT++ ranking and solo word contextual fixes |
ctpp_fullctx | predicted diacritization produced using CT++ ranking and full sentence contextual fixes |
oracle_noctx | oracle (best possible) diacritization provided no contextual fixes |
oracle_soloctx | oracle (best possible) diacritization provided solo word contextual fixes |
oracle_fullctx | oracle (best possible) diacritization provided full sentence contextual fixes |
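As an illustration of how the prediction columns relate to the gold columns, the sketch below recomputes a word-level accuracy for one prediction column. It is not the evaluation code used by the provided scripts. It assumes that each file has a header row with the field names listed above, that a prediction counts as correct when it exactly matches `gold_diac` or a non-empty `gold_diac_alt`, and that no further normalization is applied. The function name `column_accuracy` is hypothetical.

```python
import csv


def column_accuracy(tsv_path, column):
    """Percentage of rows whose `column` value matches gold or alternative gold."""
    total = 0
    correct = 0
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        # QUOTE_NONE keeps quote characters in the text as literal characters.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            total += 1
            accepted = {row["gold_diac"]}
            alt = row.get("gold_diac_alt")
            if alt:  # the alternative gold is optional and may be empty
                accepted.add(alt)
            if row[column] in accepted:
                correct += 1
    return 100.0 * correct / total if total else 0.0
```

Calling `column_accuracy(path, "ct_fullctx")` on a prediction file gives a figure to compare against the corresponding evaluation file. Any difference would suggest the actual scoring applies rules this sketch does not.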
Field Name | Description |
---|---|
genre | the genre being evaluated |
num_words | total number of words in the given genre |
oov | percentage of words that are out-of-vocabulary |
ct_noctx_accuracy | percentage of words that have a correct ct_noctx prediction |
ct_soloctx_accuracy | percentage of words that have a correct ct_soloctx prediction |
ct_fullctx_accuracy | percentage of words that have a correct ct_fullctx prediction |
ctpp_soloctx_accuracy | percentage of words that have a correct ctpp_soloctx prediction |
ctpp_fullctx_accuracy | percentage of words that have a correct ctpp_fullctx prediction |
oracle_noctx_accuracy | percentage of words that have a correct oracle_noctx prediction |
oracle_soloctx_accuracy | percentage of words that have a correct oracle_soloctx prediction |
oracle_fullctx_accuracy | percentage of words that have a correct oracle_fullctx prediction |
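The evaluation columns make it straightforward to compare the two ranking approaches per genre. The sketch below computes the difference between the CT++ and original CAMeL Tools accuracies. It assumes a header row with the field names above and accuracy values that parse as plain numbers (no percent sign), and it is meant for files that contain the `ctpp_*` columns. The function name `ctpp_gain_by_genre` is hypothetical.

```python
import csv


def ctpp_gain_by_genre(tsv_path, context="fullctx"):
    """Map each genre to ctpp_<context>_accuracy minus ct_<context>_accuracy."""
    gains = {}
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            ct = float(row[f"ct_{context}_accuracy"])
            ctpp = float(row[f"ctpp_{context}_accuracy"])
            gains[row["genre"]] = ctpp - ct
    return gains
```

Passing `context="soloctx"` compares the solo word variants instead. A positive value means CT++ ranking scored higher for that genre.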