Code for a new machine translation benchmark, Tatoeba #15

Open
wants to merge 4 commits into
base: master

Traubert

Hi, I'm proposing to integrate the Tatoeba machine translation dataset into sotabench-eval. I have included code for running the tests, modeled after the WMT evaluator, and for downloading and configuring the data. I'm not yet sure how the caching is supposed to work, so I'll come back to that.

Currently you can:

import sotabencheval
from sotabencheval.machine_translation import TatoebaEvaluator, TatoebaDataset

# The test data will be downloaded and unpacked under the directory "tatoeba".
# This only needs to be done if the data isn't already present.
sotabencheval.machine_translation.tatoeba.fetch_and_configure_data("tatoeba")
evaluator = TatoebaEvaluator(
    dataset=TatoebaDataset.v1,
    source_lang="eng",
    target_lang="deu",
    local_root="tatoeba",
    model_name="Some model",
    paper_arxiv_id="Some id",
)

evaluator.add({1: "Tom mag die italienische Küche.", 2: "Hier wirst du viel lernen."})
print(evaluator.get_results(ignore_missing=True))
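
The "only if the data isn't already present" step could be guarded with a small helper along these lines. This is a rough sketch, not part of the PR: `ensure_data` is a hypothetical name, and treating "present" as "the directory exists and is non-empty" is an assumption.

```python
from pathlib import Path
from typing import Callable


def ensure_data(local_root: str, fetch: Callable[[str], None]) -> bool:
    """Call fetch(local_root) only when local_root is missing or empty.

    Hypothetical helper: returns True if fetch was called,
    False if the existing directory was reused.
    """
    root = Path(local_root)
    # Assumption: a non-empty directory means the data is already there.
    if root.is_dir() and any(root.iterdir()):
        return False
    fetch(local_root)
    return True
```

With this, the download line above might become `ensure_data("tatoeba", sotabencheval.machine_translation.tatoeba.fetch_and_configure_data)`, so repeated runs skip the download. It does not check that the existing files are complete or uncorrupted.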

You should be able to merge this without breaking anything, but please point me towards what else needs to be done.
