Below we describe how the participants can submit their results, and how the winner(s) will be announced.
Final evaluation for the SIMMC DSTC9 track will be on the `test-std` split, which is different from the `devtest` split. Each test instance in `test-std` contains only K rounds (not necessarily the entire dialog): we release the user utterances for rounds 1 through K, and the system utterances for rounds 1 through K-1. Please refer to the table listing the set of allowed inputs for each subtask.

- For subtask 1, evaluation is on the assistant action (API call) for the K-th round.
- For subtask 2, evaluation is on the assistant utterance generation for the K-th round.
- For subtask 3, evaluation is on dialog state prediction based on user utterances from rounds 1 through K.

For subtasks 1 and 2 there are 1.2K predictions (one per dialog); for subtask 3 there are mean(K) * (number of dialogs) predictions.
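To make the round indexing concrete, below is a minimal sketch of reading a `test-std`-format file and gathering the inputs available for each dialog's final (K-th) round. The field names (`dialogue_data`, `dialogue`, `transcript`, `system_transcript`) are assumptions based on the released SIMMC dialog JSONs; please verify them against the actual `teststd_format` files.

```python
# Minimal sketch (not official code): iterate a test-std-format file and collect the
# inputs available for the K-th round of each dialog.
# NOTE: the field names below are assumptions; check the released JSON files.
import json

with open("fashion_devtest_dials_teststd_format_public.json") as f:
    data = json.load(f)

for dialog in data["dialogue_data"]:
    turns = dialog["dialogue"]                        # only the first K rounds are released
    K = len(turns)
    user_utts = [t["transcript"] for t in turns]              # user utterances, rounds 1..K
    sys_utts = [t["system_transcript"] for t in turns[:-1]]   # system utterances, rounds 1..K-1
    # Subtasks 1 & 2: predict the assistant action / response for round K.
    # Subtask 3: predict the dialog state from user utterances 1..K.
```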
We provide:

- `devtest`, in the `test-std` format: to give participants an early heads-up on what the `test-std` dataset will look like, we re-formatted the already-released `devtest` set in the format of the `test-std` file. Please ensure that your script and model are compatible and can run on `fashion_devtest_dials_teststd_format_public.json` and `furniture_devtest_dials_teststd_format_public.json`. Please note that Evaluation Phase 1 is on the entire `devtest` set.
- `test-std`: in the main data folder, we release the `test-std` dataset for Evaluation Phase 2. Please check out `./data/simmc_{domain}/{domain}_teststd_dials{_|_api_calls_|_retrieval_candidates_}public.json`, and report the prediction results on those files following the instructions below.
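For reference, the path pattern above expands to three files per domain (dialogs, API calls, and retrieval candidates). A small sketch of enumerating them follows; it is purely illustrative, and the exact filenames should be verified under `./data/`.

```python
# Sketch: expand the test-std path pattern for both domains.
import os

DATA_ROOT = "./data"
# The {_|_api_calls_|_retrieval_candidates_} alternatives in the pattern:
INFIXES = ["_", "_api_calls_", "_retrieval_candidates_"]

for domain in ("fashion", "furniture"):
    for infix in INFIXES:
        path = os.path.join(DATA_ROOT, f"simmc_{domain}",
                            f"{domain}_teststd_dials{infix}public.json")
        print(path, "exists" if os.path.exists(path) else "missing")
```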
| Subtask | Evaluation | Metric Priority List |
|---|---|---|
| Subtask 1 (Multimodal Assistant API Prediction) | On the assistant action (API call) for the K-th round | Action Accuracy, Attribute Accuracy, Action Perplexity |
| Subtask 2 (Multimodal Assistant Response Generation) | On the assistant utterance generation for the K-th round | Generative category: BLEU-4. Retrieval category: MRR, R@1, R@5, R@10, Mean Rank |
| Subtask 3 (Multimodal Dialog State Tracking) | On the dialog state based on user utterances from 1 through K | Slot F1, Intent F1 |
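For the retrieval category of subtask 2, the metrics in the table are standard ranking metrics. The sketch below shows how they are typically computed from per-candidate scores; it illustrates the definitions only and is not the official `tools/retrieval_evaluation.py` implementation, and the input layout (`scores`, `gt_index`) is an assumption.

```python
# Illustrative computation of MRR, R@k, and Mean Rank from candidate scores.
# scores[i]   : list of model scores for all retrieval candidates of round i
# gt_index[i] : index of the ground-truth response among those candidates
import numpy as np

def retrieval_metrics(scores, gt_index, k_values=(1, 5, 10)):
    ranks = []
    for cand_scores, gt in zip(scores, gt_index):
        order = np.argsort(-np.asarray(cand_scores))         # higher score = better
        ranks.append(int(np.where(order == gt)[0][0]) + 1)   # 1-based rank of ground truth
    ranks = np.asarray(ranks)
    metrics = {"MRR": float(np.mean(1.0 / ranks)), "Mean Rank": float(np.mean(ranks))}
    for k in k_values:
        metrics[f"R@{k}"] = float(np.mean(ranks <= k))
    return metrics
```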
Separate winners will be announced for each subtask based on the respective performance, with the exception of subtask 2 (response generation), which will have two winners based on two categories: generative metrics and retrieval metrics.
Rules to select the winner for each subtask (and category) are given below (a sketch of this procedure follows the list):

- For each subtask, we enforce a priority over the respective metrics (shown above) to highlight the model behavior desired by this challenge.
- The entry with the most favorable (higher or lower, depending on the metric) performance on the highest-priority metric will be labelled a winner candidate. All other entries within one standard error of this candidate's performance will also be considered candidates. If there is more than one candidate according to that metric, we will move to the next metric in the priority list and repeat this process until we have a single winner candidate, which will be declared the "subtask winner".
- If multiple candidates remain even after running through the list of metrics in priority order, all of them will be declared "joint subtask winners".
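The following is a sketch of that selection procedure, assuming each entry reports a (mean, standard error) pair per metric; it is illustrative, not the organizers' actual tie-breaking code.

```python
# Illustrative winner selection: walk the metric priority list, keeping every entry
# within one standard error of the current best, until a single candidate remains.
def select_winners(entries, metric_priority, higher_is_better):
    """entries: {team: {metric: (mean, std_err)}}"""
    candidates = list(entries)
    for metric in metric_priority:
        sign = 1.0 if higher_is_better[metric] else -1.0
        best = max(candidates, key=lambda t: sign * entries[t][metric][0])
        best_mean, best_err = entries[best][metric]
        candidates = [t for t in candidates
                      if sign * entries[t][metric][0] >= sign * best_mean - best_err]
        if len(candidates) == 1:
            return candidates            # single "subtask winner"
    return candidates                    # "joint subtask winners"
```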
NOTE: Only entries that are able to open-source their code will be considered for the final evaluation. In all other cases, we can only give “honorable mentions” based on the devtest performance and cannot declare them as winners of any subtask.
Participants must submit the model prediction results in a format that can be scored with the automatic scripts provided for each subtask. Specifically, please name your output files as follows (the formats for subtasks 1 and 2 are given in the respective READMEs):
<Subtask 1>
- `dstc9-simmc-teststd-{domain}-subtask-1.json`

<Subtask 2>
- `dstc9-simmc-teststd-{domain}-subtask-2-generation.json`
- `dstc9-simmc-teststd-{domain}-subtask-2-retrieval.json`

<Subtask 3>
- `dstc9-simmc-teststd-{domain}-subtask-3.txt` (line-separated output), or
- `dstc9-simmc-teststd-{domain}-subtask-3.json` (JSON format)
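The naming convention can also be generated programmatically. The hypothetical helper below is not part of the official tooling, and the prediction content itself must still follow the respective subtask README.

```python
# Hypothetical helper for writing prediction files with the required names.
import json

def write_submission(predictions, domain, subtask, variant=None, ext="json"):
    name = f"dstc9-simmc-teststd-{domain}-subtask-{subtask}"
    if variant:                      # "generation" or "retrieval" for subtask 2
        name += f"-{variant}"
    filename = f"{name}.{ext}"
    with open(filename, "w") as f:
        json.dump(predictions, f)    # for the subtask-3 .txt option, write line-separated text instead
    return filename

# e.g. write_submission(preds_1, "fashion", 1)
#      write_submission(preds_2r, "furniture", 2, variant="retrieval")
```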
The SIMMC organizers will then evaluate them internally using the following scripts:
<Subtask 1>
```
python tools/action_evaluation.py \
    --action_json_path={PATH_TO_API_CALLS} \
    --model_output_path={PATH_TO_MODEL_PREDICTIONS} \
    --single_round_evaluation
```

<Subtask 2 Generation>
```
python tools/response_evaluation.py \
    --data_json_path={PATH_TO_GOLD_RESPONSES} \
    --model_response_path={PATH_TO_MODEL_RESPONSES} \
    --single_round_evaluation
```

<Subtask 2 Retrieval>
```
python tools/retrieval_evaluation.py \
    --retrieval_json_path={PATH_TO_GROUNDTRUTH_RETRIEVAL} \
    --model_score_path={PATH_TO_MODEL_CANDIDATE_SCORES} \
    --single_round_evaluation
```

<Subtask 3>
(line-by-line evaluation)
```
python -m gpt2_dst.scripts.evaluate \
    --input_path_target={PATH_TO_GROUNDTRUTH_TARGET} \
    --input_path_predicted={PATH_TO_MODEL_PREDICTIONS} \
    --output_path_report={PATH_TO_REPORT}
```

(or, dialog-level evaluation)
```
python -m utils.evaluate_dst \
    --input_path_target={PATH_TO_GROUNDTRUTH_TARGET} \
    --input_path_predicted={PATH_TO_MODEL_PREDICTIONS} \
    --output_path_report={PATH_TO_REPORT}
```
| When | Who | What |
|---|---|---|
| Before Sept 28th 2020 | Each Team | Each participating team should create a repository, e.g. on github.com, that can be made public under a permissive open source license (MIT License preferred). The repository does not need to be publicly viewable at that time. |
| | | Before Sept 28th, tag a repository commit that contains both runnable code and the model parameter files that are the team's entries for all subtasks attempted. Tag the commit with `dstc9-simmc-entry`. |
| | | Models (model parameter files) and code should have associated date-time stamps which are before Sept 27 23:59:59 anywhere on Earth. |
| Sept 28th 2020 | SIMMC Organizers | Test-Std data released (during US Pacific coast working hours). |
| Before Oct 5th 2020 | Each Team | Generate test data predictions using the code & model versions tagged previously with `dstc9-simmc-entry`. |
| | | For each subtask attempted, create a PR and check-in to the team's repository containing the prediction files (named as described above). |
| By Oct 5th 2020 | Each Team | Make the team repository public under a permissive open source license (MIT License preferred). |
| | | Email the SIMMC Organizers a link to the repository at [email protected]. |
| Oct 5th - Oct 12th 2020 | SIMMC Organizers | SIMMC organizers validate the subtask results. |
| Oct 12th 2020 | SIMMC Organizers | Publish anonymized team rankings on the SIMMC track GitHub and email each team their anonymized team identity. |
| Post Oct 12th 2020 | SIMMC Organizers | Our plan is to write up a challenge summary paper. In this, we may conduct error analysis of the results and may look to extend the submitted results, e.g. possibly with human scoring. |