Final Evaluation

Below we describe how the participants can submit their results, and how the winner(s) will be announced.

Evaluation Dataset

Final evaluation for the SIMMC DSTC9 track will be on the test-std split, which is different from the devtest split. Each test instance in test-std contains only the first K rounds of a dialog (not necessarily the entire dialog): we release the user utterances for rounds 1 through K and the system utterances for rounds 1 through K-1. Please refer to this table for the set of allowed inputs for each subtask.

For subtask 1, evaluation is on the assistant action (API call) for the Kth round. For subtask 2, evaluation is on the assistant utterance generated for the Kth round. For subtask 3, evaluation is on the dialog state predicted from the user utterances of rounds 1 through K.

For subtasks 1 and 2, there are about 1.2K predictions (one per dialog). For subtask 3, there are mean(K) * (number of dialogs) predictions, i.e., one per round.

We provide:

  • devtest, in the test-std format: to give participants an early look at what the test-std dataset will look like, we have re-formatted the already-released devtest set into the format of the test-std file. Please ensure that your script and model can run on fashion_devtest_dials_teststd_format_public.json and furniture_devtest_dials_teststd_format_public.json. Please note that Evaluation Phase 1 is on the entire devtest set.

  • test-std: in the main data folder, we release the test-std dataset for Evaluation Phase 2. Please check out ./data/simmc_{domain}/{domain}_teststd_dials{_|_api_calls_|_retrieval_candidates_}public.json and report your prediction results on those files, following the instructions below (a minimal loading sketch follows this list).
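
For a quick sanity check, the re-formatted files can be inspected with a few lines of Python. The snippet below is only a sketch: the field names "dialogue_data" and "dialogue" are assumptions about the released SIMMC JSON layout and may need adjusting to match your local copy.

import json

# Example path; substitute the furniture file as needed.
PATH = "./data/simmc_fashion/fashion_devtest_dials_teststd_format_public.json"

with open(PATH, "r") as f:
    data = json.load(f)

# Inspect the top-level structure before assuming anything about the schema.
print("top-level keys:", list(data.keys()))

# "dialogue_data" and "dialogue" are assumed field names; adjust if they differ.
dialogs = data.get("dialogue_data", [])
print("number of dialogs:", len(dialogs))

if dialogs:
    # Each dialog is truncated to its first K rounds; the Kth system
    # utterance is withheld because subtasks 1 and 2 predict it.
    rounds = [len(d.get("dialogue", [])) for d in dialogs]
    print("rounds per dialog (K): min={}, max={}".format(min(rounds), max(rounds)))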

Evaluation Criteria

Subtask | Evaluation | Metric Priority List
Subtask 1 (Multimodal Assistant API Prediction) | On assistant action (API call) for the Kth round | Action Accuracy, Attribute Accuracy, Action Perplexity
Subtask 2 (Multimodal Assistant Response Generation) | On assistant utterance generation for the Kth round | Generative category: BLEU-4; Retrieval category: MRR, R@1, R@5, R@10, Mean Rank
Subtask 3 (Multimodal Dialog State Tracking) | On dialog state based on user utterances from 1 through K | Slot F1, Intent F1

Separate winners will be announced for each subtask based on the respective performance, with the exception of subtask 2 (response generation), which will have two winners, one per category: generative metrics and retrieval metrics.

Rules to select the winner for each subtask (and category) are given below; a small sketch of the tie-breaking logic follows the list:

  • For each subtask, we enforce a priority order over the respective metrics (shown above) to highlight the model behavior desired by this challenge.

  • The entry with the most favorable performance (higher or lower, depending on the metric) will be labelled a winner candidate. All other entries within one standard error of this candidate's performance will also be considered candidates. If there is more than one candidate according to the metric, we will move to the next metric in the priority list and repeat this process until we have a single winner candidate, who will be declared the "subtask winner".

  • If multiple candidates remain even after running through the full list of metrics in priority order, all of them will be declared "joint subtask winners".
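
To make the tie-breaking procedure concrete, here is a small Python sketch of the selection rule, assuming each entry reports a (score, standard error) pair per metric. The function name, the example entries, and all numbers are purely illustrative and are not part of the official tooling.

def select_winners(entries, metric_priority):
    """entries: {team: {metric: (score, stderr)}}
    metric_priority: list of (metric_name, higher_is_better) in priority order."""
    candidates = list(entries)
    for metric, higher_is_better in metric_priority:
        scores = {team: entries[team][metric] for team in candidates}
        sign = 1 if higher_is_better else -1
        # Best-performing candidate on this metric.
        best_team = max(scores, key=lambda t: sign * scores[t][0])
        best_score, best_stderr = scores[best_team]
        # Keep every entry within one standard error of the best candidate.
        candidates = [t for t in candidates
                      if sign * scores[t][0] >= sign * best_score - best_stderr]
        if len(candidates) == 1:
            return candidates   # single subtask winner
    return candidates           # joint subtask winners

# Illustrative example with subtask 1's metric priority (made-up numbers):
entries = {
    "team_a": {"action_acc": (0.81, 0.01), "attr_acc": (0.74, 0.01), "perplexity": (1.9, 0.05)},
    "team_b": {"action_acc": (0.80, 0.01), "attr_acc": (0.70, 0.01), "perplexity": (2.1, 0.05)},
}
priority = [("action_acc", True), ("attr_acc", True), ("perplexity", False)]
print(select_winners(entries, priority))   # -> ['team_a']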

NOTE: Only entries that open-source their code will be considered for the final evaluation. In all other cases, we can only give "honorable mentions" based on devtest performance and cannot declare them winners of any subtask.

Submission Format

Participants must submit their model prediction results in JSON format (or the line-separated text format for subtask 3) so that they can be scored with the automatic scripts provided for each sub-task. Specifically, please name your output files as follows (the output format for subtasks 1 and 2 is given in the respective READMEs); a small helper that enumerates these names for both domains follows the list:

<Subtask 1>
dstc9-simmc-teststd-{domain}-subtask-1.json

<Subtask 2>
dstc9-simmc-teststd-{domain}-subtask-2-generation.json
dstc9-simmc-teststd-{domain}-subtask-2-retrieval.json

<Subtask 3>
dstc9-simmc-teststd-{domain}-subtask-3.txt (line-separated output)
or
dstc9-simmc-teststd-{domain}-subtask-3.json (JSON format)
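
The expected file names for both domains can also be generated programmatically. The short Python helper below only spells out the naming convention listed above; the expected_filenames helper itself is hypothetical and not part of the repository.

DOMAINS = ["fashion", "furniture"]

def expected_filenames(domain):
    # One entry per sub-task output listed above.
    return [
        "dstc9-simmc-teststd-{}-subtask-1.json".format(domain),
        "dstc9-simmc-teststd-{}-subtask-2-generation.json".format(domain),
        "dstc9-simmc-teststd-{}-subtask-2-retrieval.json".format(domain),
        # Subtask 3 accepts either a line-separated .txt or a .json file.
        "dstc9-simmc-teststd-{}-subtask-3.txt".format(domain),
    ]

for domain in DOMAINS:
    for name in expected_filenames(domain):
        print(name)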

The SIMMC organizers will then evaluate them internally using the following scripts:

<Subtask 1>
python tools/action_evaluation.py \
    --action_json_path={PATH_TO_API_CALLS} \
    --model_output_path={PATH_TO_MODEL_PREDICTIONS} \
    --single_round_evaluation

<Subtask 2 Generation>
python tools/response_evaluation.py \
    --data_json_path={PATH_TO_GOLD_RESPONSES} \
    --model_response_path={PATH_TO_MODEL_RESPONSES} \
    --single_round_evaluation

<Subtask 2 Retrieval>
python tools/retrieval_evaluation.py \
    --retrieval_json_path={PATH_TO_GROUNDTRUTH_RETRIEVAL} \
    --model_score_path={PATH_TO_MODEL_CANDIDATE_SCORES} \
    --single_round_evaluation

<Subtask 3>
(line-by-line evaluation)
python -m gpt2_dst.scripts.evaluate \
  --input_path_target={PATH_TO_GROUNDTRUTH_TARGET} \
  --input_path_predicted={PATH_TO_MODEL_PREDICTIONS} \
  --output_path_report={PATH_TO_REPORT}

(Or, dialog level evaluation)
python -m utils.evaluate_dst \
    --input_path_target={PATH_TO_GROUNDTRUTH_TARGET} \
    --input_path_predicted={PATH_TO_MODEL_PREDICTIONS} \
    --output_path_report={PATH_TO_REPORT}
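
Participants can chain the same commands to score their own devtest predictions before submitting. The Python wrapper below is a minimal sketch: every path marked TODO is a placeholder for your own ground-truth and prediction files, and the script simply shells out to the evaluation tools shown above (keep or drop --single_round_evaluation depending on whether you evaluate teststd-format or full dialogs).

import subprocess

domain = "fashion"   # or "furniture"

# TODO placeholders: point these at your own ground-truth and prediction files.
commands = [
    ["python", "tools/action_evaluation.py",
     "--action_json_path=TODO_PATH_TO_API_CALLS",
     "--model_output_path=TODO/dstc9-simmc-teststd-{}-subtask-1.json".format(domain),
     "--single_round_evaluation"],
    ["python", "tools/response_evaluation.py",
     "--data_json_path=TODO_PATH_TO_GOLD_RESPONSES",
     "--model_response_path=TODO/dstc9-simmc-teststd-{}-subtask-2-generation.json".format(domain),
     "--single_round_evaluation"],
    ["python", "tools/retrieval_evaluation.py",
     "--retrieval_json_path=TODO_PATH_TO_GROUNDTRUTH_RETRIEVAL",
     "--model_score_path=TODO/dstc9-simmc-teststd-{}-subtask-2-retrieval.json".format(domain),
     "--single_round_evaluation"],
    ["python", "-m", "utils.evaluate_dst",
     "--input_path_target=TODO_PATH_TO_GROUNDTRUTH_TARGET",
     "--input_path_predicted=TODO/dstc9-simmc-teststd-{}-subtask-3.json".format(domain),
     "--output_path_report=TODO_PATH_TO_REPORT"],
]

for cmd in commands:
    print(">>", " ".join(cmd))
    subprocess.run(cmd, check=True)   # stop on the first failing script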

Submission Instructions and Timeline

Before Sept 28th 2020 (Each Team)
  • Create a repository, e.g. on github.com, that can be made public under a permissive open-source license (MIT License preferred). The repository does not need to be publicly viewable at that time.
  • Tag a repository commit that contains both runnable code and the model parameter files for the team's entries to all sub-tasks attempted. Tag the commit with `dstc9-simmc-entry`.
  • Models (model parameter files) and code should have associated date-time stamps before Sept 27 23:59:59 Anywhere on Earth.

Sept 28th 2020 (SIMMC Organizers)
  • Test-Std data released (during US Pacific coast working hours).

Before Oct 5th 2020 (Each Team)
  • Generate test data predictions using the code and model versions previously tagged with `dstc9-simmc-entry`.
  • For each sub-task attempted, create a PR and check-in to the team's repository such that:
    • The PR/check-in contains an output directory with the model output in JSON format that can be scored with the automatic scripts provided for that sub-task.
    • The PR comments contain a short technical summary of the model.
    • The commit is tagged with `dstc9-simmc-test-subtask-{N}`, where `{N}` is the sub-task number.

By Oct 5th 2020 (Each Team)
  • Make the team repository public under a permissive open-source license (MIT License preferred).
  • Email the SIMMC organizers a link to the repository at [email protected].

Oct 5th - Oct 12th 2020 (SIMMC Organizers)
  • Validate the sub-task results.

Oct 12th 2020 (SIMMC Organizers)
  • Publish anonymized team rankings on the SIMMC track GitHub and email each team their anonymized team identity.

Post Oct 12th 2020 (SIMMC Organizers)
  • Write up a challenge summary paper. This may include error analysis of the results and extensions of the submitted results, e.g. with human scoring.