Shared task website: https://sites.google.com/view/vardial-2025/shared-tasks
All training and development data of the xSID-0.6 dataset may be used for training.
In addition, we provide a machine-translated version of the English training corpus to Norwegian, with projected annotations: nb.projectedTrain.conll.fixed. The quality of both the translation and the annotation projection is relatively poor. Participants wishing to improve the annotation projection may find it helpful to use the existing xSID code.
The development data can be found in norsid_dev.conll.
Example:
# id = 33/8
# text = Kor varmt skal det ver i dag?
# intent = weather/find
# dialect = V
1 Kor weather/find O
2 varmt weather/find B-weather/attribute
3 skal weather/find O
4 det weather/find O
5 ver weather/find O
6 i weather/find B-datetime
7 dag weather/find I-datetime
8 ? weather/find O
- The `id` field contains the sentence id (`33`) and the translator id (`8`), separated by `/`.
  - Sentence ids range from 1 to 300 in the development set. All sentences with the same id are translations of each other, i.e. sentence 78/1 is just another dialectal variant of sentence 78/2.
  - Translator ids range from 1 to 10 as well as B (Bokmål).
- The `text` field represents the detokenized prompt string.
- The `intent` field contains the name of the intent associated with the prompt.
  - Some intent labels contain `/` but others don't.
- The `dialect` field contains the dialect label (one uppercase character).
  - There are four dialect labels: `V, N, T, B` (see Table below for details).
- The numbered lines contain the tokens of the prompt, together with the intent label and the slot annotations.
  - Some slot labels contain `/` but others don't (e.g. `weather/attribute` vs. `datetime`).
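
The field layout described above can be read with a few lines of Python. This is a minimal sketch, not an official loader: the names `Sentence`, `parse_conll` and `split_id` are hypothetical, and it assumes that sentences are separated by blank lines and that the four columns of a token line are separated by whitespace (tabs or spaces).

```python
from dataclasses import dataclass, field


@dataclass
class Sentence:
    meta: dict = field(default_factory=dict)    # "# key = value" header lines
    tokens: list = field(default_factory=list)  # (form, intent, slot) triples


def parse_conll(text):
    """Parse sentence blocks (assumed to be blank-line separated)."""
    sentences, current = [], Sentence()
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            # A blank line closes the current sentence, if there is one.
            if current.meta or current.tokens:
                sentences.append(current)
                current = Sentence()
            continue
        if line.startswith("#"):
            # Split on the first "=" only, so the text itself may contain "=".
            key, _, value = line[1:].partition("=")
            current.meta[key.strip()] = value.strip()
        else:
            # Assumption: exactly four whitespace-separated columns.
            _index, form, intent, slot = line.split()
            current.tokens.append((form, intent, slot))
    if current.meta or current.tokens:
        sentences.append(current)
    return sentences


def split_id(sent_id):
    """Split '33/8' into the sentence id (int) and the translator id (str)."""
    sentence_id, translator_id = sent_id.split("/")
    return int(sentence_id), translator_id


EXAMPLE = """\
# id = 33/8
# text = Kor varmt skal det ver i dag?
# intent = weather/find
# dialect = V
1 Kor weather/find O
2 varmt weather/find B-weather/attribute
3 skal weather/find O
4 det weather/find O
5 ver weather/find O
6 i weather/find B-datetime
7 dag weather/find I-datetime
8 ? weather/find O
"""

sent = parse_conll(EXAMPLE)[0]
sentence_id, translator_id = split_id(sent.meta["id"])
```

The translator id is kept as a string because it can be `B` as well as a number. Since `/` also appears inside some intent and slot labels, only the `id` field is split on it.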
The correspondences between dialect areas and translator ids are as follows:
Translator ID | Origin | Dialect area |
---|---|---|
1 | Tromsø | N (North Norwegian) |
2 | Tromsø area | N (North Norwegian) |
3 | Trondheim | T (Trøndersk) |
4 | Trondheim | T (Trøndersk) |
5 | Sunndal | T (Trøndersk) |
6 | Ålesund area | V (West Norwegian) |
7 | Haugesund | V (West Norwegian) |
8 | Stavanger | V (West Norwegian) |
9 | Stavanger | V (West Norwegian) |
10 | Bergen | V (West Norwegian) |
B | | B (Bokmål) |
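
For convenience, the table can be written as a lookup from translator id to dialect label. This is only an illustrative sketch: the names `TRANSLATOR_TO_DIALECT` and `dialect_of` are hypothetical, and the mapping simply restates the rows above.

```python
# Translator id -> dialect label, copied from the table above.
TRANSLATOR_TO_DIALECT = {
    "1": "N", "2": "N",                                    # North Norwegian
    "3": "T", "4": "T", "5": "T",                          # Trøndersk
    "6": "V", "7": "V", "8": "V", "9": "V", "10": "V",     # West Norwegian
    "B": "B",                                              # Bokmål
}


def dialect_of(sent_id):
    """Return the dialect label implied by an id such as '33/8'."""
    _sentence_id, translator_id = sent_id.split("/")
    return TRANSLATOR_TO_DIALECT[translator_id]
```

For the example sentence, `dialect_of("33/8")` gives `V`, which matches its `dialect` field.
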
The test set is expected to be released on 4 November 2024. It will contain only the tokenized and untokenized prompt strings.