# multigec.yaml
name:
  swe: MultiGEC
  eng: MultiGEC
short_description:
  swe: 'MultiGEC är en datamängd för Grammatical Error Correction (uppgift inom NLP) och innehåller parallella data för 10 språk och 15 delkorpusar (ytterligare 2 språk kan fås på begäran). Varje delkorpus består av två eller fler varianter av samma texter (oftast uppsatser som skrivs av språkinlärare), där en version (orig) har skrivits av en författare (elev, student, etc.) och de andra versionerna (ref1, ref2, ...) är korrigerade versioner av samma text. Språk som ingår: tjeckiska, estniska, tyska, grekiska, isländska, italienska, lettiska, slovenska, svenska och ukrainska (engelska och ryska kan fås på begäran). Texter kommer från olika ursprungskorpusar, men har genomgått omformatering för att ha ett gemensamt format.'
  eng: 'MultiGEC is a dataset for the Grammatical Error Correction task, containing parallel data for 10 languages in 15 subcorpora (two more languages can be obtained on request). Each subcorpus contains two or more parallel versions of each text (typically full learner essays), where one version (orig) is the text as the author originally wrote it, and the others (ref1, ref2, ...) are corrected versions of the same text. Languages included: Czech, Estonian, German, Greek, Icelandic, Italian, Latvian, Slovene, Swedish and Ukrainian (English and Russian are available on request). The texts come from different original corpora, but have been converted to a unified format.'
type: corpus
trainingdata: true
unlisted: false
successors: []
collection: false
resources:
- cs-natform
- cs-natwebinf
- cs-romani
- cs-seclearn
- de-merlin
- el-glcii
- et-eic
- et-ekil2
- is-IceEC
- is-IceL2EC
- it-merlin
- lv-lava
- sl-solar_eval
- sv-swell_gold
- uk-ua_gec
language_codes:
- ces
- deu
- ell
- est
- isl
- ita
- lav
- slv
- swe
- ukr
downloads: []
interface:
- access: https://lt3.ugent.be/resources/multigec-2025-shared-task/
  licence: subject to Terms of Use
  restriction: attribution, no-redistribution, no use with proprietary models, no commercial use, personal access
contact_info:
  name: Elena Volodina | Orphée DeClercq
  email: [email protected]
  affiliation:
    organisation: Språkbanken Text | Ghent University
    email: [email protected]
annotation:
  swe: ''
  eng: Texts are manually normalized (i.e. corrected to produce a new, corrected version). No additional annotation has been performed or preserved from the source corpora. For three languages (Icelandic, Russian, German), tokenization is available in the first version, but it will be removed in future releases.
keywords:
- grammatical error correction
- language learning
- essays
- multilinguality
caveats:
  swe: ''
  eng: 'The data is relatively homogeneous, consisting of full-text second language essays and their corrections. However, for some languages, native or heterogeneous data is used, and for certain languages the data contains text fragments rather than full-text essays. Details on these aspects are provided on a dedicated webpage: https://spraakbanken.gu.se/en/compsla/multigec-dataset'
other_references:
- '[INFO]: https://spraakbanken.gu.se/en/compsla/multigec-dataset'
- '[HOW TO CITE 1]: {publication to_be_added}'
- '[HOW TO CITE 2]: {publication to_be_added}'
intended_uses:
  swe: ''
  eng: Grammatical Error Correction, (Second) Language Acquisition studies, Learner Corpus Research, Noisy User-produced Data
description:
  swe: ''
  eng: |-
    <p>MultiGEC dataset description</p>
    <p>There are three subcorpora in the SweLL-pilot collection:</p>
    <ul>
    <li></li>
    <li></li>
    <li></li>
    </ul>
    <h2>Links</h2>
    <ul>
    <li><a href=""></a></li>
    <li><a href=""></a></li>
    </ul>
doi: