The Computational SLA working group invites you to participate in the first shared task on Multilingual Grammatical Error Detection, MultiGED, covering five languages: Czech, English, German, Italian and Swedish.
The results will be presented on 22 May 2023 at the NLP4CALL workshop, co-located with the NODALIDA conference in the Faroe Islands.
To register, fill in this form.
In this shared task, your goal is to detect tokens in need of correction, labeling each token as either correct ("c") or incorrect ("i"), i.e. performing binary classification at the token level, as shown in the example below (a minimal code sketch of such a labelling function follows the example).
Token ID | Token | Label |
---|---|---|
1 | I | c |
2 | saws | i |
3 | the | c |
4 | show | c |
5 | 's | c |
6 | advertisement | c |
7 | hanging | c |
8 | up | c |
9 | of | i |
10 | a | c |
11 | wall | c |
12 | in | c |
13 | London | c |
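To make the expected input and output concrete, here is a minimal Python sketch of such a token-labelling function. It is purely illustrative: `detect_errors` is a placeholder name, and its hard-coded word list simply reproduces the example above rather than implementing a real detection model.

```python
# Illustrative only: a system is essentially a function from a tokenised
# sentence to a sequence of binary labels ("c" = correct, "i" = incorrect).
def detect_errors(tokens):
    suspicious = {"saws", "of"}          # stand-in for a real model's decisions
    return ["i" if tok in suspicious else "c" for tok in tokens]

tokens = ["I", "saws", "the", "show", "'s", "advertisement",
          "hanging", "up", "of", "a", "wall", "in", "London"]
print(list(zip(tokens, detect_errors(tokens))))
```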
There is one track only, although results will be collected and compared separately for each language. When submitting your results, please use the following ISO 639-1 language prefixes to indicate which language your system was developed for and should be tested on:
- cs- for Czech
- en- for English
- de- for German
- it- for Italian
- sv- for Swedish
It is possible, but not mandatory, to develop systems for all of the included languages; you are welcome to participate with any subset.
For each language we use a corpus of second language learner essays as the source. For datasets that have previously been used for similar tasks (GED/GEC), we follow the established data splits (e.g. Czech, English-FCE, German).
Two new datasets are introduced in this shared task: one for Swedish (based on the SweLL-gold corpus) and one for English (based on the REALEC corpus).
The data is organized in tab-separated columns, with one token per line and an empty line separating sentences. The first column contains the token and the second column contains the grammatical error label (c or i). Note that there are no column headers.
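For convenience, here is a minimal Python sketch for reading files in this format into lists of (token, label) pairs. It is our illustration, not an official loader; the function name and the example file path are placeholders.

```python
def read_tsv(path):
    """Read a MultiGED-style file: token<TAB>label, blank line between sentences."""
    sentences, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:                      # blank line ends the sentence
                if current:
                    sentences.append(current)
                    current = []
            else:
                token, label = line.split("\t")   # exactly two columns assumed
                current.append((token, label))
    if current:                               # file may not end with a blank line
        sentences.append(current)
    return sentences

# e.g. sentences = read_tsv("sv-train.tsv")   # placeholder path
```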
You are welcome to use the provided datasets across languages and to combine them in any way you see fit.
You can download the datasets from the links below. The training and development data are also available in this repository, so the easiest option is to clone the whole repository.
| Language | Corpus name | Corpus license | MultiGED license |
|---|---|---|---|
| Czech | GECCC | | |
| English | FCE | custom | |
| | REALEC | | |
| German | Falko | CC BY-SA 4.0 | |
| | MERLIN | CC BY 3.0 | |
| Italian | MERLIN | CC BY 3.0 | |
| Swedish | SweLL-gold | | |
| Source corpus | Split | Proportion | Nr sentences | Nr tokens | Nr errors | Error rate |
|---|---|---|---|---|---|---|
| GECCC | Total | 1.0 | 35,443 | 365,033 | 88,148 | 0.242 |
| | cs-train.tsv | 0.758 | 29,786 | 303,703 | 70,993 | 0.234 |
| | cs-dev.tsv | 0.073 | 2,779 | 29,221 | 8,301 | 0.284 |
| | cs-test.tsv | 0.080 | 2,878 | 32,109 | 8,854 | 0.276 |
The GECCC corpus is described in Náplava et al. (2022).
There are two corpora for English: FCE and REALEC.
| Source corpus | Split | Proportion | Nr sentences | Nr tokens | Nr errors | Error rate |
|---|---|---|---|---|---|---|
| FCE | Total | 1.0 | 33,236 | 561,416 | 53,607 | 0.101 |
| | en-train-fce.tsv | 0.805 | 28,350 | 454,736 | 45,183 | 0.099 |
| | en-dev-fce.tsv | 0.062 | 2,191 | 34,748 | 3,678 | 0.106 |
| | en-test-fce.tsv | 0.074 | 2,695 | 41,932 | 4,746 | 0.113 |
The FCE Corpus was released in 2011 along with this ACL publication by Yannakoudakis et al. It contains essays by candidates for the First Certificate in English (FCE) exam (now "B2 First") designed by Cambridge English to certify learners of English at CEFR level B2. It is part of the larger Cambridge Learner Corpus and was annotated for grammatical errors as described in Nicholls (2003). The FCE Corpus has been used in grammatical error detection (and correction) experiments on numerous occasions, as summarised here.
| Source corpus | Split | Proportion | Nr sentences | Nr tokens | Nr errors | Error rate |
|---|---|---|---|---|---|---|
| REALEC | Total | 1.0 | 200,432 | 4,418,696 | 290,289 | 0.069 |
| | en-train-realec.tsv | 0.764 | 160,594 | 3,377,609 | 232,290 | 0.069 |
| | en-dev-realec.tsv | 0.095 | 19,883 | 419,161 | 28,995 | 0.069 |
| | en-test-realec.tsv | 0.095 | 19,955 | 421,495 | 29,004 | 0.069 |
The REALEC corpus (Russian Error-Annotated Learner English Corpus) contains essays written in English by university students, manually annotated for errors.
| Source corpus | Split | Proportion | Nr sentences | Nr tokens | Nr errors | Error rate |
|---|---|---|---|---|---|---|
| Falko-Merlin | Total | 1.0 | 24,077 | 381,134 | 64,054 | 0.168 |
| | de-train.tsv | 0.800 | 19,237 | 305,113 | 51,143 | 0.168 |
| | de-dev.tsv | 0.104 | 2,503 | 39,444 | 6,709 | 0.170 |
| | de-test.tsv | 0.096 | 2,337 | 36,577 | 6,202 | 0.170 |
The dataset split used here is described in Boyd (2018) and has been published here.
Further information on Merlin can be found in Boyd et al. (2014) and on the Merlin platform. For Falko, please refer to the Falko website here.
Note that tokens transcribed as entirely "-unreadable-" in the Merlin corpus have been omitted (partially unreadable tokens, such as "ver-unreadable-", have been retained).
| Source corpus | Split | Proportion | Nr sentences | Nr tokens | Nr errors | Error rate |
|---|---|---|---|---|---|---|
| Merlin | Total | 1.0 | 8,061 | 99,693 | 13,251 | 0.133 |
| | it-train.tsv | 0.789 | 6,393 | 78,638 | 10,376 | 0.132 |
| | it-dev.tsv | 0.110 | 2,503 | 11,038 | 1,429 | 0.130 |
| | it-test.tsv | 0.101 | 2,337 | 10,017 | 1,446 | 0.144 |
The Merlin corpus is described in Boyd et al. (2014). Further documentation can be found on the Merlin platform.
Note that tokens transcribed as entirely "-unreadable-" in the Merlin corpus have been omitted (partially unreadable tokens, such as "lem-unreadable-", have been retained).
| Source corpus | Split | Proportion | Nr sentences | Nr tokens | Nr errors | Error rate |
|---|---|---|---|---|---|---|
| SweLL-gold | Total | 1.0 | | | | |
| | sv-train.tsv | | | | | |
| | sv-dev.tsv | | | | | |
| | sv-test.tsv | | | | | |
The SweLL-gold corpus is described in Volodina et al. (2019) with statistics and metadata provided here.
You may use any external data of your choice, provided it is publicly available for research purposes; please describe it in your system description and, where possible, share it with the community for potential replication studies.
You can also prepare your own synthetic datasets, which we likewise encourage you to share with the community.
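As an illustration only (not an official or recommended procedure), one simple way to create synthetic training data is to corrupt sentences that are assumed to be correct and label the corrupted tokens "i". The sketch below uses a placeholder function name and a naive character-deletion corruption; real systems typically use richer corruption strategies (confusion sets, round-trip translation, etc.).

```python
import random

def corrupt(tokens, p=0.1, seed=0):
    """Randomly delete one character from some tokens and label them "i"."""
    rng = random.Random(seed)
    out = []
    for tok in tokens:
        if len(tok) > 3 and rng.random() < p:
            i = rng.randrange(len(tok) - 1)
            out.append((tok[:i] + tok[i + 1:], "i"))   # corrupted token
        else:
            out.append((tok, "c"))
    return out

print(corrupt("I saw the advertisement hanging on a wall".split(), p=0.5))
```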
You are also welcome to report any errors or problems with the datasets via GitHub issues or pull requests; if the data is updated as a result, the new version will be announced on the Google group.
Systems will be evaluated on a hidden test set. We will release the test files for each language to registered participants by the end of February 2023. Please fill in this form to register for participation.
Evaluation will be carried out in terms of token-based Precision, Recall and F0.5, to be consistent with previous work on error detection (Bell et al., 2019; Kaneko and Komachi, 2019; Yuan et al., 2021).
Example:
Token ID | Token | Reference | Hypothesis | Meaning |
---|---|---|---|---|
1 | I | c | c | |
2 | saws | i | i | True Positive |
3 | the | c | c | |
4 | show | c | i | False Positive |
5 | last | c | c | |
6 | nigt | i | c | False Negative |
7 | . | c | c |
F0.5 is used instead of F1 because false positives (correct tokens flagged as errors) are judged more harshly by human users than false negatives (missed errors), and so precision is weighted more heavily than recall.
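For reference, the following Python sketch computes token-based Precision, Recall and F0.5 and reproduces the worked example above (one true positive, one false positive, one false negative). It is an illustrative re-implementation, not the official evaluation script described below.

```python
def score(reference, hypothesis, beta=0.5):
    """Token-level precision, recall and F-beta for binary c/i labels."""
    tp = sum(r == "i" and h == "i" for r, h in zip(reference, hypothesis))
    fp = sum(r == "c" and h == "i" for r, h in zip(reference, hypothesis))
    fn = sum(r == "i" and h == "c" for r, h in zip(reference, hypothesis))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = ((1 + beta**2) * precision * recall / (beta**2 * precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

ref = ["c", "i", "c", "c", "c", "i", "c"]
hyp = ["c", "i", "c", "i", "c", "c", "c"]
print(score(ref, hyp))   # (0.5, 0.5, 0.5)
```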
The provided `eval.py` script calculates system performance from CoNLL-format (tab-separated) Hypothesis and Reference files. This script is used to evaluate any output submitted to CodaLab (development/test data), but you can also download it and use it independently. It is run from the command line as follows:

`python3 eval.py -hyp <hyp_tsv> -ref <ref_tsv>`

It is assumed that the `hyp_tsv` file is in the same format as the corresponding `ref_tsv` file provided in this shared task. The script processes a single language at a time, so you will need to call it once per language when evaluating multiple languages.
All system submissions will be administered through CodaLab.
Further instructions will follow.
We encourage you to submit a paper with your system description to the NLP4CALL workshop special track. All papers will be reviewed by the organizing committee.
Accepted papers will be published in the workshop proceedings through the NEALT Proceedings Series and will also appear in the ACL Anthology.
Further instructions on this will follow.
- ≈ mid-January 2023 - first call for participation. Training data released, CodaLab opens for team registrations.
- ≈ mid-February 2023 - validation server released online
- ≈ end of February 2023 - test data released
- ≈ 5 March 2023 - system submission deadline (system output, models, code, fact sheets, extra data, etc.)
- ≈ mid-March 2023 - results announced
- ≈ 8-10 April 2023 - paper submission deadline (system descriptions)
- ≈ end April 2023 - paper reviews sent to the authors
- ≈ 10 May 2023 - camera-ready deadline
- 22 May 2023 - presentations of the systems at the NLP4CALL workshop
Teams that intend to participate are requested to fill in this form.
- Elena Volodina, University of Gothenburg, Sweden
- Chris Bryant, University of Cambridge, UK
- Andrew Caines, University of Cambridge, UK
- Orphee De Clercq, Ghent University, Belgium
- Jennifer-Carmen Frey, EURAC Research, Italy
- Elizaveta Ershova, JetBrains, Cyprus
- Alexandr Rosen, Charles University, Czech Republic
- Olga Vinogradova, Israel
Mail your questions to: multiged-2023@googlegroups.com
This is a Google group which you can join to ask questions, hold discussions and browse previously answered questions.
- Elena Volodina, Lena Granstedt, Arild Matsson, Beáta Megyesi, Ildikó Pilán, Julia Prentice, Dan Rosén, Lisa Rudebeck, Carl-Johan Schenström, Gunlög Sundberg and Mats Wirén (2019). The SweLL Language Learner Corpus: From Design to Annotation. Northern European Journal of Language Technology, Special Issue.
- Adriane Boyd, Jirka Hana, Lionel Nicolas, Detmar Meurers, Katrin Wisniewski, Andrea Abel, Karin Schöne, Barbora Štindlová, and Chiara Vettori (2014). The MERLIN corpus: Learner language and the CEFR. Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pp. 1281-1288.
- Adriane Boyd (2018). Using Wikipedia edits in low resource grammatical error correction. Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text, pp. 79-84.
- Helen Yannakoudakis, Ted Briscoe, and Ben Medlock (2011). A New Dataset and Method for Automatically Grading ESOL Texts. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 180-189.