This repository contains the scripts and models needed to replicate the results of our state-of-the-art (as of 2017) grammatical error correction (GEC) system. The system is primarily based on the statistical machine translation (SMT) framework, augmented with neural and character-level translation components.
If you use the models from this work, please cite this paper:
@InProceedings{chollampatt2017gec,
author = {Chollampatt, Shamil and Ng, Hwee Tou},
title = {Connecting the Dots: Towards Human-level Grammatical Error Correction},
booktitle = {Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications},
month = {September},
year = {2017},
address = {Copenhagen, Denmark},
publisher = {Association for Computational Linguistics}
}
- cmph - for building compact phrase tables
- nplm - for the neural network joint model (NNJM) feature in Moses
- Moses (Release 3.0)
  - Execute `./download.sh` in the `software/` directory to download the specific revision of Moses used in this work.
  - Moses needs to be compiled with these flags (a compilation sketch is given after this list): `--with-cmph=/path/to/cmph-2.0 --with-nplm=/path/to/nplm --max-kenlm-order=9`
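For reference, a minimal compilation sketch using Moses's standard `bjam` build is given below. The cmph/NPLM install paths, the `software/moses` location, and the `-j8` parallelism setting are placeholder assumptions, not values fixed by this repository.

```bash
# Build Moses 3.0 with compact phrase-table (cmph) support, NPLM support,
# and KenLM compiled for up to 9-gram language models.
cd software/moses
./bjam --with-cmph=/path/to/cmph-2.0 \
       --with-nplm=/path/to/nplm \
       --max-kenlm-order=9 \
       -j8
```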
Go to the `models/` directory and run `./download.sh` (a minimal download sketch is given after the list below) to download the following model files:
- Language Models (`lm/`)

| Model | Filename | Size | Description |
|---|---|---|---|
| Train LM | concat-train.trie | 128 MB | 5-gram LM trained using the target side of the training data |
| Wiki LM | wiki.trie | 6 GB | 5-gram LM trained using an English Wikipedia dump (1.78B tokens) |
| 94B CCLM | 94Bcclm.trie | 153 GB | 5-gram LM trained using a subset of the CommonCrawl corpus (94B tokens) |
| 94B CCLM (filtered) | 94Bcclm.filtered.trie | 34 GB | 94B CCLM filtered according to the phrase table and the CoNLL-2014 and development sets (for faster loading) |
| Character LM | charLM.bin | 1 MB | Character-level LM for the character-level SMT component |
- Phrase Tables (`pt/`)

| Model | Filename | Size | Description |
|---|---|---|---|
| WordSMT-GEC PT | phrase-table.minphr | 851 MB | Phrase table used by the word-level SMT component |
| CharSMT-GEC PT | phrase-table.char.gz | 12 MB | Phrase table used by the character-level SMT component |
- Neural Network Joint Model (NNJM) (`nnjm/`)

| Model | Filename | Size | Description |
|---|---|---|---|
| Adapted NNJM | nnjm_adapted.weights | 268 MB | Model file with trained weights in NPLM format (compatible with Moses) |
| Vocab | output.vocab, source.vocab, target.vocab | 370 KB | Vocabulary files |
- Operation Sequence Model (OSM) (`osm/`)

| Model | Filename | Size | Description |
|---|---|---|---|
| Binarized OSM | operationLM.bim | 323 MB | Binarized operation sequence model |
- Word Class Language Model (WCLM) (`wclm/`)

| Model | Filename | Size | Description |
|---|---|---|---|
| Wiki Word Class LM | wiki.classes.kenlm | 13 GB | 9-gram word class LM |
| Word2Class Mapping | wiki.classes.gen.gz | 7.6 MB | Word-to-class mapping used by the WCLM |
- Sparse Weights (`sparse/`)

| Model | Filename | Size | Description |
|---|---|---|---|
| Sparse Weights | word.moses.ini.sparse | 5.6 MB | Sparse feature weights with 1-word context |
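The download step referenced above is a single command; a minimal sketch is shown below. Note that some of the models are very large (the full 94B CCLM alone is 153 GB), so make sure enough disk space is available.

```bash
# Download all pretrained model files listed above.
cd models/
./download.sh
```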
To run the system on any test set, execute the following script on the tokenized input file:

./run.sh /path/to/input/file /path/to/output/directory
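Since `run.sh` expects tokenized input, a typical end-to-end invocation looks like the sketch below. The use of the Moses tokenizer, the `software/moses` path, and the file names are assumptions for illustration; any tokenizer producing Moses-style tokens can be used.

```bash
# Tokenize the raw test sentences (example using the standard Moses tokenizer;
# the path assumes Moses was downloaded into software/moses/).
perl software/moses/scripts/tokenizer/tokenizer.perl -l en \
    < test.raw.txt > test.tok.txt

# Run the correction pipeline; the corrected output is written to output/.
./run.sh test.tok.txt output/
```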
The code and models in this repository are licensed under the GNU General Public License Version 3.
For commercial use of the code and models, a separate commercial license is available. Please contact:
- Shamil Chollampatt ([email protected])
- Hwee Tou Ng ([email protected])