# RoGPT2: Romanian GPT2 for text generation


This is the Romanian language version of the GPT2 model. There are three trained versions (base, medium, and large), all available on the HuggingFace Hub.
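
As a quick-start illustration, the sketch below loads one of the checkpoints with the HuggingFace `transformers` library and generates a continuation. The Hub id `readerbench/RoGPT2-base` and the prompt are assumptions for illustration; if a checkpoint ships only TensorFlow weights, passing `from_tf=True` to `from_pretrained` may be needed.

```python
# Minimal sketch: load a RoGPT2 checkpoint from the HuggingFace Hub and
# generate text. The Hub id below is an assumption; swap in the medium
# or large checkpoint as needed.
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "readerbench/RoGPT2-base"  # assumed Hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Encode a Romanian prompt and continue it with greedy decoding.
inputs = tokenizer("București este", return_tensors="pt")
outputs = model.generate(**inputs, max_length=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```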

## Training


### Corpus Statistics

| Corpus | Total size | Number of words | Number of sentences |
|---|---|---|---|
| OSCAR | 11.54 GB | 1745M | 48.46M |
| Wiki-Ro | 0.46 GB | 68M | 1.79M |
| Debates | 0.5 GB | 73M | 3.61M |
| Books | 4.37 GB | 667M | 37.39M |
| News | 0.15 GB | 23M | 0.77M |

### Training Statistics

| Version | Number of parameters | Number of epochs | Duration of an epoch | Context size | Batch size | PPL |
|---|---|---|---|---|---|---|
| Base | 124M | 15 | 7h | 1024 | 72 | 22.96 |
| Medium | 354M | 10 | 22h | 1024 | 24 | 17.64 |
| Large | 774M | 5 | 45h | 512 | 16 | 16.77 |

## Install Dependencies

```bash
# Create and activate an isolated Python environment
python3 -m venv env
source env/bin/activate

# Install the Python dependencies
pip install -r requirements.txt

# Download and unpack the dataset archive
wget https://nextcloud.readerbench.com/index.php/s/2jasc6H79F4ANkD/download -O dataset.zip
unzip dataset.zip
rm -f dataset.zip

# Download and unpack the model archive
wget https://nextcloud.readerbench.com/index.php/s/94EKKmTCt9CjTXf/download -O model.zip
unzip model.zip
rm -f model.zip
```

- The training corpus can be found at the link.
- The datasets for evaluation can be found at the link.
- The downstream models can be found at the link.

## Evaluation


**Dialect identification and cross-dialect classification**

| Model | Dialect | Md to Ro | Ro to Md |
|---|---|---|---|
| KRR + SK | 94.06 | 67.59 | 75.47 |
| BERT-base-ro | 95.98 | 69.90 | 78.08 |
| RoBERT-small | 95.76 | 69.05 | 80.15 |
| RoBERT-base | 97.24 | 68.80 | 82.37 |
| RoBERT-large | 97.21 | 69.50 | 83.26 |
| RoGPT2-base | 96.69 | 69.82 | 77.55 |
| RoGPT2-medium | 96.42 | 69.77 | 80.51 |
| RoGPT2-large | 96.93 | 71.07 | 82.56 |

**Sentiment classification (binary and multi-class)**

| Model | Binary: Accuracy | Binary: F1-Score | Multi-Class: Accuracy | Multi-Class: F1-Score |
|---|---|---|---|---|
| BERT-base-ro | 98.07 | 97.94 | - | 79.61 |
| RoDiBERT | 98.40 | 98.31 | - | 83.01 |
| RoBERT-small | 97.44 | 97.43 | 89.30 | 84.23 |
| RoBERT-base | 98.27 | 98.26 | 90.59 | 86.27 |
| RoBERT-large | 98.20 | 98.19 | 90.93 | 86.63 |
| RoGPT2-base | 97.89 | 97.88 | 89.65 | 84.68 |
| RoGPT2-medium | 98.03 | 98.04 | 90.29 | 85.37 |
| RoGPT2-large | 98.06 | 98.07 | 90.26 | 84.89 |

**Semantic textual similarity (Spearman and Pearson correlations)**

| Model | Spearman dev-set | Spearman test-set | Pearson dev-set | Pearson test-set |
|---|---|---|---|---|
| BERT-base-ro | 84.26 | 80.86 | 84.59 | 81.59 |
| RoDiBERT | 77.07 | 71.47 | 77.13 | 72.25 |
| RoBERT-small | 82.06 | 78.06 | 81.66 | 78.49 |
| RoBERT-base | 84.93 | 80.39 | 85.03 | 80.39 |
| RoBERT-large | 86.25 | 83.15 | 86.58 | 83.76 |
| RoGPT2-base | 83.51 | 79.77 | 83.74 | 80.56 |
| RoGPT2-medium | 85.75 | 82.25 | 86.04 | 83.16 |
| RoGPT2-large | 85.70 | 82.64 | 86.14 | 83.46 |

**Machine translation (BLEU)**

| Model | Decoder method | Ro-En | En-Ro |
|---|---|---|---|
| mBART | - | 38.5 | 38.5 |
| OpenNMT | - | - | 24.7 |
| RoGPT2-base | Greedy | 30.37 | 20.27 |
| RoGPT2-base | Beam-search-4 | 31.26 | 22.31 |
| RoGPT2-base | Beam-search-8 | 31.39 | 22.95 |
| RoGPT2-medium | Greedy | 32.48 | 22.18 |
| RoGPT2-medium | Beam-search-4 | 34.08 | 24.03 |
| RoGPT2-medium | Beam-search-8 | 34.16 | 24.13 |
| RoGPT2-large | Greedy | 33.69 | 23.31 |
| RoGPT2-large | Beam-search-4 | 34.40 | 24.23 |
| RoGPT2-large | Beam-search-8 | 34.51 | 24.32 |
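
The Greedy and Beam-search-N rows above differ only in the decoding strategy. As a sketch of how the two are selected with the `transformers` generate API (the model id and the prompt are illustrative assumptions, not the exact downstream setup used for these results):

```python
# Sketch: greedy vs. beam-search decoding via transformers' generate API.
# The model id and prompt are illustrative assumptions.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("readerbench/RoGPT2-base")
model = AutoModelForCausalLM.from_pretrained("readerbench/RoGPT2-base")
inputs = tokenizer("Vremea de mâine va fi", return_tensors="pt")

# Greedy: take the most likely token at every step (the "Greedy" rows).
greedy_out = model.generate(**inputs, max_new_tokens=64)

# Beam search with 4 beams (the "Beam-search-4" rows); num_beams=8
# gives the "Beam-search-8" setting.
beam_out = model.generate(**inputs, max_new_tokens=64, num_beams=4,
                          early_stopping=True)

print(tokenizer.decode(beam_out[0], skip_special_tokens=True))
```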

**Question answering (exact match and F1)**

| Model | Decoder method | EM | F1-Score |
|---|---|---|---|
| BERT-base-ro | - | 47.89 | 63.74 |
| RoDiBERT | - | 21.76 | 34.57 |
| RoBERT-small | - | 30.84 | 45.17 |
| RoBERT-base | - | 53.52 | 70.04 |
| RoBERT-large | - | 55.46 | 69.64 |
| mBERT | - | 59.9 | 72.7 |
| XLM-R Large | - | 69.7 | 83.6 |
| RoGPT2-base | Greedy | 23.69 | 35.97 |
| RoGPT2-base | Beam-search-4 | 24.11 | 35.27 |
| RoGPT2-medium | Greedy | 29.66 | 44.74 |
| RoGPT2-medium | Beam-search-4 | 31.59 | 45.32 |
| RoGPT2-large | Greedy | 29.74 | 42.98 |
| RoGPT2-large | Beam-search-4 | 29.66 | 43.05 |
| RoGPT2-base-en-ro | Greedy | 23.86 | 34.27 |
| RoGPT2-base-en-ro | Beam-search-4 | 25.04 | 34.51 |
| RoGPT2-medium-en-ro | Greedy | 27.05 | 39.75 |
| RoGPT2-medium-en-ro | Beam-search-4 | 27.64 | 39.11 |
| RoGPT2-large-en-ro | Greedy | 28.40 | 39.79 |
| RoGPT2-large-en-ro | Beam-search-4 | 28.73 | 39.71 |
| RoGPT2-large-en-ro-mask | Greedy | 31.34 | 44.71 |
| RoGPT2-large-en-ro-mask | Beam-search-4 | 31.59 | 43.53 |

**Language modeling (perplexity; lower is better)**

| Model | PPL dev | PPL test |
|---|---|---|
| BERT-base-ro | 29.0897 | 28.0043 |
| RoGPT2-base | 34.3795 | 33.7460 |
| RoGPT2-medium | 23.7879 | 23.4581 |
| RoGPT2-large | 21.7491 | 21.5200 |
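
Perplexity here is the standard exponential of the mean token-level cross-entropy on held-out text. A minimal sketch of that computation follows; the model id and the evaluation sentence are assumptions for illustration:

```python
# Minimal perplexity sketch: PPL = exp(mean cross-entropy per token).
# Model id and evaluation text are illustrative assumptions.
import math
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("readerbench/RoGPT2-base")
model = AutoModelForCausalLM.from_pretrained("readerbench/RoGPT2-base")
model.eval()

text = "Exemplu de text românesc pentru evaluare."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # With labels set, the model returns the mean cross-entropy loss
    # over all predicted tokens.
    loss = model(**inputs, labels=inputs["input_ids"]).loss

print(f"perplexity = {math.exp(loss.item()):.2f}")
```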

**Grammatical error correction (precision, recall, F0.5)**

| Model | Decoder method | P | R | F0.5 |
|---|---|---|---|---|
| Transformer-tiny | Beam-search | 53.53 | 26.36 | 44.38 |
| Transformer-base Finetuning | Beam-search | 56.05 | 46.19 | 53.76 |
| Transformer-base Finetuning | Beam-search-LM | 50.68 | 45.39 | 49.52 |
| Transformer-base Finetuning | Beam-search-norm-LM | 51.06 | 45.43 | 49.83 |
| RoGPT2-base | Greedy | 59.02 | 49.35 | 56.80 |
| RoGPT2-base | Beam-search-4 | 65.23 | 49.26 | 61.26 |
| RoGPT2-base | Beam-search-8 | 65.88 | 49.64 | 61.84 |
| RoGPT2-medium | Greedy | 69.97 | 57.94 | 67.18 |
| RoGPT2-medium | Beam-search-4 | 72.46 | 57.99 | 69.01 |
| RoGPT2-medium | Beam-search-8 | 72.24 | 57.69 | 68.77 |
| RoGPT2-large | Greedy | 61.90 | 49.09 | 58.83 |
| RoGPT2-large | Beam-search-4 | 65.24 | 49.43 | 61.32 |
| RoGPT2-large | Beam-search-8 | 64.96 | 49.22 | 61.06 |
| RoGPT2-base\* | Greedy | 68.67 | 49.60 | 63.77 |
| RoGPT2-base\* | Beam-search-4 | 71.16 | 50.53 | 65.79 |
| RoGPT2-base\* | Beam-search-8 | 71.68 | 50.65 | 66.18 |
| RoGPT2-medium\* | Greedy | 58.21 | 43.32 | 54.47 |
| RoGPT2-medium\* | Beam-search-4 | 68.31 | 43.78 | 61.43 |
| RoGPT2-medium\* | Beam-search-8 | 68.68 | 43.99 | 61.75 |
| RoGPT2-large\* | Greedy | 64.86 | 41.30 | 58.22 |
| RoGPT2-large\* | Beam-search-4 | 65.57 | 41.00 | 58.55 |
| RoGPT2-large\* | Beam-search-8 | 65.44 | 41.09 | 58.50 |

Note: models marked with \* were trained on a dataset of 3,000,000 artificially generated pairs.

## Practical Application


## Acknowledgments


Research supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC).

## How to cite


```bibtex
@inproceedings{niculescu2021rogpt2,
  title={RoGPT2: Romanian GPT2 for Text Generation},
  author={Niculescu, Mihai Alexandru and Ruseti, Stefan and Dascalu, Mihai},
  booktitle={2021 IEEE 33rd International Conference on Tools with Artificial Intelligence (ICTAI)},
  pages={1154--1161},
  year={2021},
  organization={IEEE}
}
```
