This software was developed for the CLEF 2022 Text Simplification task.
Our work builds on the transfer learning capabilities of the pre-trained T5 language model and adds a method to control specific simplification features. We present a new feature based on masked-token prediction (Language Model Fill-Mask) to control the lexical complexity of the generated text. The SARI scores we obtain are on par with those of previous sentence simplification work in other domains.
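To give an intuition for a fill-mask based feature, here is a minimal sketch and not the code used in this repository. It assumes the feature can be approximated as the fraction of words that a masked language model recovers within its top-k guesses. The names `predictability_score`, `stub_fill_mask` and the `<mask>` token are hypothetical. The predictor is assumed to return a list of dicts with a `token_str` key, like the output of a Hugging Face `pipeline("fill-mask")` call.

```python
from typing import Callable, Dict, List

# Assumed predictor shape: takes a sentence containing one mask token and
# returns candidate fillers, most likely first, each with a "token_str" key.
FillMask = Callable[[str], List[Dict[str, object]]]

MASK = "<mask>"  # assumption: the actual mask token depends on the model


def predictability_score(sentence: str, fill_mask: FillMask, top_k: int = 5) -> float:
    """Fraction of words the masked LM recovers within its top_k guesses."""
    words = sentence.split()
    if not words:
        return 0.0
    hits = 0
    for i, word in enumerate(words):
        # Replace the i-th word with the mask token and ask for fillers.
        masked = " ".join(words[:i] + [MASK] + words[i + 1:])
        candidates = fill_mask(masked)[:top_k]
        guesses = {str(c["token_str"]).strip().lower() for c in candidates}
        if word.strip(".,;:!?").lower() in guesses:
            hits += 1
    return hits / len(words)


def stub_fill_mask(masked: str) -> List[Dict[str, object]]:
    """Toy stand-in for a real model so the sketch runs offline."""
    position = masked.split().index(MASK)
    table = {0: ["the"], 1: ["cat", "dog"], 2: ["ran"]}
    return [{"token_str": w} for w in table.get(position, [])]


if __name__ == "__main__":
    print(predictability_score("The cat sat", stub_fill_mask))
```

With a real model, the stub could be swapped for something like `pipeline("fill-mask", model="roberta-base")` from the transformers library. Under this assumption, a higher score suggests more predictable, and therefore likely simpler, vocabulary.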
Steps to replicate the results:
- Clone this repository
- Install dependencies:
pip install -r requirements.txt
- Training:
Select hyperparameters in T5_train.py
python scripts/T5_train.py
- Optimization:
Select experiment_id, dataset and trials in optimization.py
python scripts/optimization.py
- Evaluation:
Select experiment_id and dataset in T5_evaluate.py
python scripts/T5_evaluate.py
The same steps apply to the larger version. Be careful with memory issues.
Download the dataset from https://simpletext-project.com/2022/clef/en/tasks. The raw data must be preprocessed with the notebook 1.Preprocessing dataset Task 3.ipynb.