This repository provides a general-purpose parallel dataset of complex-simpler sentence pairs for French sentence simplification: the Wikipedia-Vikidia Corpus (WiViCo). It results from a two-step automatic filtering method that mines register-diversified comparable corpora to extract complex-simpler pairs. To do so, we sequentially address the two primary conditions that a simplified sentence must satisfy to be considered valid:
- preservation of the original meaning, which we address with n:m-aware SBERT-based cosine similarities; and
- simplicity gain with respect to the source text, which we address with a text simplicity classification model.
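The two conditions above can be read as a sequential filter over candidate pairs. The following is only a minimal sketch of that idea, not the pipeline used to build WiViCo: `embed` and `is_simpler` are hypothetical stand-ins for an SBERT encoder and a simplicity classifier, the `0.7` threshold is an arbitrary illustrative value, and n:m alignment is not modelled.

```python
import math
from typing import Callable, Iterable, List, Sequence, Tuple

Vector = Sequence[float]
Pair = Tuple[str, str]


def cosine_similarity(u: Vector, v: Vector) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def two_step_filter(
    candidates: Iterable[Pair],
    embed: Callable[[str], Vector],          # hypothetical: sentence -> embedding
    is_simpler: Callable[[str, str], bool],  # hypothetical: (source, target) -> simplicity gain?
    min_similarity: float = 0.7,             # assumed threshold, for illustration only
) -> List[Pair]:
    """Keep pairs that pass the meaning check first, then the simplicity check."""
    kept = []
    for source, target in candidates:
        # Step 1: meaning preservation, approximated by embedding similarity.
        if cosine_similarity(embed(source), embed(target)) < min_similarity:
            continue
        # Step 2: simplicity gain of the target with respect to the source.
        if is_simpler(source, target):
            kept.append((source, target))
    return kept


# Toy stand-ins so the sketch runs without any model.
VOCAB = ["chat", "dort", "tapis", "pluie", "tombe"]


def toy_embed(sentence: str) -> List[int]:
    """Count occurrences of each VOCAB word (a crude bag-of-words vector)."""
    tokens = sentence.lower().split()
    return [tokens.count(word) for word in VOCAB]


def toy_is_simpler(source: str, target: str) -> bool:
    """Treat a shorter target as simpler (a deliberately naive proxy)."""
    return len(target) < len(source)


candidates = [
    ("le chat dort paisiblement sur le tapis", "le chat dort sur le tapis"),
    ("le chat dort sur le tapis", "la pluie tombe"),
    ("le chat dort", "le chat dort paisiblement sur le tapis"),
]
kept = two_step_filter(candidates, toy_embed, toy_is_simpler)
print(kept)
```

With these toy functions, only the first pair survives: the second fails the similarity check and the third passes it but shows no simplicity gain. Running the cheaper or stricter check first is a design choice; here the order simply mirrors the order of the two conditions.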
This repository currently contains two different versions:
- The `wivico_v.1` subfolder. It comprises the initial version of the dataset, in which we operationalized the aforementioned conditions using n:m-aware SBERT-based cosine similarities (as a proxy for meaning retention) and an FFNN-based simplicity gain classifier. It results from the experiments conducted in the following article:

      @inproceedings{ormaechea-2023-extracting-simplification-pairs,
        title = {Extracting sentence simplification pairs from French comparable corpora using a two-step filtering method},
        author = {Lucía Ormaechea and Nikos Tsourakis},
        booktitle = {Proceedings of the Swiss Text Analytics Conference 2023},
        month = {6},
        year = {2023},
        location = {Neuchâtel, Switzerland},
        publisher = {ACL},
        url = {https://archive-ouverte.unige.ch/unige:169798}
      }
- The `wivico_v.2` subfolder, which includes the newest version of WiViCo. The data still relies on SBERT-based cosine similarities to assess meaning preservation, but uses a finer-grained method to capture complex-simpler sentence pairs than the one used in the first version. It results from the experiments performed in the following paper:

      @inproceedings{ormaechea-2023-simple-simpler-beyond,
        title = {Simple, Simpler and Beyond: A Fine-Tuning BERT-Based Approach to Enhance Sentence Complexity Assessment for Text Simplification},
        author = {Lucía Ormaechea and Nikos Tsourakis and Didier Schwab and Pierrette Bouillon and Benjamin Lecouteux},
        booktitle = {Proceedings of the 6th International Conference on Natural Language and Speech Processing (ICNLSP)},
        month = {12},
        year = {2023},
        location = {Trento, Italy},
        publisher = {ACL},
        url = {To appear},
      }
Contact person: Lucía Ormaechea, [email protected]
If you have further questions, don't hesitate to send us an email.