WiViCo | Wikipedia Vikidia Corpus

A general-purpose parallel sentence simplification dataset for French

General Presentation & Repo Structure:

This repository provides WiViCo (Wikipedia-Vikidia Corpus), a general-purpose parallel dataset of complex-simpler sentence pairs for French sentence simplification. It results from a two-step automatic filtering method that mines register-diversified comparable corpora to extract complex-simpler pairs. The method sequentially addresses the two primary conditions that a simplified sentence must satisfy to be considered valid:

preservation of the original meaning, which we address using n:m-aware SBERT-based cosine similarities; and
simplicity gain with respect to the source text, which we address with a text simplicity classification model.
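
The two conditions above can be illustrated with a minimal sketch. It is not the code used to build WiViCo: the function names, the thresholds (`sim_threshold`, `min_gain`) and the toy embedding and simplicity scorers are all assumptions made for illustration, and the n:m alignment step is left out. In practice, `embed` would be an SBERT encoder and `simplicity` a trained classifier.

```python
import math
from typing import Callable, List, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_pairs(
    candidates: List[Tuple[str, str]],
    embed: Callable[[str], Sequence[float]],
    simplicity: Callable[[str], float],
    sim_threshold: float = 0.8,  # hypothetical value, not from the papers
    min_gain: float = 0.0,       # hypothetical value, not from the papers
) -> List[Tuple[str, str, float, float]]:
    """Keep (complex, simple) pairs that pass both filtering steps."""
    kept = []
    for complex_sent, simple_sent in candidates:
        # Step 1: meaning preservation, approximated by cosine similarity.
        sim = cosine_similarity(embed(complex_sent), embed(simple_sent))
        if sim < sim_threshold:
            continue
        # Step 2: simplicity gain of the candidate over the source sentence.
        gain = simplicity(simple_sent) - simplicity(complex_sent)
        if gain > min_gain:
            kept.append((complex_sent, simple_sent, sim, gain))
    return kept


# Toy stand-ins so the sketch runs without any model (illustration only).
VOCAB = ["le", "chat", "félin", "dort", "sur", "canapé", "pleut", "paris"]


def toy_embed(sentence: str) -> List[float]:
    """Bag-of-words counts over a tiny fixed vocabulary."""
    tokens = sentence.lower().split()
    return [float(tokens.count(word)) for word in VOCAB]


def toy_simplicity(sentence: str) -> float:
    """Shorter sentences score as simpler (a crude proxy)."""
    return -float(len(sentence.split()))


COMPLEX = "Le félin domestique dort profondément sur le canapé"
SIMPLE = "Le chat dort sur le canapé"
UNRELATED = "Il pleut à Paris"

kept_pairs = filter_pairs(
    [(COMPLEX, SIMPLE), (COMPLEX, UNRELATED), (SIMPLE, COMPLEX)],
    toy_embed,
    toy_simplicity,
)
```

With these toy scorers, the unrelated pair fails the similarity step and the reversed pair fails the simplicity-gain step, so only the first candidate is kept.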

This repository currently contains two different versions:

The wivico_v.1 subfolder comprises the initial version of the dataset, in which we operationalized the aforementioned conditions using n:m-aware SBERT-based cosine similarities (as a proxy for meaning retention) and an FFNN-based simplicity gain classifier. It results from the experiments conducted in the following article:

@inproceedings{ormaechea-2023-extracting-simplification-pairs,
    title = {Extracting sentence simplification pairs from French comparable corpora using a two-step filtering method},
    author = {Lucía Ormaechea and Nikos Tsourakis},
    booktitle = {Proceedings of the Swiss Text Analytics Conference 2023},
    month = {6},
    year = {2023},
    location = {Neuchâtel, Switzerland},
    publisher = {ACL},
    url = {https://archive-ouverte.unige.ch/unige:169798}
}

The wivico_v.2 subfolder includes the newest version of WiViCo. This version also relies on SBERT-based cosine similarities to assess meaning preservation, but uses a finer-grained method to capture complex-simpler sentence pairs than the one used in the first version. It results from the experiments performed in the following paper:

@inproceedings{ormaechea-2023-simple-simpler-beyond,
    title = {Simple, Simpler and Beyond: A Fine-Tuning BERT-Based Approach to Enhance Sentence Complexity Assessment for Text Simplification},
    author = {Lucía Ormaechea and Nikos Tsourakis and Didier Schwab and Pierrette Bouillon and Benjamin Lecouteux},
    booktitle = {Proceedings of the 6th International Conference on Natural Language and Speech Processing (ICNLSP)},
    month = {12},
    year = {2023},
    location = {Trento, Italy},
    publisher = {ACL},
    url = {To appear},
}

Authors

Contact person: Lucía Ormaechea, lucia.ormaecheagrijalba@unige.ch

If you have further questions, don't hesitate to send us an email.
