# WiViCo | Wikipedia Vikidia Corpus

A general-purpose parallel sentence simplification dataset for French

## General Presentation & Repo Structure

This repository provides a general-purpose complex-simpler parallel sentence simplification dataset for the French language: the Wikipedia-Vikidia Corpus (WiViCo). It results from a two-step automatic filtering method that mines register-diversified comparable corpora to extract complex-simpler sentence pairs. To do so, we sequentially address the two primary conditions that a simplified sentence must satisfy to be considered valid:

- preservation of the original meaning, which we address with n:m-aware SBERT-based cosine similarities (see the sketch below); and
- simplicity gain with respect to the source text, which we address with a text simplicity classification model.
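
As a rough illustration of the first condition, the sketch below scores the semantic closeness of a candidate Wikipedia/Vikidia pair with SBERT embeddings. It is a minimal 1:1 example: the actual pipeline works with n:m sentence alignments, and the model name and threshold here are placeholders rather than the settings used to build WiViCo.

```python
# Minimal sketch of the meaning-preservation check, assuming the
# sentence-transformers library and a multilingual SBERT checkpoint.
# The model and the 0.7 threshold are illustrative, not the WiViCo settings.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("distiluse-base-multilingual-cased-v1")

# A Wikipedia (complex) sentence and a candidate Vikidia (simpler) sentence.
complex_sent = (
    "La photosynthèse est le processus bioénergétique qui permet aux plantes "
    "de synthétiser de la matière organique à partir de la lumière."
)
simpler_sent = "La photosynthèse permet aux plantes de fabriquer leur nourriture grâce à la lumière."

embeddings = model.encode([complex_sent, simpler_sent], convert_to_tensor=True)
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()

# Keep the candidate pair only if the two sentences are close enough in meaning.
if similarity >= 0.7:
    print(f"Candidate pair retained (cosine similarity = {similarity:.2f})")
```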

The repository currently contains two versions of the dataset:

- The wivico_v.1 subfolder contains the initial version of the dataset, for which the aforementioned conditions were operationalized using n:m-aware SBERT-based cosine similarities (as a proxy for meaning retention) and an FFNN-based simplicity-gain classifier. It results from the experiments conducted in the following article:

    @inproceedings{ormaechea-2023-extracting-simplification-pairs,
        title = {Extracting sentence simplification pairs from French comparable corpora using a two-step filtering method},
        author = {Lucía Ormaechea and Nikos Tsourakis},
        booktitle = {Proceedings of the Swiss Text Analytics Conference 2023},
        month = {6},
        year = {2023},
        location = {Neuchâtel, Switzerland},
        publisher = {ACL},
        url = {https://archive-ouverte.unige.ch/unige:169798}
    }
- The wivico_v.2 subfolder contains the newest version of WiViCo. The data again relies on SBERT-based cosine similarities to assess meaning preservation, but uses a finer-grained method than the first version to capture complex-simpler sentence pairs (see the illustrative sketch after this list). It results from the experiments performed in the following paper:

    @inproceedings{ormaechea-2023-simple-simpler-beyond,
        title = {Simple, Simpler and Beyond: A Fine-Tuning BERT-Based Approach to Enhance Sentence Complexity Assessment for Text Simplification},
        author = {Lucía Ormaechea and Nikos Tsourakis and Didier Schwab and Pierrette Bouillon and Benjamin Lecouteux},
        booktitle = {Proceedings of the 6th International Conference on Natural Language and Speech Processing (ICNLSP)},
        month = {12},
        year = {2023},
        location = {Trento, Italy},
        publisher = {ACL},
        url = {To appear},
    }
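
To give a concrete idea of the simplicity-gain condition used in the second version, the sketch below scores sentence complexity with a fine-tuned BERT-style sequence classifier and keeps a pair only when the Vikidia side scores as simpler. The checkpoint path, the label layout, and the example sentences are illustrative assumptions; no such model is released with this repository.

```python
# Illustrative sketch of a BERT-based sentence complexity check, assuming a
# binary sequence classifier (label 1 = "complex") fine-tuned beforehand.
# The checkpoint path below is a placeholder, not an artifact of this repo.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "path/to/fine-tuned-complexity-classifier"  # hypothetical
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

def complexity_score(sentence: str) -> float:
    """Return the predicted probability that a sentence is complex."""
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()

wiki_score = complexity_score("Phrase tirée de Wikipédia.")
viki_score = complexity_score("Phrase tirée de Vikidia.")

# Keep the pair only if the simpler side actually lowers the complexity score.
if viki_score < wiki_score:
    print(f"Simplicity gain: {wiki_score - viki_score:.2f}")
```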

## Authors

Contact person: Lucía Ormaechea, [email protected]

If you have further questions, don't hesitate to send us an email.