Wikipedia Dump Dataset

We provide a script, fetch_data.sh, to download the pre-processed Wikipedia dump. Alternatively, you can manually download the file from this Google Drive link and unzip it to obtain the text corpus.
Note: If you have run eval_sim.sh from the root directory, the Wikipedia dump should already have been downloaded under this directory automatically, and you do not need to run fetch_data.sh or download the dataset manually.
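
If you go the manual route, the extraction step looks roughly like the sketch below. The archive name wiki_dump.zip is an assumption (use whatever file name the Google Drive download produces); fetch_data.sh performs the equivalent step for you.

```python
# Minimal sketch of unzipping a manually downloaded archive.
# "wiki_dump.zip" is a hypothetical name; fetch_data.sh automates this step.
import zipfile

with zipfile.ZipFile("wiki_dump.zip") as zf:
    zf.extractall(".")  # extracts the text corpus into this directory
```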

Dataset Description

The dataset is retrieved from the Wikipedia database dump from May 2019. The corpus is pre-processed with the Stanford CoreNLP toolkit for sentence tokenization. Each line of the text file contains one Wikipedia paragraph. The zipped file (to be downloaded) is ~4 GB; after unzipping, the corpus size is ~13 GB.
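
As a quick sanity check, the corpus can be streamed line by line. The sketch below assumes the unzipped text file is named wiki.txt, which is a hypothetical name; substitute the actual file name obtained after unzipping.

```python
# Minimal sketch of iterating over the corpus: one Wikipedia paragraph per line.
# "wiki.txt" is an assumed file name for the unzipped corpus.
num_paragraphs = 0
with open("wiki.txt", encoding="utf-8") as f:
    for line in f:
        paragraph = line.strip()
        if paragraph:            # skip any blank lines
            num_paragraphs += 1
print(f"Loaded {num_paragraphs} paragraphs")
```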