We provide a script fetch_data.sh to download the pre-processed Wikipedia dump. Alternatively, you can manually download the file from this Google Drive link and unzip it to obtain the text corpus.
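If you take the manual route, the only step after downloading is to unzip the archive in this directory. A minimal sketch, assuming the downloaded archive is named wiki_dump.zip (a placeholder; the actual filename from Google Drive may differ):

```sh
# Placeholder archive name; substitute the file actually downloaded from Google Drive.
unzip wiki_dump.zip -d .   # extract the text corpus into this directory (~13GB unzipped)
ls -lh                     # confirm the corpus file is present
```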
Note: If you have run eval_sim.sh from the root directory, the Wikipedia dump should already have been downloaded automatically under this directory, and you do not need to run fetch_data.sh or download the dataset manually.
The dataset is retrieved from the May 2019 Wikipedia database dump. The corpus is pre-processed with the Stanford CoreNLP toolkit for sentence tokenization, and each line of the text file contains one Wikipedia paragraph. The zipped file (to be downloaded) is ~4GB; after unzipping, the corpus is ~13GB.
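A few quick sanity checks on the extracted corpus (the filename wiki_corpus.txt is a placeholder; use whatever file the archive unzips to):

```sh
du -h wiki_corpus.txt        # expect roughly 13GB on disk
wc -l wiki_corpus.txt        # one line per Wikipedia paragraph
head -n 1 wiki_corpus.txt    # inspect a single sentence-tokenized paragraph
```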