We provide a script fetch_data.sh to download the pre-processed Wikipedia dump. Alternatively, you can manually download the file from this Google Drive link and unzip it to obtain the text corpus.
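If you take the manual route, the only step after downloading is to unzip the archive in this directory. A minimal sketch, assuming the downloaded archive is named wiki_dump.zip (a placeholder; the actual filename from Google Drive may differ):

```sh
# Placeholder archive name; substitute the file actually downloaded from Google Drive.
unzip wiki_dump.zip -d .   # extract the text corpus into this directory (~13GB unzipped)
ls -lh                     # confirm the corpus file is present
```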
Note: If you have run eval_sim.sh from the root directory, the Wikipedia dump should already have been downloaded automatically under this directory, and you do not need to run fetch_data.sh or download the dataset manually.
The dataset is retrieved from the May 2019 Wikipedia database dump. The corpus is pre-processed with the Stanford CoreNLP toolkit for sentence tokenization, and each line of the text file contains one Wikipedia paragraph. The zipped file (to be downloaded) is ~4GB; after unzipping, the corpus is ~13GB.
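A few quick sanity checks on the extracted corpus (the filename wiki_corpus.txt is a placeholder; use whatever file the archive unzips to):

```sh
du -h wiki_corpus.txt        # expect roughly 13GB on disk
wc -l wiki_corpus.txt        # one line per Wikipedia paragraph
head -n 1 wiki_corpus.txt    # inspect a single sentence-tokenized paragraph
```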