WikiDBs: A large-scale corpus of relational databases from Wikidata

This repository contains the code for WikiDBs (https://wikidbs.github.io/), a corpus of relational databases based on Wikidata (https://www.wikidata.org/).

Setup

Part 1: Set up MongoDB via Docker

The databases are created from the Wikidata JSON export. For efficient querying, the data is stored in MongoDB.

  1. Make sure that Docker is installed on your system (otherwise install it):
docker --version
  2. Download the MongoDB Docker image from DockerHub:
docker pull mongo
  3. Ensure that the image has been installed:
docker images
  4. Create a mongo-data and a mongo-config folder to save the data and configuration files in.
  5. Adapt the docker-compose.yaml file found in the mongodb folder of this repository to your system (container name, paths to folders, user id); a minimal sketch is shown after this list.
  6. Start MongoDB by running:
docker-compose up mongo
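
For reference, a minimal compose file for this setup might look roughly like the sketch below. The docker-compose.yaml shipped in the mongodb folder of this repository is the authoritative version; the container name, paths, and user id here are placeholders.

services:
  mongo:
    image: mongo
    container_name: wikidata-mongo    # placeholder, choose your own name
    user: "1000:1000"                 # placeholder, use your own user id (see: id -u)
    ports:
      - "27017:27017"                 # expose MongoDB to the host
    volumes:
      - /path/to/mongo-data:/data/db            # mongo-data folder created in step 4
      - /path/to/mongo-config:/data/configdb    # mongo-config folder created in step 4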

Part 2: Create a virtual environment and install the requirements

python -m virtualenv <env-name>
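
On Linux/macOS, an environment created by virtualenv is typically activated with the following command (on Windows, use the Scripts\activate script instead):

source <env-name>/bin/activate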

With the environment activated, run:

python -m pip install -r requirements.txt

as well as

python -m pip install --editable .
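
As a quick sanity check of the editable install (assuming the package is importable as wikidbs, matching the repository name; adjust if the module is named differently):

python -c "import wikidbs; print('wikidbs installed')"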

Part 3: Load Wikidata into the MongoDB

We provide two options. The first (3a) is to import our pre-processed MongoDB export archive into your MongoDB instance, which saves a lot of time and effort. If you want to do all the necessary steps from scratch yourself, refer to option 3b:

3a: Load our pre-processed Wikidata MongoDB export

  1. Copy the wikidata_mongodb_archive.gz file (~13.2GB) from our downloads to the mongo-data folder that you created in Part 1, step 4. Around 40GB of disk space are required for the mongo-data folder.

  2. Open a shell inside the MongoDB container by executing:

docker exec -it <container_name> bash

  3. Inside the container, import the archive by running:

mongorestore --archive=data/db/wikidata_mongodb_archive.gz --gzip --verbose

(this takes around 60 minutes on an Intel(R) Xeon(R) Gold 5120 CPU @ 2.20GHz)
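
To verify that the import succeeded, you can list the databases inside the container. Recent mongo images ship the mongosh shell; older images use the legacy mongo shell instead:

docker exec -it <container_name> mongosh --eval "db.adminCommand('listDatabases')"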

3b: Pre-processing from scratch

To do the pre-processing of the dump from scratch, follow these steps:

  1. Download the Wikidata dump. The dataset is based on the Wikidata JSON dump; the latest dump ("latest-all.json.gz") can be downloaded from https://dumps.wikimedia.org/wikidatawiki/entities/ (needs around 115GB of disk space):
wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.gz

Information page for downloads: https://www.wikidata.org/wiki/Wikidata:Database_download

  2. Preprocess the Wikidata dump. To load the dump into MongoDB, adapt the settings in 'conf/preprocess.yaml' and then run the following script:
python ./scripts/preprocessing/preprocess_dump.py

This will take around 50h.

  3. Convert the profiling dictionary into JSONL format. Adapt the settings in 'conf/convert.yaml' and then run the following script:
run_exp -m "Wikidata convert profiling dict" -n 0 -- python3 ./scripts/preprocessing/convert_jsonlines.py

This will take up to 40 hours, depending on the setting for 'label_names_min_num_rows'.
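
Since both the preprocessing and the conversion step run for many hours, it can be convenient to detach them from your terminal, e.g. with nohup (or a terminal multiplexer such as tmux):

nohup python ./scripts/preprocessing/preprocess_dump.py > preprocess.log 2>&1 &
tail -f preprocess.log    # follow the progress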

Create Databases

Adapt the settings in our config files, especially 'conf/databases.yaml', to your needs.

We provide performance-optimized scripts for each of the crawling, renaming, and postprocessing stages. On average, around 5MB of disk space is needed for each created database.

Our scripts are scalable and the number of workers for creating databases in parallel can be specified in our configuration file. On an Intel(R) Xeon(R) Gold 5120 CPU @ 2.20GHz we observe the following resource consumption: 1 CPU core per worker and ~25GiB RAM per worker. Each worker creates approximately 20 databases per hour on our system.
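
At these rates, for example, 8 workers would occupy roughly 8 CPU cores and ~200GiB of RAM and would produce around 160 databases per hour.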

To run the pipeline:

  1. The crawling stage creates databases from the Wikidata MongoDB dump:
python ./scripts/crawl_databases.py
  2. The renaming stage paraphrases table and column names using the OpenAI API with batch processing (adapt conf/rename.yaml):
python ./scripts/rename_databases.py
  3. The postprocessing stage transforms each database into the final output format (adapt conf/postprocess.yaml):
python ./scripts/postprocess.py
  4. The finalize script brings the databases into the exact format used for the WikiDBs corpus, with the option to split them into multiple subfiles (adapt the settings in scripts/finalize.py):
python ./scripts/finalize.py
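
Once the configuration files are adapted, the four stages can also be chained in a single shell invocation, for example:

python ./scripts/crawl_databases.py && \
python ./scripts/rename_databases.py && \
python ./scripts/postprocess.py && \
python ./scripts/finalize.py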
