Added tpc-ds example for single table for UWarwick
Benjamin Hilprecht committed Apr 29, 2020
1 parent 071ae9e commit 28522a2
Showing 36 changed files with 2,671 additions and 475 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -129,6 +129,7 @@ benchmarks/maqp_scripts/rsync_dm.sh
# profiling
profiling_results
profiling.py
bar.pdf
*.lprof

optimized_inference.cpp
78 changes: 64 additions & 14 deletions README.md
@@ -9,6 +9,7 @@ Benjamin Hilprecht, Andreas Schmidt, Moritz Kulessa, Alejandro Molina, Kristian
![DeepDB Overview](baselines/plots/overview.png "DeepDB Overview")

# Setup
Tested with Python 3.7 and Python 3.8.
```
git clone https://github.com/DataManagementLab/deepdb-public.git
cd deepdb-public
@@ -18,21 +19,12 @@ source venv/bin/activate
pip3 install -r requirements.txt
```

For Python 3.8: installing spflow sometimes fails. In that case, remove spflow from requirements.txt, install the remaining requirements, and then run
```
pip3 install spflow --no-deps
```

# Reproduce Experiments

## Cardinality Estimation
Download the [Job dataset](http://homepages.cwi.nl/~boncz/job/imdb.tgz).
@@ -288,3 +280,61 @@ python3 maqp.py --evaluate_confidence_intervals
--confidence_upsampling_factor 100
--confidence_sample_size 10000000
```

### TPC-DS (Single Table) pipeline
As an additional example of how to work with DeepDB, we provide a pipeline for a single table of the TPC-DS schema, using the queries in `./benchmarks/tpc_ds_single_table/sql/aqp_queries.sql`. As a prerequisite, you need a 10 million tuple sample of the store_sales table at `../mqp-data/tpc-ds-benchmark/store_sales_sampled.csv` (one possible way to create it is sketched below). To compute the ground truth, you additionally need a Postgres instance containing a 1T TPC-DS dataset. Afterwards, you can run the following commands.
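
The sample file is not shipped with the repository. Below is a minimal sketch of one way to create it, assuming the 1T TPC-DS data is already loaded into the Postgres instance that is also used for the ground truth; psycopg2, the connection parameters, the sampling percentage, and writing the file without a header row are assumptions that have to be adapted to your setup.
```
# sample_store_sales.py -- hypothetical helper, not part of the repository.
# Draws an approximately 10M-row sample of store_sales from the 1T TPC-DS
# Postgres instance and writes it as a '|'-separated csv file, matching the
# separator used by the commands below.
import psycopg2

# TABLESAMPLE SYSTEM is approximate; tune the percentage so that roughly
# 10 million of the ~2.9 billion store_sales tuples (scale factor 1000) remain.
SAMPLE_SQL = """
COPY (SELECT * FROM store_sales TABLESAMPLE SYSTEM (0.35))
TO STDOUT WITH (FORMAT csv, DELIMITER '|')
"""

with psycopg2.connect(dbname='tcpds') as conn:
    with conn.cursor() as cur, \
            open('../mqp-data/tpc-ds-benchmark/store_sales_sampled.csv', 'w') as f:
        cur.copy_expert(SAMPLE_SQL, f)
```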

Generate HDF files from the csv files
```
python3 maqp.py --generate_hdf \
    --dataset tpc-ds-1t \
    --csv_seperator '|' \
    --csv_path ../mqp-data/tpc-ds-benchmark \
    --hdf_path ../mqp-data/tpc-ds-benchmark/gen_hdf
```

Learn the ensemble
```
python3 maqp.py --generate_ensemble \
    --dataset tpc-ds-1t \
    --samples_per_spn 10000000 \
    --ensemble_strategy single \
    --hdf_path ../mqp-data/tpc-ds-benchmark/gen_hdf \
    --ensemble_path ../mqp-data/tpc-ds-benchmark/spn_ensembles \
    --rdc_threshold 0.3 \
    --post_sampling_factor 10
```

Compute ground truth
```
python3 maqp.py --aqp_ground_truth \
    --dataset tpc-ds-1t \
    --query_file_location ./benchmarks/tpc_ds_single_table/sql/aqp_queries.sql \
    --target_path ./benchmarks/tpc_ds_single_table/ground_truth_1t.pkl \
    --database_name tcpds
```

Evaluate the AQP queries
```
python3 maqp.py --evaluate_aqp_queries \
    --dataset tpc-ds-1t \
    --target_path ./baselines/aqp/results/deepDB/tpcds1t_model_based.csv \
    --ensemble_location ../mqp-data/tpc-ds-benchmark/spn_ensembles/ensemble_single_tpc-ds-1t_10000000.pkl \
    --query_file_location ./benchmarks/tpc_ds_single_table/sql/aqp_queries.sql \
    --ground_truth_file_location ./benchmarks/tpc_ds_single_table/ground_truth_1t.pkl
```

# How to Experiment with DeepDB on a New Dataset
- Specify a new schema in the schemas folder (a hypothetical sketch is shown after this list)
- Due to the current implementation, make sure to declare
  - the primary key,
  - the filename of the csv sample file,
  - the correct table size and sample rate,
  - the relationships among tables if you do not just run queries over a single table,
  - any non-key functional dependencies (this is rather an implementation detail),
  - and include all columns in the no-compression list by default (as done for the IMDB benchmark)
- To further reduce the training time, you can exclude columns you do not need in your experiments (also done in the IMDB benchmark)
- Generate the HDF/sampled HDF files and learn the RSPN ensemble
- Use the RSPN ensemble to answer queries
- For reference, please check the commands to reproduce the results of the paper
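
For the first step, the following is a hypothetical sketch of what a minimal single-table schema definition might look like, loosely modeled on the examples already present in the schemas folder. The table, attribute names, and sizes are made up, and the exact `Table`/`SchemaGraph` keyword arguments should be verified against `ensemble_compilation/graph_representation.py` and the shipped schemas.
```
# schemas/my_dataset/schema.py -- hypothetical sketch, not part of the repository.
# The keyword arguments are modeled on the shipped IMDB/TPC-DS schemas and should
# be double-checked against ensemble_compilation/graph_representation.py.
from ensemble_compilation.graph_representation import SchemaGraph, Table


def gen_my_dataset_schema(csv_path):
    schema = SchemaGraph()
    schema.add_table(Table(
        'orders',
        attributes=['order_id', 'customer_id', 'price', 'quantity'],
        csv_file_location=csv_path.format('orders_sampled'),
        table_size=100000000,                 # size of the full table, not of the sample
        sample_rate=10000000 / 100000000,     # fraction of the table covered by the csv sample
        primary_key=['order_id'],
        # include all columns in the no-compression list by default
        no_compression=['order_id', 'customer_id', 'price', 'quantity'],
    ))
    # columns you do not need can simply be left out to reduce training time;
    # schema.add_relationship(...) is only required for multi-table queries
    return schema
```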