Check in full 100 GB regular workload results
geoffxy committed Oct 6, 2023
1 parent e0a5144 commit 61fb6cd
Showing 4 changed files with 5,006 additions and 0 deletions.
13 changes: 13 additions & 0 deletions tools/query_dataset/README.md
# Dataset preprocessing pipeline

- Gather data using the existing `run_cost_model.py`
- Run query parsing using the existing `run_cost_model.py` to get the "parsed dataset"
- (If needed) Use `merge_collected.py` to merge the parsed results (e.g., if you
did multiple data collection passes)
- Use `unify.py` to group the collected data into one "standard dataset"
- Use `dataset_selection.py` to perform the train/test split (note that this
  script requires manual modification before running)
- If this is a key dataset used in the evaluation, commit it under `workloads`
(see `IMDB_100GB`).
- Use `prepare_datasets.sh` to convert the data into the format expected by the
  existing GNN model training script (see the sketch after this list)
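A minimal end-to-end sketch of the pipeline, assuming the scripts are invoked
from this directory. The flag names, directory names, and pass layout below are
hypothetical (the scripts' actual CLIs are not documented here); adjust them to
match the real arguments each script accepts.

```bash
# Hypothetical end-to-end run; all flags and paths are illustrative only.

# 1. Gather raw data, then parse the queries, with the existing driver script.
python run_cost_model.py --collect --out collected_pass1/   # assumed flags
python run_cost_model.py --parse --in collected_pass1/ --out parsed_pass1/

# 2. (If needed) Merge parsed results from multiple collection passes.
python merge_collected.py parsed_pass1/ parsed_pass2/ --out parsed_merged/

# 3. Group the parsed data into one "standard dataset".
python unify.py --in parsed_merged/ --out standard_dataset/

# 4. Train/test split. NOTE: dataset_selection.py must be edited by hand
#    first (e.g., to point at the dataset and choose the split).
python dataset_selection.py --in standard_dataset/ --out split_dataset/

# 5. Convert the split data into the format the GNN training script expects.
./prepare_datasets.sh split_dataset/
```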
