Check in full 100 GB regular workload results
geoffxy committed Oct 6, 2023
1 parent e0a5144 commit 61fb6cd
Showing 4 changed files with 5,006 additions and 0 deletions.
13 changes: 13 additions & 0 deletions tools/query_dataset/README.md
# Dataset preprocessing pipeline

- Gather data using the existing `run_cost_model.py`
- Run query parsing using the existing `run_cost_model.py` to get the "parsed dataset"
- (If needed) Use `merge_collected.py` to merge the parsed results (e.g., if you
did multiple data collection passes)
- Use `unify.py` to group the collected data into one "standard dataset"
- Use `dataset_selection.py` to perform the train/test split (note that this
  script requires manual modification before running)
- If this is a key dataset used in the evaluation, commit it under `workloads`
(see `IMDB_100GB`).
- Use `prepare_datasets.sh` to convert the data into the format expected by the
  existing GNN model training script (see the sketch after this list)
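A minimal end-to-end sketch of the pipeline, assuming the scripts are invoked
from this directory. The flag names, directory names, and pass layout below are
hypothetical (the scripts' actual CLIs are not documented here); adjust them to
match the real arguments each script accepts.

```bash
# Hypothetical end-to-end run; all flags and paths are illustrative only.

# 1. Gather raw data, then parse the queries, with the existing driver script.
python run_cost_model.py --collect --out collected_pass1/   # assumed flags
python run_cost_model.py --parse --in collected_pass1/ --out parsed_pass1/

# 2. (If needed) Merge parsed results from multiple collection passes.
python merge_collected.py parsed_pass1/ parsed_pass2/ --out parsed_merged/

# 3. Group the parsed data into one "standard dataset".
python unify.py --in parsed_merged/ --out standard_dataset/

# 4. Train/test split. NOTE: dataset_selection.py must be edited by hand
#    first (e.g., to point at the dataset and choose the split).
python dataset_selection.py --in standard_dataset/ --out split_dataset/

# 5. Convert the split data into the format the GNN training script expects.
./prepare_datasets.sh split_dataset/
```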
