Gaia data directory structure
A directory structure for the Gaia data that accommodates multiple ways of partitioning the data into Parquet files.
Top-level directories, one for each partitioning structure:
data
  |
  \-- gaia
        |
        +-- GEDR3_2048
        |
        \-- GEDR3_4096
Within each of those is a directory for the main Gaia source table:
data
  |
  \-- gaia
        |
        +-- GEDR3_2048
        |     |
        |     \-- GEDR3_2048_GAIASOURCE
        |
        \-- GEDR3_4096
              |
              \-- GEDR3_4096_GAIASOURCE
The corresponding neighbour tables then sit next to the Gaia source tables:
data
  |
  \-- gaia
        |
        +-- GEDR3_2048
        |     |
        |     +-- GEDR3_2048_GAIASOURCE
        |     |
        |     +-- GEDR3_2048_2MASSPSC_BEST_NEIGHBOURS
        |     |
        |     +-- GEDR3_2048_ALLWISE_BEST_NEIGHBOURS
        |     |
        |     \-- GEDR3_2048_PS1_BEST_NEIGHBOURS
        |
        \-- GEDR3_4096
              |
              +-- GEDR3_4096_GAIASOURCE
              |
              +-- GEDR3_4096_2MASSPSC_BEST_NEIGHBOURS
              |
              +-- GEDR3_4096_ALLWISE_BEST_NEIGHBOURS
              |
              \-- GEDR3_4096_PS1_BEST_NEIGHBOURS
The naming convention is deliberately verbose, repeating the partition count at each level, because it helps us keep track of what each directory contains.
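The naming convention is regular enough to generate. A minimal sketch, assuming the `data/gaia` root and the table names shown in the tree above; the `table_dir` helper is hypothetical and only illustrates how the partition count is repeated at each level:

```python
from pathlib import PurePosixPath

# Table names taken from the directory layout above.
TABLES = [
    "GAIASOURCE",
    "2MASSPSC_BEST_NEIGHBOURS",
    "ALLWISE_BEST_NEIGHBOURS",
    "PS1_BEST_NEIGHBOURS",
]


def table_dir(partitions: int, table: str, release: str = "GEDR3",
              root: str = "data/gaia") -> PurePosixPath:
    """Return the verbose path for one table in one partitioning scheme.

    The release and partition count appear in both the parent directory
    and the table directory, so each name is self-describing.
    """
    prefix = f"{release}_{partitions}"
    return PurePosixPath(root) / prefix / f"{prefix}_{table}"


# List every directory for the two partitioning schemes.
for partitions in (2048, 4096):
    for table in TABLES:
        print(table_dir(partitions, table))
```

Because every directory name carries its own partition count, a path copied out of context still says which partitioning it belongs to.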
We can create a more concise directory structure for our end users.
data
  |
  +-- gaia
        |
        \-- GEDR3
              |
              +-- GEDR3_GAIASOURCE
              |
              +-- GEDR3_2MASSPSC_BEST_NEIGHBOURS
              |
              +-- GEDR3_ALLWISE_BEST_NEIGHBOURS
              |
              \-- GEDR3_PS1_BEST_NEIGHBOURS
This uses symlinks to point to specific versions underneath:
data
  |
  +-- gaia
        |
        \-- GEDR3
              |
              +-- GEDR3_GAIASOURCE -> ../GEDR3_4096/GEDR3_4096_GAIASOURCE
              |
              +-- GEDR3_2MASSPSC_BEST_NEIGHBOURS -> ../GEDR3_4096/GEDR3_4096_2MASSPSC_BEST_NEIGHBOURS
              |
              +-- GEDR3_ALLWISE_BEST_NEIGHBOURS -> ../GEDR3_4096/GEDR3_4096_ALLWISE_BEST_NEIGHBOURS
              |
              \-- GEDR3_PS1_BEST_NEIGHBOURS -> ../GEDR3_4096/GEDR3_4096_PS1_BEST_NEIGHBOURS
Using symlinks like this enables us to update which version of the data is used without having to update the directory paths in the notebooks.
The directory and symlink structure is created automatically during deployment, making it reliable and repeatable.
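As an illustration only, the symlink layer could be created and later repointed along these lines. This is a sketch, not the actual deployment code: the `point_aliases` function, its parameters, and the temporary link name are assumptions. It writes each new link under a temporary name and renames it over the old one, so a link never goes missing partway through an update (POSIX behaviour is assumed).

```python
import os
from pathlib import Path

# Table names taken from the directory layout above.
TABLES = [
    "GAIASOURCE",
    "2MASSPSC_BEST_NEIGHBOURS",
    "ALLWISE_BEST_NEIGHBOURS",
    "PS1_BEST_NEIGHBOURS",
]


def point_aliases(gaia_root: Path, partitions: int, release: str = "GEDR3") -> None:
    """Create or repoint the concise symlinks for every table.

    Hypothetical helper: safe to run repeatedly, each run leaves the
    links pointing at the requested partitioning.
    """
    alias_dir = gaia_root / release
    alias_dir.mkdir(parents=True, exist_ok=True)
    for table in TABLES:
        link = alias_dir / f"{release}_{table}"
        # Relative target, matching the '../GEDR3_4096/...' form shown above.
        prefix = f"{release}_{partitions}"
        target = Path("..") / prefix / f"{prefix}_{table}"
        # Build the new link under a temporary name first.
        tmp = alias_dir / f".{link.name}.tmp"
        if tmp.is_symlink():
            tmp.unlink()  # left over from an interrupted earlier run
        os.symlink(target, tmp)
        # Renaming over the old link swaps the target in a single step.
        os.replace(tmp, link)
```

Calling `point_aliases(Path("data/gaia"), 4096)` would produce the links shown above, and calling it again with `2048` would switch every link to the other partitioning while the paths used in the notebooks stay the same.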