Gaia data directory structure

Zarquan edited this page May 12, 2021 · 1 revision

This page describes a directory structure for the Gaia data that supports multiple ways of partitioning the data into Parquet files.

At the top level there is one directory for each partitioning scheme, named after the release and the partition count.

 data
   |
   \-- gaia
         |
         +-- GEDR3_2048
         |
         \-- GEDR3_4096

Within each of those is a directory for the main Gaia source table.

 data
   |
   \-- gaia
         |
         +-- GEDR3_2048
         |      |
         |      \-- GEDR3_2048_GAIASOURCE
         |
         \-- GEDR3_4096
                |
                \-- GEDR3_4096_GAIASOURCE

The corresponding neighbour tables sit next to the Gaia source table in each directory.

 data
   |
   \-- gaia
         |
         +-- GEDR3_2048
         |      |
         |      +-- GEDR3_2048_GAIASOURCE
         |      |
         |      +-- GEDR3_2048_2MASSPSC_BEST_NEIGHBOURS
         |      |
         |      +-- GEDR3_2048_ALLWISE_BEST_NEIGHBOURS
         |      |
         |      \-- GEDR3_2048_PS1_BEST_NEIGHBOURS
         |
         \-- GEDR3_4096
                |
                +-- GEDR3_4096_GAIASOURCE
                |
                +-- GEDR3_4096_2MASSPSC_BEST_NEIGHBOURS
                |
                +-- GEDR3_4096_ALLWISE_BEST_NEIGHBOURS
                |
                \-- GEDR3_4096_PS1_BEST_NEIGHBOURS

The naming convention is deliberately verbose, repeating the partition count at each level, because it helps us keep track of what each directory contains.
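As a rough illustration of the convention, the sketch below builds the path for a table from the release name, the partition count and the table name. The helper `partitioned_table_path`, the `TABLES` tuple and the `data/gaia` root are hypothetical names chosen for this example; they are not part of the deployment code.

```python
from pathlib import PurePosixPath

# Table names as they appear in the directory listing above.
TABLES = (
    "GAIASOURCE",
    "2MASSPSC_BEST_NEIGHBOURS",
    "ALLWISE_BEST_NEIGHBOURS",
    "PS1_BEST_NEIGHBOURS",
)


def partitioned_table_path(root, release, partitions, table):
    """Return root/<release>_<partitions>/<release>_<partitions>_<table>.

    Hypothetical helper: it only assembles the name, it does not touch disk.
    """
    prefix = f"{release}_{partitions}"
    return PurePosixPath(root) / prefix / f"{prefix}_{table}"


# Print the four table directories for the 4096-partition copy.
for table in TABLES:
    print(partitioned_table_path("data/gaia", "GEDR3", 4096, table))
```

Because the partition count appears in both the parent directory and the table directory, a path copied out of context still says which partitioning it belongs to.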

We can create a more concise directory structure for our end users.

 data
   |
   \-- gaia
         |
         \-- GEDR3
                |
                +-- GEDR3_GAIASOURCE
                |
                +-- GEDR3_2MASSPSC_BEST_NEIGHBOURS
                |
                +-- GEDR3_ALLWISE_BEST_NEIGHBOURS
                |
                \-- GEDR3_PS1_BEST_NEIGHBOURS

This structure uses symlinks that point to specific versions underneath.

 data
   |
   \-- gaia
         |
         \-- GEDR3
                |
                +-- GEDR3_GAIASOURCE -> ../GEDR3_4096/GEDR3_4096_GAIASOURCE
                |
                +-- GEDR3_2MASSPSC_BEST_NEIGHBOURS -> ../GEDR3_4096/GEDR3_4096_2MASSPSC_BEST_NEIGHBOURS
                |
                +-- GEDR3_ALLWISE_BEST_NEIGHBOURS  -> ../GEDR3_4096/GEDR3_4096_ALLWISE_BEST_NEIGHBOURS
                |
                \-- GEDR3_PS1_BEST_NEIGHBOURS      -> ../GEDR3_4096/GEDR3_4096_PS1_BEST_NEIGHBOURS

Using symlinks like this lets us change which version of the data is used without having to update the directory paths in the notebooks.
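A minimal sketch of how the links could be created and later re-pointed, assuming a POSIX filesystem. The function `link_release` and the `TABLES` tuple are hypothetical names for this example and are not taken from the deployment code. Each link is first written under a temporary name and then renamed over the old one, so a reader should never find the link missing part-way through a switch.

```python
import os
from pathlib import Path

# Table names as they appear in the directory listing above.
TABLES = (
    "GAIASOURCE",
    "2MASSPSC_BEST_NEIGHBOURS",
    "ALLWISE_BEST_NEIGHBOURS",
    "PS1_BEST_NEIGHBOURS",
)


def link_release(gaia_dir, release, partitions):
    """Point <release>/<release>_<table> at the chosen partitioning.

    Hypothetical helper: safe to run repeatedly, and calling it again with a
    different partition count re-points the existing links.
    """
    public_dir = Path(gaia_dir) / release
    public_dir.mkdir(parents=True, exist_ok=True)
    for table in TABLES:
        link = public_dir / f"{release}_{table}"
        # Relative target, matching the "../GEDR3_4096/..." form shown above.
        target = os.path.join(
            "..",
            f"{release}_{partitions}",
            f"{release}_{partitions}_{table}",
        )
        # Create the new link under a temporary name first.
        tmp = link.with_name(link.name + ".tmp")
        if tmp.is_symlink():
            tmp.unlink()  # left over from an interrupted earlier run
        os.symlink(target, tmp)
        # Rename over the old link; rename acts on the link itself,
        # not on the directory it points to.
        os.replace(tmp, link)
```

With this sketch, `link_release("data/gaia", "GEDR3", 4096)` would produce the links shown above, and a later `link_release("data/gaia", "GEDR3", 2048)` would switch them to the 2048-partition copy. The relative targets mean the whole `gaia` directory can be mounted at a different location without breaking the links.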

The directory and symlink structure is created automatically during deployment, making it reliable and repeatable.
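One way to back up the repeatability claim would be a check at the end of deployment that every user-facing link resolves to something that exists. The sketch below is an assumption about what such a check could look like; `broken_links` is a hypothetical name and is not part of the actual deployment.

```python
from pathlib import Path


def broken_links(public_dir):
    """Return the names of symlinks in public_dir whose targets are missing."""
    return sorted(
        entry.name
        for entry in Path(public_dir).iterdir()
        # is_symlink() looks at the link itself; exists() follows it.
        if entry.is_symlink() and not entry.exists()
    )
```

An empty list from `broken_links("data/gaia/GEDR3")` would mean every link in the user-facing directory points at a real directory.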