# Scheduler IoW Notes
- build files (`implnet_{jobs,ops}_*.py`) are tracked in the repo, making a verbose git history and PRs more work to review
- multiple organizations have stored configurations in the repo, causing a higher burden on maintainers
- the build is done with environment variables and multiple coupled components instead of one build script, making it more challenging to debug, test, and refactor
- Build the `gleanerconfig.yml`
  - This config builds upon a `gleanerconfigPREFIX.yaml` file that is the base template
- Source the `nabuconfig.yaml`, which specifies configuration and context for how to retrieve triple data and how to store it in minio
Generate the
jobs/
ops/
sch/
andrepositories/
directories which container the Python files that describe when to run the job -
Generate the
workspace.yaml
file- Some configurations of the
workspace.yaml
file include agrpc_server
key. Other just describe the relative path for the Python file which contains references to all the jobs - This might be able to be eliminated or condense into the other config when refactoring
- Some configurations of the
- Set up the docker swarm configuration using `dagster_setup_docker.sh`
  - Create the docker network
  - Create the volume and read in the `gleanerconfig.yaml`, `workspace.yaml`, and `nabuconfig.yaml`
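A rough illustration of what one of the generated per-source Dagster files might contain (the op name, job name, and cron string below are hypothetical placeholders, not taken from the repo):

```python
from dagster import ScheduleDefinition, job, op


@op
def harvest_example_source():
    # Hypothetical body: in the real generated files this would trigger
    # gleaner/nabu work for one specific source.
    ...


@job
def implnet_job_example_source():
    harvest_example_source()


# The schedule ties the job to a cron expression describing when to run it
implnet_sch_example_source = ScheduleDefinition(
    job=implnet_job_example_source,
    cron_schedule="0 6 * * *",  # placeholder; the real value comes from the build
)
```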
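For the `workspace.yaml` step, a minimal sketch of how the file could be emitted from Python, assuming Dagster's `load_from` layout with either a `python_file` or a `grpc_server` entry (the host, port, and paths here are placeholders):

```python
import yaml


def write_workspace(path: str, use_grpc: bool) -> None:
    """Write a minimal Dagster workspace.yaml with placeholder values."""
    if use_grpc:
        load_from = [{"grpc_server": {"host": "code-project", "port": 4000,
                                      "location_name": "project"}}]
    else:
        load_from = [{"python_file": {"relative_path": "repositories/repository.py"}}]
    with open(path, "w") as f:
        yaml.safe_dump({"load_from": load_from}, f, sort_keys=False)


write_workspace("workspace.yaml", use_grpc=False)
```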
NOTE: After this point, the configuration and docker compose files have a significant number of env vars, configuration options, and merged configurations that make the following steps a bit unclear
- Run the docker compose project (see the sketch after this list)
  - Source the `.env` file to hold env variables and pass these into the compose project
  - Ensure all the config files are contained inside the container
  - Check if there is a compose override `.yml` file and if so, pass it in
- This docker compose project will manage:
  - traefik as a proxy to access container resources
  - dagster for scheduling crawls. This in turn manages the following:
    - `postgres` appears to be just for storing internal data
    - `dagit` appears to be the config for the actual crawl itself (i.e. uses the `GLEANERIO_*` env vars)
    - `daemon` appears to source the base config for dagster
    - `code-tasks` and `code-project` seem to be grpc endpoints for interacting with dagster (NOTE: I am a bit unclear on their usage)
  - the s3 provider (minio in this case), gleaner, and nabu for crawling / storing data
- Once crawling is scheduled and completed, I am assuming that the resulting triples will be output in the specified s3 bucket
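The compose startup described above is handled by shell scripts and env files in the repo; purely as a sketch of the logic (file names here are assumptions), it amounts to something like:

```python
import subprocess
from pathlib import Path


def start_compose(project_dir: str = ".") -> None:
    """Bring up the compose project, passing the .env file and any override file."""
    cmd = ["docker", "compose", "--env-file", ".env", "-f", "docker-compose.yaml"]
    override = Path(project_dir) / "docker-compose.override.yml"  # hypothetical name
    if override.exists():
        cmd += ["-f", str(override)]
    subprocess.run(cmd + ["up", "-d"], cwd=project_dir, check=True)
```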
- Condense code into one central Python build program
- Use https://github.com/docker/docker-py to control the containers instead of shell scripts (having it all in one language makes the data pipeline easier to test and debug)
- By using a CLI library like https://typer.tiangolo.com/ we can validate argument correctness and fail early, making it easier to debug instead of reading in the arguments and failing after containers are spun up (see the first sketch after this list)
- Move all build files to the root of the repo to make it more clear for end users
  - (i.e. makefiles, the `build/` directory, etc.)
- Refactor such that individual organizations store their configuration outside the repo.
- The Python build program should be able to read the configuration files at an arbitrary path that the user specifies
- Add types and doc strings for easier maintenance long term
- Use jinja templating instead of writing raw text to the output files (see the second sketch after this list)
- Currently jobs are generated by outputting literal function names templated inside a Python file
- Unclear if this is scalable to huge datasets. Probably best to use a generator so we do not need to load everything into the AST
- Create a clearer documentation website
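To illustrate the docker-py + typer proposal (the image name, flags, and options below are hypothetical, not the project's actual interface), a central build CLI could look roughly like this:

```python
from pathlib import Path

import docker
import typer

app = typer.Typer()


@app.command()
def crawl(gleaner_config: Path = typer.Option(..., exists=True, help="Path to gleanerconfig.yaml")):
    """Validate arguments up front, then drive the containers from Python."""
    client = docker.from_env()
    # Hypothetical one-off gleaner run with the config directory mounted read-only
    logs = client.containers.run(
        "gleanerio/gleaner:latest",  # placeholder image name
        command=["-cfg", "/configs/gleanerconfig.yaml"],  # placeholder flags
        volumes={str(gleaner_config.parent.resolve()): {"bind": "/configs", "mode": "ro"}},
        remove=True,
    )
    typer.echo(logs.decode())


if __name__ == "__main__":
    app()
```

Because the config path is a `typer.Option` with `exists=True`, a bad path fails before any container is started.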
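And a minimal sketch of the jinja + generator idea, rendering one job file per source lazily instead of building everything in memory (the template text and source name are made up for illustration):

```python
from jinja2 import Environment

JOB_TEMPLATE = Environment().from_string(
    """\
from dagster import job, op


@op
def harvest_{{ name }}():
    ...


@job
def implnet_job_{{ name }}():
    harvest_{{ name }}()
"""
)


def render_jobs(source_names):
    """Yield (filename, contents) pairs one at a time rather than all at once."""
    for name in source_names:
        yield f"implnet_jobs_{name}.py", JOB_TEMPLATE.render(name=name)


for filename, contents in render_jobs(["example_source"]):
    with open(filename, "w") as f:
        f.write(contents)
```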