diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index eb7817e..833463a 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -1,7 +1,22 @@
-## Contributing
+# Contributing
 We welcome all contributions! That said, this is a Sage Bionetworks owned project, and we use JIRA ([AG](https://sagebionetworks.jira.com/jira/software/c/projects/AG/boards/91)/[IBCDPE](https://sagebionetworks.jira.com/jira/software/c/projects/IBCDPE/boards/189)) to track any bug/feature requests. This guide will be more focussed on a Sage Bio employee's development workflow. If you are a Sage Bio employee, make sure to assign yourself the JIRA ticket if you decide to work on it.
+- [Coding Style](#coding-style)
+- [The Development Life Cycle](#the-development-life-cycle)
+ - [Install Development Dependencies](#install-development-dependencies)
+ - [Developing at Sage Bio](#developing-at-sage-bio)
+ - [Pre-Commit Hooks](#pre-commit-hooks)
+ - [Testing](#testing)
+ - [Running tests](#running-tests)
+ - [Test Development](#test-development)
+ - [Mock Testing](#mock-testing)
+- [Transforms](#transforms)
+- [Great Expectations](#great-expectations)
+ - [Custom Expectations](#custom-expectations)
+ - [Nested Columns](#nested-columns)
+- [DockerHub](#dockerhub)
+
 ## Coding Style
 The code in this package is also automatically formatted by `black` for consistency.
@@ -21,7 +36,7 @@ The code in this package is also automatically formatted by `black` for consiste
 git pull upstream develop
 ```
 -->
-### Install development dependencies
+### Install Development Dependencies
 Please follow the [README.md](README.md) to install the package for development purposes. Be sure you run this:
@@ -167,9 +182,9 @@ This package uses [Great Expectations](https://greatexpectations.io/) to validat
 1. Create a new expectation suite by defining the expectations for the dataset in a Jupyter Notebook inside the `gx_suite_definitions` folder. Use `metabolomics.ipynb` as an example. You can find a catalog of existing expectations [here](https://greatexpectations.io/expectations/).
 1. Run the notebook to generate the new expectation suite. It should populate as a JSON file in the `/great_expectations/expectations` folder.
-1. Add support for running Great Expectations on a dataset by adding `gx_enabled: true` to the configuration for the datatset in both `test_config.yaml` and `config.yaml`. After updating the config files reports should be uploaded in the proper locations ([Prod](https://www.synapse.org/#!Synapse:syn52948668), [Testing](https://www.synapse.org/#!Synapse:syn52948670)) when data processing is complete.
- - You can prevent Great Expectations from running for a dataset by removing the `gx_enabled: true` from the configuration for the dataset.
-1. Test data processing by running `adt test_config.yaml` and ensure that HTML reports with all expectations are generated and uploaded to the proper folder in Synapse.
+1. Add support for running Great Expectations on a dataset by adding `gx_enabled: true` to the configuration for the datatset in both `test_config.yaml` and `config.yaml`. Ensure that the `gx_folder` and `gx_table` keys are present in the configuration file and contain valid Synapse IDs for the GX reports and GX table, respectively.
+ - You can prevent Great Expectations from running for a dataset by setting `gx_enabled: false` in the configuration for the dataset.
+1. Test data processing by running `adt test_config.yaml --upload` and ensure that HTML reports with all expectations are generated and uploaded to the proper folder in Synapse.
 #### Custom Expectations
diff --git a/README.md b/README.md
index 3ca32f7..c2b0a37 100644
--- a/README.md
+++ b/README.md
@@ -2,12 +2,13 @@
 - [Intro](#intro)
 - [Running the pipeline](#running-the-pipeline)
- - [Nextflow Tower](#nextflow-tower)
+ - [Seqera Platform](#seqera-platform)
+ - [Configuring Synapse Credentials](#configuring-synapse-credentials)
 - [Locally](#locally)
 - [Docker](#docker)
 - [Testing Github Workflow](#testing-github-workflow)
 - [Unit Tests](#unit-tests)
-- [Config](#config)
+- [Pipeline Configuration](#pipeline-configuration)
 ## Intro
 A place for Agora's ETL, data testing, and data analysis
@@ -18,7 +19,7 @@ parameters defined in a config file to determine which kinds of extraction and t
 dataset needs to go through before the resulting data is serialized as json files that can be loaded into Agora's data repository.
 In the spirit of importing datasets with the minimum amount of transformations, one can simply add a dataset to the config file,
-and run the scripts.
+and run the tool.
 This `src/agoradatatools` implementation was influenced by the "Modern Config Driven ELT Framework for Building a Data Lake" talk given at the Data + AI Summit of 2021.
@@ -33,9 +34,9 @@ Note that running the pipeline does _not_ automatically update the Agora databas
 into the Agora databases is handled by [agora-data-manager](https://github.com/Sage-Bionetworks/agora-data-manager/).
 You can run the pipeline in any of the following ways:
-1. [Nextflow Tower](#nextflow-tower) is the simplest, but least flexible, way to run the pipeline; it does not require Synapse permissions, creating a Synapse PAT, or setting up the Synapse Python client.
-2. [Locally](#locally) requires installing Python and Pipenv, obtaining the required Synapse permissions, creating a Synpase PAT, and setting up the Synapse Python client.
-3. [Docker](#docker) requires installing Docker, obtaining the required Synapse permissions, and creating a Synpase PAT.
+1. **Seqera Platform**: is the simplest, but least flexible, way to run the pipeline; it does not require Synapse permissions, creating a Synapse PAT, or setting up the Synapse Python client.
+2. **Locally**: requires installing Python and Pipenv, obtaining the required Synapse permissions, creating a Synpase PAT, and setting up the Synapse Python client.
+3. **Docker**: requires installing Docker, obtaining the required Synapse permissions, and creating a Synpase PAT.
 When running the pipeline, you must specify the config file that will be used. There are two config files that are checked into this repo:
 * ```test_config.yaml``` places the transformed datasets in the [Agora Testing Data](https://www.synapse.org/#!Synapse:syn17015333) folder in synapse; write files to this folder to perform data validation.
@@ -45,8 +46,8 @@ Note that files in the Agora Live Data folder are not automatically released, so
 You may also create a custom config file to use locally to target specific dataset(s) or transforms of interest, and/or to write the generated json files to a different Synapse location. See the [config file](#config) section for additional information.
-### Nextflow Tower
-This pipeline can be executed without any local installation, permissions, or credentials; the Sage Bionetworks Nextflow Tower workspace is configured to use Agora's Synapse credentials, which can be found in LastPass in the "Shared-Agora" Folder.
+### Seqera Platform
+This pipeline can be executed without any local installation, permissions, or credentials; the Sage Bionetworks Seqera Platform workspace is configured to use Agora's Synapse credentials, which can be found in LastPass in the "Shared-Agora" Folder.
 The instructions to trigger the workflow can be found at [Sage-Bionetworks-Workflows/nf-agora](https://github.com/Sage-Bionetworks-Workflows/nf-agora)
@@ -80,7 +81,7 @@ Perform the following one-time steps to set up your local environment and obtain
 pipenv shell
 ```
-6. You can check if the package was isntalled correctly by running `adt --help` in the terminal. If it returns instructions about how to use the CLI, installation was successful and you can run the pipeline by providing the desired [config file](#config) as an argument. Be sure to review these instructions prior to executing a processing run. The following example command will execute the pipeline using ```test_config.yaml```:
+6. You can check if the package was installed correctly by running `adt --help` in the terminal. If it returns instructions about how to use the CLI, installation was successful and you can run the pipeline by providing the desired [config file](#config) as an argument. Be sure to review these instructions prior to executing a processing run. The following example command will execute the pipeline using ```test_config.yaml``` and the default options:
 ```bash
 adt test_config.yaml
@@ -119,16 +120,25 @@ Unit tests can be run by calling pytest from the command line.
 python -m pytest
 ```
-## Config
+## Pipeline Configuration
 Parameters:
 - `destination`: Defines the default target location (folder) that the generated json files are written to; this value can be overridden on a per-dataset basis
 - `staging_path`: Defines the location of the staging folder that the generated json files are written to
-- `gx_folder`: Defines the Synapse ID of the folder that generated GX reports are written to
+- `gx_folder`: Defines the Synapse ID of the folder that generated GX reports are written to. This key must always be present in the config file. A valid Synapse ID assigned to `gx_folder` is required if `gx_enabled` is set to `true` for any dataset. If this key is missing from the dataset, or if it is set to `none` when `gx_enabled` is `true` for any dataset, an error will be thrown.
+- `gx_table`: Defines the Synapse ID of the table that generated GX reporting is posted to. This key must always be present in the config file. A valid Synapse ID assigned to `gx_table` is required if `gx_enabled` is set to `true` for any dataset. If this key is missing from the dataset, or if it is set to `none` when `gx_enabled` is `true` for any dataset, an error will be thrown.
+- `sources/`: Source files for each dataset are defined in the `sources` section of the config file.
+- `sources//_files`: A list of source file information for the dataset.
+- `sources//_files/name`: The name of the source file/dataset.
+- `sources//_files/id`: The Synapse ID of the source file. Dot notation is supported to indicate the version of the file to use.
+- `sources//_files/format`: The format of the source file.
 - `datasets/`: Each generated json file is named `.json`
 - `datasets//files`: A list of source files for the dataset
 - `name`: The name of the source file (this name is the reference the code will use to retrieve a file from the configuration)
 - `id`: Synapse id of the file
 - `format`: The format of the source file
+- `datasets//final_format`: The format of the generated output file.
+- `datasets//gx_enabled`: Whether or not GX validation should be run on the dataset. `true` will run GX validation, `false` or the absence of this key will skip GX validation.
+- `datasets//gx_nested_columns`: A list of nested columns that should be validated using GX nested validation. Failure to include this key and a valid list of columns will result in an error because the nested fields will not be converted to a JSON-parseable string prior to validation. This key is not needed if `gx_enabled` is not set to `true` or if the dataset does not have nested fields.
 - `datasets//provenance`: The Synapse id of each entity that the dataset is derived from, used to populate the generated file's Synapse provenance. (The Synapse API calls this "Activity")
 - `datasets//destination`: Override the default destination for a specific dataset by specifying a synID, or use `*dest` to use the default destination
 - `datasets//column_rename`: Columns to be renamed prior to data transformation
diff --git a/src/agoradatatools/process.py b/src/agoradatatools/process.py
index 4532908..39a60e9 100644
--- a/src/agoradatatools/process.py
+++ b/src/agoradatatools/process.py
@@ -309,7 +309,7 @@ def process_all_files(
     "LOCAL",
     "--platform",
     "-p",
-    help="Platform that is running the process. Must be one of LOCAL, GITHUB, or NEXTFLOW (Optional).",
+    help="Platform that is running the process. Must be one of LOCAL, GITHUB, or NEXTFLOW (Optional, defaults to LOCAL).",
     show_default=True,
 )
 run_id_opt = Option(
@@ -323,17 +323,17 @@ def process_all_files(
     False,
     "--upload",
     "-u",
-    help="Toggles whether or not files will be uploaded to Synapse. The absence of this option means "
-    "that neither output data files nor GX reports will be uploaded to Synapse. Setting "
-    "`--upload` in the command will cause both to be uploaded. This option is used to control "
-    "the upload behavior of the process.",
+    help="Boolean value that toggles whether or not files will be uploaded to Synapse. The absence of this option means "
+    "`False` - that neither output data files nor GX reports will be uploaded to Synapse. Setting "
+    "`--upload` in the command will cause both to be uploaded. (Optional, defaults to False)",
     show_default=True,
 )
 synapse_auth_opt = Option(
     None,
     "--token",
     "-t",
-    help="Synapse authentication token. Defaults to environment variable $SYNAPSE_AUTH_TOKEN via syn.login() functionality",
+    help="Synapse authentication token. (Required, Defaults to environment variable SYNAPSE_AUTH_TOKEN via syn.login() functionality "
+    "https://python-docs.synapse.org/reference/client/?h=syn.login#synapseclient.Synapse.login)",
     show_default=False,
 )
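The `gx_folder`/`gx_table` rules added to the README in this diff can be sketched as a small standalone check. This is only an illustration and not the validation code in `agoradatatools`: `validate_gx_config` is a hypothetical helper, the `syn00000000` ID is a placeholder, and the sketch assumes the parsed config is a plain dict whose `datasets` entry maps dataset names to their settings, which may not match the real config layout.

```python
from typing import Any, Dict, List

GX_KEYS = ("gx_folder", "gx_table")


def validate_gx_config(config: Dict[str, Any]) -> List[str]:
    """Return the names of datasets with GX enabled, raising on a bad GX setup.

    Hypothetical helper. Assumes config["datasets"] is a mapping of
    dataset name -> settings dict (an assumption made for this sketch).
    """
    # Both keys must always be present, even when no dataset uses GX.
    for key in GX_KEYS:
        if key not in config:
            raise KeyError(f"'{key}' must always be present in the config file")

    # A dataset counts as enabled only when gx_enabled is explicitly true;
    # false or a missing key skips GX validation.
    enabled = [
        name
        for name, settings in config.get("datasets", {}).items()
        if settings.get("gx_enabled") is True
    ]

    # A usable Synapse ID is required only if at least one dataset is enabled.
    if enabled:
        for key in GX_KEYS:
            value = config[key]
            # Treat both a null value and the literal string "none" as unset.
            if value is None or str(value).strip().lower() == "none":
                raise ValueError(
                    f"'{key}' needs a Synapse ID because gx_enabled is true for: {enabled}"
                )
    return enabled


example = {
    "gx_folder": "syn52948668",
    "gx_table": "syn00000000",  # placeholder ID for illustration
    "datasets": {"metabolomics": {"gx_enabled": True}, "team_info": {}},
}
print(validate_gx_config(example))
```

Keeping the "key present" check separate from the "valid ID" check mirrors the two conditions the README states: presence is unconditional, while a usable ID only matters once some dataset sets `gx_enabled: true`.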