Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reconcile validations with ingest-api #113

Open
1 task done
anayeaye opened this issue Feb 29, 2024 · 2 comments
Open
1 task done

Reconcile validations with ingest-api #113

anayeaye opened this issue Feb 29, 2024 · 2 comments
Assignees

Comments

@anayeaye
Copy link
Contributor

anayeaye commented Feb 29, 2024

What

The workflows API duplicates validations that are executed by the ingest API. We need to make sure these have the same effect and/or are removed as not needed in workflows.

Note this is just a reminder to confirm that we didn't migrate any of our legacy validation bugs to the new workflows API. Success could be as simple as confirming that the validators in the workflows API are functionally the same as the recently corrected validators in veda-backend/ingest-api

Moreover, the current workflows API schema base model does not include the renders or providers fields and will fail when run with those properties. Either these fields should be included in the workflows model or leave downstream schema validation to ingestion API.

AC

  • validations are deduplicated if needed, preferring to validate in ingest-api
@botanical botanical self-assigned this Mar 22, 2024
@botanical
Copy link
Member

I found one example of duplicated logic so far between the two repos (and will update this comment as I find more). The /dataset/publish endpoint seems to check if a collection exists twice.

Union[schemas.ZarrDataset, schemas.COGDataset] = Body(
..., discriminator="data_type"
),

and Dataset class has a validation to check if the collection exists because "we allow collection id to "break the rules" if an already-existing collection matches"

After the discover workflow kicks off an ingest workflow by calling the ingestion endpoint defined the veda-backend, veda-backend's enqueue_ingestion ’s parameter item uses schemas.AccessibleItem. This class has validators.collection_exists

Therefore, in veda-data-airflow and veda-backend, the check to validate that a collection exists is called twice.

TLDR

  • Duplication of collection exists check exists in /dataset/publish because it has a parameter dataset which has a type has inherits from the Dataset class which has a collection validator, and then the workflow kicks off an ingest workflow where similar logic is defined by the AccessibleItem in veda-backend

@botanical
Copy link
Member

Validators

veda-data-airflow veda-backend veda-stac-ingestor
check_dates() - https://github.com/NASA-IMPACT/veda-data-airflow/blob/dev/workflows_api/runtime/src/schema_helpers.py#L42-L50 check_dates() - https://github.com/NASA-IMPACT/veda-backend/blob/develop/ingest_api/runtime/src/schema_helpers.py#L42-L50 check_dates() -https://github.com/NASA-IMPACT/veda-stac-ingestor/blob/main/api/src/schema_helpers.py#L48-L56
check_extent() - https://github.com/NASA-IMPACT/veda-data-airflow/blob/dev/workflows_api/runtime/src/schema_helpers.py#L27-L39 check_extent() - https://github.com/NASA-IMPACT/veda-backend/blob/develop/ingest_api/runtime/src/schema_helpers.py#L28-L39 check_extent() - https://github.com/NASA-IMPACT/veda-stac-ingestor/blob/main/api/src/schema_helpers.py#L33-L45
object_is_accessible() - https://github.com/NASA-IMPACT/veda-data-airflow/blob/dev/workflows_api/runtime/src/schemas.py#L156-L164 no matching function object_is_accessible() - https://github.com/NASA-IMPACT/veda-stac-ingestor/blob/main/api/src/schemas.py#L266-L274
check_time_density() on Dataset class - https://github.com/NASA-IMPACT/veda-data-airflow/blob/dev/workflows_api/runtime/src/schemas.py#L205-L208 no matching function - check_time_density() on https://github.com/NASA-IMPACT/veda-stac-ingestor/blob/main/api/src/schemas.py#L72-L76 and check_time_density() on https://github.com/NASA-IMPACT/veda-stac-ingestor/blob/main/api/src/schemas.py#L308-L311
check_sample_files - https://github.com/NASA-IMPACT/veda-data-airflow/blob/dev/workflows_api/runtime/src/schemas.py#L223-L262 no matching function check_sample_files - https://github.com/NASA-IMPACT/veda-stac-ingestor/blob/main/api/src/schemas.py#L326-L365
is_accessible on href - https://github.com/NASA-IMPACT/veda-data-airflow/blob/dev/workflows_api/runtime/src/schemas.py#L20-L34 is_accessible on href - https://github.com/NASA-IMPACT/veda-backend/blob/develop/ingest_api/runtime/src/schemas.py#L22-L36 is_accessible on href - https://github.com/NASA-IMPACT/veda-stac-ingestor/blob/main/api/src/schemas.py#L30-L44
exists() collection for AccessibleItem - https://github.com/NASA-IMPACT/veda-data-airflow/blob/dev/workflows_api/runtime/src/schemas.py#L37-L43 exists() collection for AccessibleItem - https://github.com/NASA-IMPACT/veda-backend/blob/develop/ingest_api/runtime/src/schemas.py#L39-L45 exists() collection for AccessibleItem - https://github.com/NASA-IMPACT/veda-stac-ingestor/blob/main/api/src/schemas.py#L47-L53
exists() for WorkflowInputBase - https://github.com/NASA-IMPACT/veda-stac-ingestor/blob/main/api/src/schemas.py#L234-L252 no matching function exists() for WorkflowInputBase - https://github.com/NASA-IMPACT/veda-stac-ingestor/blob/main/api/src/schemas.py#L234-L252
check_id() for Dataset - https://github.com/NASA-IMPACT/veda-data-airflow/blob/dev/workflows_api/runtime/src/schemas.py#L181-L203 NOTE: Allows unconventional naming if collection already exists no matching function check_id() for Dataset - https://github.com/NASA-IMPACT/veda-stac-ingestor/blob/main/api/src/schemas.py#L234-L306 NOTE: Does not allow non-lowercase names
only_one_discover_item on ZarrDataset - https://github.com/NASA-IMPACT/veda-data-airflow/blob/dev/workflows_api/runtime/src/schemas.py#L265-L281 no matching function only_one_discover_item on ZarrDataset - https://github.com/NASA-IMPACT/veda-stac-ingestor/blob/main/api/src/schemas.py#L368-L384

cc @anayeaye @smohiudd

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants