Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

update output types of publishing assets, add cloud sensors #92

Merged
merged 37 commits into from
May 17, 2024

Conversation

yongrenjie
Copy link
Contributor

@yongrenjie yongrenjie commented May 16, 2024

WIP

Still to do:

  • metrics asset
  • metrics IO manager 🫠
    • okay this was surprisingly easy
  • the cloud sensors run every X seconds even for assets which haven't been re-materialised since the last run. Need to figure out how to stop that.
  • add validation inside the IO managers' handle_output section to make sure that the inputs have the right types, satisfy certain invariants, etc.

sgreenbury and others added 28 commits May 9, 2024 17:23
In particular:
 - TopLevelMetadata assets will return the actual classes instead of
   dataframes as this avoids unnecessary conversions between
   classes/dataframes (a common operation is extracting a field, e.g. an
   ID, in a downstream asset and while that can be done with
   df["field_name"].iloc[0] it's clenaer to just use class.field_name)
 - Geometry assets will return a list of (GeometryMetadata, gdf,
   names_df) because they cannot be separately partitioned

The new IO managers have been disabled in this commit because I have
introduced dependencies between some of the outputs. For example, the
SourceDataRelease requires the ID of the GeometryMetadata which is
constructed at runtime by the geometry asset.

Since our IO managers only allow for output and not input, it is
impossible to produce any assets that depend on those using the new IO
managers.

The way around this is to use the default IO managers for these assets
themselves, and then create NEW assets which depend on these assets and
exist solely to handle the output to local / Azure. Indeed, these 'new'
assets can just be cloud sensors (which leads us VERY nicely into that
bit of work which we were always going to do!)

NOTE: This commit will break the Azure IO managers. They will be fixed
soon.
…ent)

I'm very much unsure what happens if two IO managers attempt to use the
lease at the same time. To get the geometry sensor to work, I had to
lengthen the sensor interval to 60 seconds and also disable the metadata
sensor.
@yongrenjie yongrenjie changed the base branch from main to 75-io-managers-azure May 16, 2024 10:23
@yongrenjie yongrenjie requested a review from sgreenbury May 16, 2024 15:52
@yongrenjie yongrenjie changed the base branch from 75-io-managers-azure to main May 17, 2024 14:40
@sgreenbury sgreenbury mentioned this pull request May 17, 2024
4 tasks
@sgreenbury
Copy link
Collaborator

sgreenbury commented May 17, 2024

Noting on areas to be covered in other issues:

  • Currently test_cloud_outputs.py is commented out/skipped since the implementation has changed. This will be revised in new issue TBD
  • Update SAS token to use EnvVar (see dagster docs)
  • Refactoring the Belgium DAG with config expressed not in in an asset for needed datasets and the transformations for derived metrics
  • See above re. @yongrenjie comment on validation for IO managers input

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done:
Development

Successfully merging this pull request may close these issues.

2 participants