Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bmenasha/cai fix issue 900 #905

Merged

Conversation

bmenasha
Copy link
Contributor

Don't overide the timestamp field's type.

  • Closes asset inventory dataflow jobs failing with "The field specified for time partitioning can only be of type TIMESTAMP, DATE or DATETIME. The type found is: STRING." and "A hot key was detected in step" #900 by not overriding timestamp's type from timestamp to string as it's
    used as a table partition.

  • Performance improvements:

    • Don't perform a deep copy of the destination field within _merge_fields,
      it's not necessary as it's always a new object and Cloud Profiler says it
      was resource intensive.

    • Don't deep copy the 'data' field within ProduceResourceJson by removing it
      prior to the deep copy and adding it back to the element after.

  • Fix a bug where enforce_schema wouldn't work because it yieled the original
    elements, it needs to yeild a new element key pair.

  • Fix a bug when sharding asset types were we would delete the table while
    loading because not all shards were processed before deleting and loading
    occured. To fix this we now group by table name after WriteToGCS rather then
    shard key. Then assign back the sharded prior to loading.

@pull-request-size pull-request-size bot added the size/L Denotes a PR that changes 100-499 lines. label Oct 10, 2022
- Fix GoogleCloudPlatform#900 by not overriding timestamp's type from `timestamp` to string as it's
  used as a table partition.

- Performance improvements:

  - Don't perform a deep copy of the destination field within _merge_fields,
    it's not necessary as it's always a new object and Cloud Profiler says it
    was resource intensive.

  - Don't deep copy the 'data' field within ProduceResourceJson by removing it
    prior to the deep copy and adding it back to the element after.

- Fix a bug where enforce_schema wouldn't work because it yieled the original
  elements, it needs to yeild a new element key pair.

- Fix a bug when sharding asset types were we would delete the table while
  loading because not all shards were processed before deleting and loading
  occured. To fix this we now group by table name after WriteToGCS rather then
  shard key. Then assign back the sharded prior to loading.
@bmenasha
Copy link
Contributor Author

can someone review this? thanks

@bmenasha
Copy link
Contributor Author

can someone review this?
thanks

@bmenasha
Copy link
Contributor Author

There is another fix I wish to commit for issue #910 but they depend on this being merged.
Can someone please review this?
thanks
Ben

Copy link

@ojmendezz ojmendezz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@agold-rh agold-rh merged commit 670a3e5 into GoogleCloudPlatform:main Jan 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
size/L Denotes a PR that changes 100-499 lines.
Projects
None yet
3 participants