Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Autodetect AWS region during deltalake scan #3104

Merged
merged 7 commits into from
Oct 23, 2024

Conversation

kevinzwang
Copy link
Member

@kevinzwang kevinzwang commented Oct 22, 2024

We do this already when we read from S3, but delta-rs does not, so their metadata reads fail. This is especially an issue for unity catalog tables, where the region is not specified anywhere.

Tested locally, it works but I'm unsure how I would test this in unit tests since this is sort of unity+aws behavior specific

Fix for #2903

@github-actions github-actions bot added the bug Something isn't working label Oct 22, 2024
@kevinzwang kevinzwang requested a review from jaychia October 22, 2024 23:04
Copy link

codspeed-hq bot commented Oct 22, 2024

CodSpeed Performance Report

Merging #3104 will not alter performance

Comparing kevin/unity-aws-region (ff3444f) with main (459ba82)

Summary

✅ 17 untouched benchmarks

Copy link

codecov bot commented Oct 22, 2024

Codecov Report

Attention: Patch coverage is 55.55556% with 8 lines in your changes missing coverage. Please review.

Project coverage is 78.30%. Comparing base (7d600c2) to head (ff3444f).
Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
daft/delta_lake/delta_lake_scan.py 22.22% 7 Missing ⚠️
daft/io/aws_config.py 85.71% 1 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #3104      +/-   ##
==========================================
- Coverage   78.63%   78.30%   -0.34%     
==========================================
  Files         616      617       +1     
  Lines       73149    73437     +288     
==========================================
- Hits        57524    57505      -19     
- Misses      15625    15932     +307     
Files with missing lines Coverage Δ
daft/io/catalog.py 86.04% <100.00%> (-0.32%) ⬇️
daft/io/aws_config.py 85.71% <85.71%> (ø)
daft/delta_lake/delta_lake_scan.py 77.41% <22.22%> (-4.32%) ⬇️

... and 17 files with indirect coverage changes

try:
client = boto3_client_from_s3_config("s3", deltalake_sdk_io_config.s3)
response = client.get_bucket_location(Bucket=urlparse(table_uri).netloc)
except Exception:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Boto3 superclass

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, at least log if we're going to move ahead silently

client = boto3_client_from_s3_config("s3", deltalake_sdk_io_config.s3)
response = client.get_bucket_location(Bucket=urlparse(table_uri).netloc)
except BotoCoreError:
# If bucket location request fails, ignore error because S3 config may be correctly populated later from environment
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should log rather than failing silently. Check out logger.exception for the canonical way of doing this

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't really want to log anything here because it could be confusing to users. This is actually why I put this before the code that gets the config from the environment, because that part logs when it fails and I have seen users report this in the past when the actual root issue was elsewhere.

What if I log it as warning with a friendly message?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I logged an informative warning instead



def boto3_client_from_s3_config(service: str, s3_config: S3Config) -> "boto3.client":
import boto3
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have to amend our requirements to make boto3 a requirement when using deltalake then?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically you can use it with only Azure which is why I decided not to list it as an explicit requirement. Thoughts on that?

@kevinzwang kevinzwang enabled auto-merge (squash) October 23, 2024 01:13
@kevinzwang kevinzwang merged commit 0727dc1 into main Oct 23, 2024
39 of 40 checks passed
@kevinzwang kevinzwang deleted the kevin/unity-aws-region branch October 23, 2024 01:26
client = boto3_client_from_s3_config("s3", deltalake_sdk_io_config.s3)
response = client.get_bucket_location(Bucket=urlparse(table_uri).netloc)
except BotoCoreError as e:
logger.warning(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check out logger.exception for a more canonical way of doing this

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh shoot I totally read it as logger.error when you mentioned it before

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm actually it looks pretty similar. Is that something we'd want to do? I feel that I don't want it to be logged with error level.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exception is special, and gives a bunch of traceback which is useful.

Benefits of using logger.exception():
Automatic traceback logging: You don't need to manually format and log the traceback information.
Consistent logging format: The logging format is standardized, making it easier to read and analyze logs.
Error level: The message is logged at the ERROR level, which helps you identify and filter critical issues.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants