[BUG] Autodetect AWS region during deltalake scan #3104
Conversation
CodSpeed Performance Report: Merging #3104 will not alter performance.

Codecov Report — Attention: Patch coverage is
Additional details and impacted files:

@@ Coverage Diff @@
## main #3104 +/- ##
==========================================
- Coverage 78.63% 78.30% -0.34%
==========================================
Files 616 617 +1
Lines 73149 73437 +288
==========================================
- Hits 57524 57505 -19
- Misses 15625 15932 +307
daft/delta_lake/delta_lake_scan.py (outdated)
try:
    client = boto3_client_from_s3_config("s3", deltalake_sdk_io_config.s3)
    response = client.get_bucket_location(Bucket=urlparse(table_uri).netloc)
except Exception:
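As a hedged sketch of what this diff does: `get_bucket_location` returns a `LocationConstraint` field, and (as a quirk of the S3 API) a `None` constraint means the bucket is in `us-east-1`. The helper below is hypothetical, not Daft's actual code, but it shows the normalization step the autodetection needs.

```python
def region_from_bucket_location(response: dict) -> str:
    """Normalize a GetBucketLocation response into a usable region name.

    Hypothetical helper: the S3 API reports a null LocationConstraint
    for buckets in us-east-1, so we map that back explicitly.
    """
    constraint = response.get("LocationConstraint")
    return constraint if constraint else "us-east-1"
```

With a boto3 client this would be combined roughly as `region_from_bucket_location(client.get_bucket_location(Bucket=bucket))`.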
Catch the boto3 superclass (BotoCoreError) instead of a bare Exception.
Also, at least log if we're going to move ahead silently
daft/delta_lake/delta_lake_scan.py (outdated)
    client = boto3_client_from_s3_config("s3", deltalake_sdk_io_config.s3)
    response = client.get_bucket_location(Bucket=urlparse(table_uri).netloc)
except BotoCoreError:
    # If the bucket location request fails, ignore the error because the S3 config may be correctly populated later from the environment
We should log rather than failing silently. Check out logger.exception for the canonical way of doing this
I didn't really want to log anything here because it could be confusing to users. That's actually why I put this before the code that gets the config from the environment: that part logs when it fails, and I've seen users report that log in the past when the actual root issue was elsewhere.
What if I log it as a warning with a friendly message?
OK, I logged an informative warning instead.
def boto3_client_from_s3_config(service: str, s3_config: S3Config) -> "boto3.client":
    import boto3
Do we have to amend our requirements to make boto3 a requirement when using deltalake then?
Technically you can use the deltalake integration with only Azure, which is why I decided not to list boto3 as an explicit requirement. Thoughts on that?
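A common way to keep boto3 optional while still failing with a clear message: import it lazily inside the function and raise only when the S3 path is actually exercised. `require_boto3` is a hypothetical helper name for illustration, not Daft's actual API.

```python
def require_boto3():
    """Import boto3 lazily, raising a friendly error if it is missing.

    Keeps boto3 out of the hard requirements so Azure-only users are
    unaffected; the error only fires when the AWS code path runs.
    """
    try:
        import boto3
    except ImportError as e:
        raise ImportError(
            "boto3 is required for AWS region autodetection; "
            "install it with `pip install boto3`"
        ) from e
    return boto3
```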
    client = boto3_client_from_s3_config("s3", deltalake_sdk_io_config.s3)
    response = client.get_bucket_location(Bucket=urlparse(table_uri).netloc)
except BotoCoreError as e:
    logger.warning(
Check out logger.exception for a more canonical way of doing this
Oh shoot I totally read it as logger.error when you mentioned it before
Hm, actually it looks pretty similar. Is that something we'd want to do? I feel that I don't want this to be logged at the error level.
logger.exception is special: it also emits the traceback, which is useful.
Benefits of using logger.exception():
- Automatic traceback logging: you don't need to manually format and log the traceback information.
- Consistent logging format: the logging format is standardized, making logs easier to read and analyze.
- Error level: the message is logged at the ERROR level, which helps you identify and filter critical issues.
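A small runnable sketch of the trade-off being discussed: `logger.exception` attaches the active traceback automatically (at ERROR level), while `logger.warning(..., exc_info=True)` gets the same traceback at the milder WARNING severity preferred here.

```python
import io
import logging

# Capture log output in a string buffer so we can inspect it.
logger = logging.getLogger("demo")
logger.setLevel(logging.DEBUG)
stream = io.StringIO()
logger.addHandler(logging.StreamHandler(stream))

try:
    raise ValueError("bucket location lookup failed")
except ValueError:
    # ERROR level, traceback appended automatically.
    logger.exception("Could not autodetect AWS region")
    # WARNING level, same traceback via exc_info=True.
    logger.warning("Could not autodetect AWS region", exc_info=True)

output = stream.getvalue()
```

Both records in `output` include the full `Traceback (most recent call last)` block; only the severity differs.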
We do this already when we read from S3, but delta-rs does not, so its metadata reads fail. This is especially an issue for Unity Catalog tables, where the region is not specified anywhere.
Tested locally and it works, but I'm unsure how to cover this in unit tests, since this behavior is specific to Unity Catalog plus AWS.
Fix for #2903