Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

allow additional arguments to be passed to fsspec via FilesystemConfiguration #814

Closed
rudolfix opened this issue Dec 10, 2023 · 11 comments · Fixed by #869
Closed

allow additional arguments to be passed to fsspec via FilesystemConfiguration #814

rudolfix opened this issue Dec 10, 2023 · 11 comments · Fixed by #869
Assignees

Comments

@rudolfix
Copy link
Collaborator

rudolfix commented Dec 10, 2023

Background
Our users should be able to pass additional arguments to fsspec filesystems when they are created. A good use case is to enable custom SSL checks on s3:
fsspec/s3fs#342
requires additional argument use_ssl and additional element in client_kwargs (`verify')

Tasks

    • Extent FilesystemConfiguration so it accepts additional arguments and client arguments ie. via config fields accepting dictionaries
    • apply arguments in fsspec_from_config properly
    • Improve the destination factory for filesystem to accept those additional arguments nicely, write docstrings etc.
    • write tests that make sure that arguments are passed. ie. by providing arguments that break things in predictable way. ie. providing wrong certificate in example above
    • PR should go to devel branch
@rudolfix rudolfix moved this from Todo to Planned in dlt core library Dec 13, 2023
@Pipboyguy Pipboyguy pinned this issue Dec 13, 2023
@Pipboyguy Pipboyguy unpinned this issue Dec 13, 2023
@Pipboyguy Pipboyguy moved this from Planned to In Progress in dlt core library Dec 15, 2023
Pipboyguy added a commit to MooncoonDev/dlt that referenced this issue Dec 16, 2023
Signed-off-by: Marcel Coetzee <[email protected]>
Pipboyguy added a commit that referenced this issue Dec 17, 2023
Pipboyguy added a commit that referenced this issue Dec 18, 2023
@Pipboyguy
Copy link
Collaborator

@rudolfix Do we have a specific cert we want to use in our test suite? Does CI have some certificate installed I can use?

@rudolfix
Copy link
Collaborator Author

@Pipboyguy yeah I can create another account on AWS that requires certificate to connect. but maybe we could try this: generate a mock certificate, use it to connect and make sure that we get relevant SSL error which means that certificate is present

WDYT?

@Pipboyguy
Copy link
Collaborator

Pipboyguy commented Dec 18, 2023

With a self signed cert I guess we'd need a MinIO deployment? If we go with S3 we'd need CloudFront to handle the TLS connection.

@Pipboyguy yeah I can create another account on AWS that requires certificate to connect. but maybe we could try this: generate a mock certificate, use it to connect and make sure that we get relevant SSL error which means that certificate is present

WDYT?

@Pipboyguy
Copy link
Collaborator

Pipboyguy commented Dec 18, 2023

I'll start with a local MinIO container until further guidance. It's testing it in CI that I'm scratching my head over

@deanja
Copy link
Contributor

deanja commented Dec 19, 2023

Following this issue as I think it will help with implementing git/github fsspec dlt-hub/verified-sources#301 .

The initial case there is to pass org and repo as keyword arguments to filesystem() . Because:

  • it keeps the bucket_url and globbing very clean, ie github:// and path/to/**.md.
  • org and repo can be included in the fsspec url as github://org:repo@/path/to/file but currently dlt doesn't parse those correctly.

Pipboyguy added a commit that referenced this issue Dec 19, 2023
Signed-off-by: Marcel Coetzee <[email protected]>
Pipboyguy added a commit that referenced this issue Dec 19, 2023
Pipboyguy added a commit that referenced this issue Dec 19, 2023
Signed-off-by: Marcel Coetzee <[email protected]>
Pipboyguy added a commit that referenced this issue Dec 19, 2023
Signed-off-by: Marcel Coetzee <[email protected]>
Pipboyguy added a commit that referenced this issue Dec 19, 2023
@Pipboyguy
Copy link
Collaborator

This default was passed to the filesystem factory each time:

fs_kwargs["use_listings_cache"] = False # is this default necessary?

I kept it for now, but I wonder whether this shouldn't be an additional argument passed by the user? Should it remain a default configuration?

@deanja
Copy link
Contributor

deanja commented Dec 20, 2023

I looked at the repos at https://github.com/fsspec . It looks like many fsspec implementations - including s3fs, gdrivefs, ftp - do cache the listings (in a DirCache) and they accept fsspec's default of use_listings_cache = True.

I'm pretty new to dlt so I don't know why it is hard-coded to False.

My vote would be to leave the dlt default as False, but give user option to set to True,

This default was passed to the filesystem factory each time:

fs_kwargs["use_listings_cache"] = False # is this default necessary?

I kept it for now, but I wonder whether this shouldn't be an additional argument passed by the user? Should it remain a default configuration?

@deanja
Copy link
Contributor

deanja commented Dec 20, 2023

@Pipboyguy Is it in your scope to support the additional arguments in the fsspec_filesystem() function too? (in dlt/common/storages/fsspec_filesystem.py) . That function is how the fsspec factory gets called from the dlt filesystem verified source

@rudolfix
Copy link
Collaborator Author

@deanja the reason to get rid of the cache is that most of dlt operations are multiprocess or multithread so even with proper synchronization we were getting "wrong" listings due to cache. so now it is disabled by default.

I also say explicitly that I do not want the cache when doing ls but I'm not sure if just ls is impacted or it uses cache for exists or isfile etc. fsspec is amazing library but the docs are lacking and code is a little bit messy so I prefer to be super strict.

@deanja
Copy link
Contributor

deanja commented Dec 21, 2023

I agree. Caches are mostly frightening! fsspec cache is not accessed in most ETL use cases anyway as you:

  • `ls'
  • iterate once to get file details, and maybe extract content ( fsspec does not use cache for open)
  • done

@Pipboyguy
Copy link
Collaborator

Pipboyguy commented Dec 22, 2023

@Pipboyguy Is it in your scope to support the additional arguments in the fsspec_filesystem() function too? (in dlt/common/storages/fsspec_filesystem.py) . That function is how the fsspec factory gets called from the dlt filesystem verified source

@deanja One could argue its in this scope. This would be trivial:

def fsspec_filesystem(
    protocol: str,
    credentials: FileSystemCredentials = None,
    kwargs: Optional[DictStrAny] = None,
    client_kwargs: Optional[DictStrAny] = None,
) -> Tuple[AbstractFileSystem, str]:
    return fsspec_from_config(
        FilesystemConfiguration(
            protocol, credentials, kwargs=kwargs, client_kwargs=client_kwargs
        )
    )

Pipboyguy added a commit that referenced this issue Dec 22, 2023
Signed-off-by: Marcel Coetzee <[email protected]>
Pipboyguy added a commit that referenced this issue Dec 22, 2023
)

Refactored common storage configuration and revised imports in line with PEP rules. Also updated the string formatting in filesystem related tests and moved `test_s3_wrong_certificate` method to `test_filesystem_client.py`. There’s an addition of `kwargs` and `client_kwargs` parameters to `fsspec_filesystem` function allowing for a more configurable function.

Signed-off-by: Marcel Coetzee <[email protected]>
Pipboyguy added a commit that referenced this issue Dec 23, 2023
Signed-off-by: Marcel Coetzee <[email protected]>
Pipboyguy added a commit that referenced this issue Dec 23, 2023
Signed-off-by: Marcel Coetzee <[email protected]>
Pipboyguy added a commit that referenced this issue Dec 24, 2023
Pipboyguy added a commit that referenced this issue Dec 24, 2023
Pipboyguy added a commit that referenced this issue Dec 24, 2023
Updated the test case for S3 SSL certificate verification to use a self-signed certificate generated during the test. This change ensures the test is more reliable, as it doesn't depend on external, possibly changing, data. Improved the test setup by creating and using a temporary certificate file, which gets cleaned up after use.

Signed-off-by: Marcel Coetzee <[email protected]>
@Pipboyguy Pipboyguy linked a pull request Dec 24, 2023 that will close this issue
Pipboyguy added a commit that referenced this issue Dec 24, 2023
Pipboyguy added a commit that referenced this issue Dec 26, 2023
Pipboyguy added a commit that referenced this issue Dec 26, 2023
Pipboyguy added a commit that referenced this issue Dec 26, 2023
Pipboyguy added a commit that referenced this issue Dec 26, 2023
Pipboyguy added a commit that referenced this issue Dec 26, 2023
Signed-off-by: Marcel Coetzee <[email protected]>
Pipboyguy added a commit that referenced this issue Dec 28, 2023
Pipboyguy added a commit that referenced this issue Dec 29, 2023
Signed-off-by: Marcel Coetzee <[email protected]>
Pipboyguy added a commit that referenced this issue Jan 1, 2024
Signed-off-by: Marcel Coetzee <[email protected]>
Pipboyguy added a commit that referenced this issue Jan 1, 2024
Updated the filesystem.md documentation to improve clarity,
fix grammar mistakes, and ensure better readability.
The changes include breaking up longer sentences,
correcting spellings and punctuations. (#814)

Signed-off-by: Marcel Coetzee <[email protected]>
Pipboyguy added a commit that referenced this issue Jan 1, 2024
@Pipboyguy Pipboyguy linked a pull request Jan 1, 2024 that will close this issue
Pipboyguy added a commit that referenced this issue Jan 2, 2024
Pipboyguy added a commit that referenced this issue Jan 2, 2024
Pipboyguy added a commit that referenced this issue Jan 2, 2024
Pipboyguy added a commit that referenced this issue Jan 3, 2024
This commit updates the handling of filesystem and credential configuration for different cloud storage protocols.

Signed-off-by: Marcel Coetzee <[email protected]>
Pipboyguy added a commit that referenced this issue Jan 3, 2024
Signed-off-by: Marcel Coetzee <[email protected]>
Pipboyguy added a commit that referenced this issue Jan 4, 2024
@Pipboyguy Pipboyguy moved this from In Progress to Done in dlt core library Jan 6, 2024
@sh-rp sh-rp closed this as completed in #869 Jan 8, 2024
Pipboyguy added a commit that referenced this issue Jan 9, 2024
Pipboyguy added a commit that referenced this issue Jan 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

3 participants