Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add git to filesystem b 301 #350

Draft
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

deanja
Copy link
Contributor

@deanja deanja commented Feb 9, 2024

Tell us what you do here

  • implementing verified source (please link a relevant issue labeled as verified source)
  • fixing a bug (please link a relevant bug report)
  • [x ] improving, documenting, or customizing an existing source (please link an issue or describe below)
  • anything else (please link an issue or describe below)

Relevant issue

issue #301

More PR info

@rudolfix please have a quick look and let me know next steps.

CICD issues

I might need assistance with CICD issues on local/github:

  • mypy is failing on AssertionError: Cannot find component 'trace' for 'opentelemetry.trace'
    not sure what to do with lock file.
  • pytest gives non-fatal error: airflow.logging_config:logging_config.py:71 Unable to load the config, contains a configuration error.
  • not sure how to handle poetry.lock. Poetry says to update but I am wary of committing changes.

Probably due to me working across both dlt repos, which I haven't done before.

Related work: (separate github issues?):

  • Example pipeline. Provide an example of loading doco files (eg docs/**.md) to duck db? The existing filesystem examples are about loading row-based data from csv etc, which is not so relevant to git. The example could read from verified-sources repo itself, or refactor the new git repo test fixture to provide a temporary git repo on each example run.

  • Update sources/filesystem/[README.md](http://readme.md. Ideas:

    • add documentation about passing kwargs to fsspec implementations.
    • List of "supported" fsspec implementations. +gitpythonfs
    • General mention of compatibility with any fsspec implementation. And steps to request it be added.

@rudolfix rudolfix self-requested a review February 12, 2024 13:05
@rudolfix rudolfix added the ci from fork Allows to run tests from PR coming from fork label Feb 12, 2024
Copy link
Contributor

@rudolfix rudolfix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tests (including dynamic test repo) are amazing. I have two bigger asks

  • support full configs in FileDict (change in the core)
  • name the tests with ids

tests/filesystem/conftest.py Show resolved Hide resolved
)

# NOTE: `git init -b` option requires git 2.28 or later.
subprocess.call(f"git init -b {safe_prefix}master", shell=True, cwd=repo_path)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is that somehow more convenient than using gitpython that we have in deps?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. For gitpython way, do you mean:

  1. just using git.<command>(...) instead of the above subprocess.call(). Like this https://gitpython.readthedocs.io/en/stable/tutorial.html#using-git-directly ?
  2. Or deeper in the gitpython API? Oh, I think it's read-only.

For me subprocess.call() is easier to read because it looks like git on the command line. I don't want the reader to have to learn a new API to understand the test data. Also it's how I saw test repo done in other fsspec implementations.

The code lines did get kind of long though ☹️ . And gitpython would let other settings be made if needed.

I am happy to try the gitpython way. Let me know.


files_chunk: List[FileItem] = []
for file_model in glob_files(fs_client, bucket_url, file_glob):
file_dict = FileItemDict(file_model, credentials)
file_dict = FileItemDict(file_model, fs_client)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is way better if we pass credentials here. it often happens that file items are processed in different threads and I'd avoid sharing a single instance of filesystem. if you need to pass additional kwargs I think we have support in credentials for those. so you can just set kwargs to credentials

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok now I see: we should not pass credentials to FileItemDict, we should pass
FilesystemConfiguration(protocol, credentials, kwargs=kwargs, client_kwargs=client_kwargs)

so that would require changing your branch to the core and in the FileDict implementation change:

@property
    def fsspec(self) -> AbstractFileSystem:

to use: fsspec_from_config

looks like big improvement: we can handle parametrized filesystems now

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a great idea. I'll try it

Copy link
Contributor Author

@deanja deanja Feb 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it often happens that file items are processed in different threads and I'd avoid sharing a single instance of filesystem.

@rudolfix Note that fsspec instances are cacheable by default. (It is separate to the dir listings cache). Is that a problem , or will each thread get it's own cache?

It can be disabled by passing skip_instance_cache=True on instantiation. I can disable the cache in this PR, but it if needs testing I suggest a separate github issue.

Copy link
Contributor Author

@deanja deanja Feb 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have got this working. A question though, about the imports in verified-sources sources/filesystem/__init__.py

from dlt.sources.filesystem import FileItem, FileItemDict, fsspec_filesystem, glob_files
from dlt.sources.credentials import FileSystemCredentials
from dlt.common.storages.fsspec_filesystem import fsspec_from_config
from dlt.common.storages.configuration import FilesystemConfiguration

I added the last two lines. But should I be importing them via dlt.sources.<some_module> instead?

What is the reason for dlt.sources.<some_module> ? Is it some kind of wrapper to expose code to dlt Sources?

Also I no longer need to import fsspec_filesystem (line 2) . Do I remove it in dlt.sources.filesystem too?

tests/filesystem/settings.py Show resolved Hide resolved
tests/filesystem/test_filesystem.py Outdated Show resolved Hide resolved
tests/filesystem/test_filesystem.py Show resolved Hide resolved
deanja and others added 3 commits February 16, 2024 09:00
- also introduces support for passing keyword argument to fsspec.  gitpythonfs is the first use case where the kwargs are needed.
- requires future version of dlt that supports gitpython and dynamic registration of fssepc implementations.
- requires manual creation of a git repo for testing gitpythonfs
- if kwargs are used to construct filesystem instances then FileItemDict must be instantiated with an existing instance of AbstractFileSystem.  Instantiating FileItemDict with FileSystemCredentials will omit the fs kwargs and cause unexpected behaviour.
- objects yielded by @dlt.resource can run in different threads
- note this does not address possible bug with filesystem instances being cached by fsspec itself.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ci from fork Allows to run tests from PR coming from fork
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants