-
Notifications
You must be signed in to change notification settings - Fork 50
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add git to filesystem b 301 #350
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tests (including dynamic test repo) are amazing. I have two bigger asks
- support full configs in FileDict (change in the core)
- name the tests with
ids
) | ||
|
||
# NOTE: `git init -b` option requires git 2.28 or later. | ||
subprocess.call(f"git init -b {safe_prefix}master", shell=True, cwd=repo_path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is that somehow more convenient than using gitpython
that we have in deps?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question. For gitpython way, do you mean:
- just using
git.<command>(...)
instead of the abovesubprocess.call()
. Like this https://gitpython.readthedocs.io/en/stable/tutorial.html#using-git-directly ? - Or deeper in the gitpython API? Oh, I think it's read-only.
For me subprocess.call()
is easier to read because it looks like git on the command line. I don't want the reader to have to learn a new API to understand the test data. Also it's how I saw test repo done in other fsspec implementations.
The code lines did get kind of long though
I am happy to try the gitpython way. Let me know.
sources/filesystem/__init__.py
Outdated
|
||
files_chunk: List[FileItem] = [] | ||
for file_model in glob_files(fs_client, bucket_url, file_glob): | ||
file_dict = FileItemDict(file_model, credentials) | ||
file_dict = FileItemDict(file_model, fs_client) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is way better if we pass credentials
here. it often happens that file items are processed in different threads and I'd avoid sharing a single instance of filesystem. if you need to pass additional kwargs I think we have support in credentials for those. so you can just set kwargs
to credentials
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok now I see: we should not pass credentials
to FileItemDict
, we should pass
FilesystemConfiguration(protocol, credentials, kwargs=kwargs, client_kwargs=client_kwargs)
so that would require changing your branch to the core and in the FileDict
implementation change:
@property
def fsspec(self) -> AbstractFileSystem:
to use: fsspec_from_config
looks like big improvement: we can handle parametrized filesystems now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like a great idea. I'll try it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it often happens that file items are processed in different threads and I'd avoid sharing a single instance of filesystem.
@rudolfix Note that fsspec instances are cacheable by default. (It is separate to the dir listings cache). Is that a problem , or will each thread get it's own cache?
It can be disabled by passing skip_instance_cache=True
on instantiation. I can disable the cache in this PR, but it if needs testing I suggest a separate github issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have got this working. A question though, about the imports in verified-sources sources/filesystem/__init__.py
from dlt.sources.filesystem import FileItem, FileItemDict, fsspec_filesystem, glob_files
from dlt.sources.credentials import FileSystemCredentials
from dlt.common.storages.fsspec_filesystem import fsspec_from_config
from dlt.common.storages.configuration import FilesystemConfiguration
I added the last two lines. But should I be importing them via dlt.sources.<some_module>
instead?
What is the reason for dlt.sources.<some_module>
? Is it some kind of wrapper to expose code to dlt Sources?
Also I no longer need to import fsspec_filesystem
(line 2) . Do I remove it in dlt.sources.filesystem too?
- also introduces support for passing keyword argument to fsspec. gitpythonfs is the first use case where the kwargs are needed. - requires future version of dlt that supports gitpython and dynamic registration of fssepc implementations. - requires manual creation of a git repo for testing gitpythonfs - if kwargs are used to construct filesystem instances then FileItemDict must be instantiated with an existing instance of AbstractFileSystem. Instantiating FileItemDict with FileSystemCredentials will omit the fs kwargs and cause unexpected behaviour.
a31cbbf
to
0fff16e
Compare
0fff16e
to
fbbec91
Compare
- objects yielded by @dlt.resource can run in different threads - note this does not address possible bug with filesystem instances being cached by fsspec itself.
Tell us what you do here
verified source
)Relevant issue
issue #301
More PR info
@rudolfix please have a quick look and let me know next steps.
CICD issues
I might need assistance with CICD issues on local/github:
not sure what to do with lock file.
Probably due to me working across both dlt repos, which I haven't done before.
Related work: (separate github issues?):
Example pipeline. Provide an example of loading doco files (eg
docs/**.md
) to duck db? The existing filesystem examples are about loading row-based data from csv etc, which is not so relevant to git. The example could read from verified-sources repo itself, or refactor the new git repo test fixture to provide a temporary git repo on each example run.Update
sources/filesystem/[README.md](http://readme.md
. Ideas: