
[WIP] use dataset.map in pipeline #179

Open
ArneBinder wants to merge 9 commits into main from map_dataset_in_pipeline

Conversation

ArneBinder (Owner) commented on May 7, 2022:

If a documents argument of type Dataset is passed to the pipeline, documents.map is used to add the predictions. In this case, a Dataset is returned instead of a Sequence[Document].
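A minimal sketch of that dispatch (the Pipeline shape and the _process_documents helper here are assumptions for illustration, not the actual implementation):

```python
from typing import Sequence, Union

from datasets import Dataset


class Pipeline:
    def _process_documents(self, documents):
        # hypothetical helper: run the model and attach predictions
        ...

    def __call__(self, documents: Union[Sequence, Dataset], **kwargs):
        if isinstance(documents, Dataset):
            # let datasets drive the processing; a Dataset comes back out
            return documents.map(self._process_documents, batched=True)
        # previous behavior: a plain sequence of processed documents
        return self._process_documents(documents)
```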

Note: Builds on top of #178, which should be reviewed first.

tests_no_local_datasets is passing.

EDIT: For now, an error is thrown when datasets attempts to cache pipeline results; see ee2d35f.
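One way such a guard could look (a sketch only, not necessarily what ee2d35f does):

```python
import datasets


def assert_caching_disabled() -> None:
    # fail fast instead of letting datasets write, and later silently
    # reuse, cached pipeline results under an unreliable fingerprint
    if datasets.is_caching_enabled():
        raise RuntimeError(
            "Caching of pipeline results is not supported; "
            "call datasets.disable_caching() first."
        )
```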

ArneBinder requested a review from ChristophAlt on May 7, 2022
ArneBinder force-pushed the map_dataset_in_pipeline branch from 2f72673 to a983f7c on May 7, 2022
ChristophAlt (Collaborator) commented:

I'm sorry, but as discussed yesterday I'm not very happy with that. As far as I'm concerned, this adds very little in terms of functionality but a lot of unknown factors regarding HF's dataset caching internals.

ArneBinder force-pushed the map_dataset_in_pipeline branch from ee2d35f to e8d0d1d on May 9, 2022
ArneBinder (Owner, Author) commented on May 9, 2022:

> I'm sorry, but as discussed yesterday I'm not very happy with that. As far as I'm concerned, this adds very little in terms of functionality but a lot of unknown factors regarding HF's dataset caching internals.

I will respond soon.

EDIT: Finally, I got some time to respond.

The points I had in mind are:

  1. Caching and its problems: For now, I used datasets.disable_caching() to disable the caching functionality. But I think that should be left up to the user via a parameter allow_caching, with False as the default (see the first sketch after this list). However, as stated here, the fingerprinting is based on the pickled function, and if that is not pickleable, the function is re-executed each time (FYI, I just added a test that fails if the fingerprint is the same when processing the same input twice with the same pipeline, but I'm not sure what we can infer from that). In general, I think the caching logic of datasets has improved a lot; it is not comparable to the early days. If we do not trust it, we should also ask whether we really want to allow caching for any documents.map call, or for the PIE dataset creation, at all. I don't see why errors in those processes would be easier to detect than here.
  2. Iterable datasets: With this PR, they should work out of the box, also in a chained pipeline setup (see the second sketch below).
  3. Consistency & user expectations: Why do we allow passing a dataset to the pipeline in the first place? If we allow this, I think the user expects to get a dataset back.
  4. Postprocessing: The user can take advantage of HF datasets and use dataset.map etc. for their postprocessing logic (see the second sketch below).
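For point 1, a sketch of how an allow_caching parameter could be wired in (the free function and parameter name are assumptions; in the PR this logic would live in the pipeline itself):

```python
from datasets import Dataset


def run_pipeline(pipeline, documents, allow_caching: bool = False):
    if isinstance(documents, Dataset):
        # with allow_caching=False, load_from_cache_file=False makes datasets
        # re-run the function instead of reusing a previously cached result
        return documents.map(
            pipeline._process_documents,
            batched=True,
            load_from_cache_file=allow_caching,
        )
    # non-Dataset input keeps the previous Sequence[Document] behavior
    return pipeline(documents)
```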

As usual, this is just my impression of the issue, which may stem from a lack of knowledge :)
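And for points 2 and 4, a small example of what getting a Dataset back enables downstream (toy data; the column names are made up):

```python
from datasets import Dataset

# stand-in for a Dataset returned by the pipeline
ds = Dataset.from_dict({"text": ["a", "b"], "score": [0.2, 0.9]})

# point 4: arbitrary datasets post-processing can be chained directly
ds = ds.map(lambda ex: {"label": int(ex["score"] > 0.5)})
positives = ds.filter(lambda ex: ex["label"] == 1)

# point 2: the same map-based flow runs lazily on iterable datasets,
# which are never cached in the first place
for ex in ds.to_iterable_dataset().map(lambda ex: {"text": ex["text"].upper()}):
    print(ex)
```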

ArneBinder changed the title from "use dataset.map in pipeline" to "[WIP] use dataset.map in pipeline" on Jul 18, 2022
ArneBinder force-pushed the map_dataset_in_pipeline branch from 1a65c49 to ec3a3ea on February 14, 2023
ArneBinder force-pushed the map_dataset_in_pipeline branch from 4259649 to 6d63f08 on September 17, 2023