
[WIP] use dataset.map in pipeline #179

Open
ArneBinder wants to merge 9 commits into main from map_dataset_in_pipeline

Conversation

ArneBinder (Owner) commented on May 7, 2022:

If a documents argument of type Dataset is passed to the pipeline, documents.map is used to add the predictions. In this case, a Dataset is returned instead of a Sequence[Document].
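A minimal sketch of that dispatch (the Pipeline shape and the _process_documents helper here are assumptions for illustration, not the actual implementation):

```python
from typing import Sequence, Union

from datasets import Dataset


class Pipeline:
    def _process_documents(self, documents):
        # hypothetical helper: run the model and attach predictions
        ...

    def __call__(self, documents: Union[Sequence, Dataset], **kwargs):
        if isinstance(documents, Dataset):
            # let datasets drive the processing; a Dataset comes back out
            return documents.map(self._process_documents, batched=True)
        # previous behavior: a plain sequence of processed documents
        return self._process_documents(documents)
```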

Note: Builds on top of #178, which should be reviewed first.

tests_no_local_datasets is passing.

EDIT: For now, an error is thrown when datasets attempts to cache pipeline results; see ee2d35f.
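One way such a guard could look (a sketch only, not necessarily what ee2d35f does):

```python
import datasets


def assert_caching_disabled() -> None:
    # fail fast instead of letting datasets write, and later silently
    # reuse, cached pipeline results under an unreliable fingerprint
    if datasets.is_caching_enabled():
        raise RuntimeError(
            "Caching of pipeline results is not supported; "
            "call datasets.disable_caching() first."
        )
```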

ArneBinder requested a review from ChristophAlt on May 7, 2022
ArneBinder force-pushed the map_dataset_in_pipeline branch from 2f72673 to a983f7c on May 7, 2022
ChristophAlt (Collaborator) commented:

I'm sorry, but as discussed yesterday I'm not very happy with that. As far as I'm concerned, this adds very little in terms of functionality but a lot of unknown factors regarding HF's dataset caching internals.

ArneBinder force-pushed the map_dataset_in_pipeline branch from ee2d35f to e8d0d1d on May 9, 2022
ArneBinder (Owner, Author) commented on May 9, 2022:

> I'm sorry, but as discussed yesterday I'm not very happy with that. As far as I'm concerned, this adds very little in terms of functionality but a lot of unknown factors regarding HF's dataset caching internals.

I will respond soon.

EDIT: Finally, I got some time to respond.

The points I had in mind are:

  1. Caching and its problems: For now, I used datasets.disable_caching() to disable the caching functionality. But I think that should be left up to the user via a parameter allow_caching, with False as the default (see the first sketch after this list). However, as stated here, the fingerprinting is based on the pickled function, and if that is not pickleable, the function is re-executed each time (FYI, I just added a test that fails if the fingerprint is the same when processing the same input twice with the same pipeline, but I'm not sure what we can infer from that). In general, I think the caching logic of datasets has improved a lot; it is not comparable to the early days. If we do not trust it, we should also ask whether we really want to allow caching for any documents.map call, or for the PIE dataset creation, at all. I don't see why errors in those processes would be easier to detect than here.
  2. Iterable datasets: With this PR, they should work out of the box, also in a chained pipeline setup (see the second sketch below).
  3. Consistency & user expectations: Why do we allow passing a dataset to the pipeline in the first place? If we allow this, I think the user expects to get a dataset back.
  4. Postprocessing: The user can take advantage of HF datasets and use dataset.map etc. for their postprocessing logic (see the second sketch below).
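For point 1, a sketch of how an allow_caching parameter could be wired in (the free function and parameter name are assumptions; in the PR this logic would live in the pipeline itself):

```python
from datasets import Dataset


def run_pipeline(pipeline, documents, allow_caching: bool = False):
    if isinstance(documents, Dataset):
        # with allow_caching=False, load_from_cache_file=False makes datasets
        # re-run the function instead of reusing a previously cached result
        return documents.map(
            pipeline._process_documents,
            batched=True,
            load_from_cache_file=allow_caching,
        )
    # non-Dataset input keeps the previous Sequence[Document] behavior
    return pipeline(documents)
```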

As usual, this is just my impression of the issue, which may stem from a lack of knowledge :)
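And for points 2 and 4, a small example of what getting a Dataset back enables downstream (toy data; the column names are made up):

```python
from datasets import Dataset

# stand-in for a Dataset returned by the pipeline
ds = Dataset.from_dict({"text": ["a", "b"], "score": [0.2, 0.9]})

# point 4: arbitrary datasets post-processing can be chained directly
ds = ds.map(lambda ex: {"label": int(ex["score"] > 0.5)})
positives = ds.filter(lambda ex: ex["label"] == 1)

# point 2: the same map-based flow runs lazily on iterable datasets,
# which are never cached in the first place
for ex in ds.to_iterable_dataset().map(lambda ex: {"text": ex["text"].upper()}):
    print(ex)
```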

ArneBinder changed the title from "use dataset.map in pipeline" to "[WIP] use dataset.map in pipeline" on Jul 18, 2022
ArneBinder force-pushed the map_dataset_in_pipeline branch from 1a65c49 to ec3a3ea on February 14, 2023
ArneBinder force-pushed the map_dataset_in_pipeline branch from 4259649 to 6d63f08 on September 17, 2023