Improving the dataset preprocessing #5

plonerma · 2024-11-13T16:44:34Z

We discussed improving how the dataset is preprocessed. I added a test case to ensure that the original dataset is not modified (a modified copy is returned instead) and started improving some minor points.

There are some open points that should be addressed in the future:

The DataCleaner should be EITHER stateless (i.e. it is configured in the init and preprocessing a dataset does not change any attributes) OR stateful (i.e. it should only be invoked once with a dataset).
We need to change the order of the processing steps, as string labels do not work at the moment.
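To make the stateless option concrete, here is a minimal sketch. It is not the project's DataCleaner: the class name `StatelessCleaner`, its fields, and the list-of-dicts dataset format are all assumptions made for illustration. The configuration is fixed in the init, and preprocessing returns a new object without touching the instance or the input.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)  # frozen: assigning to an attribute after init raises
class StatelessCleaner:
    """Hypothetical stand-in for illustration, not the real DataCleaner."""

    text_column: str = "text"
    remove_empty: bool = True

    def prepare_dataset(self, rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
        # Build new row dicts instead of editing the input rows in place.
        cleaned = [
            dict(row, **{self.text_column: row[self.text_column].strip()})
            for row in rows
        ]
        if self.remove_empty:
            cleaned = [row for row in cleaned if row[self.text_column]]
        return cleaned


original = [{"text": " good ", "label": "pos"}, {"text": "   ", "label": "neg"}]
snapshot = [dict(row) for row in original]  # copy used to detect mutation

cleaner = StatelessCleaner()
first = cleaner.prepare_dataset(original)
second = cleaner.prepare_dataset(original)  # safe to call repeatedly
```

Because nothing is stored on the instance, calling `prepare_dataset` twice gives the same result, and comparing `original` with `snapshot` is the kind of check the added test case performs.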

lukasgarbas

Did some refactoring. The DataCleaner doesn't change its instance attributes or the original dataset. prepare_dataset(dataset) now returns texts, labels, and the task category.

Changed the order of data preprocessing steps. String labels are now converted to integers when creating the label map.
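As a rough illustration of converting string labels while creating the label map, the sketch below shows one way it could be done. The function names `build_label_map` and `encode_labels` are hypothetical, and the sorted ordering is an assumption chosen to keep the mapping deterministic; the merged code may differ.

```python
from typing import Dict, List, Tuple


def build_label_map(labels: List[str]) -> Dict[str, int]:
    # Sort the unique labels so the mapping is the same on every run.
    return {label: index for index, label in enumerate(sorted(set(labels)))}


def encode_labels(labels: List[str]) -> Tuple[List[int], Dict[str, int]]:
    # Create the map first, then use it to convert every label to an integer.
    label_map = build_label_map(labels)
    return [label_map[label] for label in labels], label_map


encoded, label_map = encode_labels(["pos", "neg", "pos", "neutral"])
print(label_map)  # {'neg': 0, 'neutral': 1, 'pos': 2}
print(encoded)    # [2, 0, 2, 1]
```

Building the map before any step that expects integer labels is what makes the ordering of the preprocessing steps matter.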

plonerma added 3 commits November 13, 2024 16:42

Fixed two tiny things

6d22002

Added additional test-case for dataset preprocessing

136de72

Made some steps in the datacleaner more explicit

aad7d7f

plonerma requested a review from lukasgarbas November 13, 2024 16:44

lukasgarbas added 4 commits November 22, 2024 15:22

Merge branch 'main' into improving_data_preprocessing

67b5f14

Refactor datacleaner

28938a1

Remove dataset wrapper

9e5651f

Reorder preprocessing steps

ca6c286

lukasgarbas reviewed Nov 30, 2024

lukasgarbas marked this pull request as ready for review November 30, 2024 03:30

lukasgarbas merged commit 1b416b3 into main Nov 30, 2024
1 check passed
