add document converters #332

ArneBinder · 2023-09-08T01:40:38Z

This PR moves the main logic for document preparation, i.e. converting the documents of a loaded dataset to a format that a taskmodule understands, to the dataset builder scripts. As a result, that logic

- is easily shareable via the Hugging Face Hub, and
- allows for document auto-conversion.

With this PR, document converters can be added to dataset builders either via the class variable DOCUMENT_CONVERTERS of the builder or dynamically via the parameter document_converters when calling load_dataset(). In addition, Dataset, IterableDataset, and DatasetDict have these new methods:

- register_document_converter(converter, document_type=None): register a new converter for a target document type. If the document type is not provided, we try to infer it from the return type of the converter.
- to_document_type(document_type): convert to the document type by using the registered converters.
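To illustrate the inference of the document type from the converter return type, here is a minimal, self-contained sketch. It does not use the real pytorch-ie classes: Document, TextDocumentWithEntities, ConverterRegistry and add_entities are hypothetical stand-ins, and the actual implementation may differ in its details.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Type, get_type_hints


@dataclass
class Document:
    """Hypothetical stand-in for a document base class."""

    text: str


@dataclass
class TextDocumentWithEntities(Document):
    """Hypothetical target document type, used only in this sketch."""

    entities: tuple = ()


class ConverterRegistry:
    """Toy registry (an assumption for illustration, not the real Dataset class)."""

    def __init__(self) -> None:
        self.document_converters: Dict[Type[Document], Callable] = {}

    def register_document_converter(
        self, converter: Callable, document_type: Optional[Type[Document]] = None
    ) -> None:
        if document_type is None:
            # No explicit target type: fall back to the return annotation.
            return_type = get_type_hints(converter).get("return")
            if not (isinstance(return_type, type) and issubclass(return_type, Document)):
                raise TypeError(
                    "cannot infer document_type from the converter return annotation"
                )
            document_type = return_type
        self.document_converters[document_type] = converter


def add_entities(doc: Document) -> TextDocumentWithEntities:
    # A real converter would compute the entities; this one leaves them empty.
    return TextDocumentWithEntities(text=doc.text, entities=())


registry = ConverterRegistry()
# The target type is taken from the "-> TextDocumentWithEntities" annotation.
registry.register_document_converter(add_entities)
```

A converter without a usable return annotation (for instance a lambda) would have to be registered with an explicit document_type in this sketch.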

If the converter is a function, it will be applied via map() to all documents. If it is a dict, it is interpreted as a field name mapping and we perform a simple cast with it (if you just need to rename fields, this is much more efficient). If no converter is registered for the requested document type but one is registered for a superclass of it, we use that one and cast to the requested document type afterwards. If no valid converter is found at all, we simply cast to the requested document type.
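The resolution order described above can be sketched in plain Python. This is only an illustration under assumptions: the document classes, the cast() helper and the free function to_document_type() are hypothetical, a list comprehension stands in for map(), and the superclass lookup walks the MRO of the requested type, which may not match the real lookup exactly.

```python
from dataclasses import dataclass, fields
from typing import Callable, Dict, List, Optional, Type, Union


@dataclass
class Document:
    text: str


@dataclass
class DocWithEntities(Document):
    entities: tuple = ()


@dataclass
class DocWithEntitiesAndSentences(DocWithEntities):
    sentences: tuple = ()


@dataclass
class RawDoc:
    """Hypothetical source format with different field names."""

    content: str
    spans: tuple = ()


Converter = Union[Callable, Dict[str, str]]


def cast(doc, document_type: Type, field_mapping: Optional[Dict[str, str]] = None):
    """Copy fields into document_type, renaming them via field_mapping.

    Fields unknown to the target are dropped; missing ones keep their defaults.
    """
    mapping = field_mapping or {}
    values = {mapping.get(f.name, f.name): getattr(doc, f.name) for f in fields(doc)}
    target_names = {f.name for f in fields(document_type)}
    return document_type(**{k: v for k, v in values.items() if k in target_names})


def to_document_type(docs: List, converters: Dict[Type, Converter], document_type: Type) -> List:
    # Look for a converter for the requested type, then for its superclasses.
    registered = next((t for t in document_type.__mro__ if t in converters), None)
    if registered is None:
        # No valid converter at all: simply cast.
        return [cast(d, document_type) for d in docs]
    converter = converters[registered]
    if callable(converter):
        converted = [converter(d) for d in docs]  # stands in for map()
    else:
        # A dict is a field name mapping: cheap rename via cast.
        converted = [cast(d, registered, converter) for d in docs]
    if registered is not document_type:
        # The converter targets a superclass: cast to the requested type afterwards.
        converted = [cast(d, document_type) for d in converted]
    return converted


converters = {DocWithEntities: {"content": "text", "spans": "entities"}}
docs = [RawDoc(content="Jane lives in Berlin", spans=((0, 4),))]
result = to_document_type(docs, converters, DocWithEntitiesAndSentences)
```

Here the only registered converter targets DocWithEntities, so the documents are first renamed into that type and then cast to DocWithEntitiesAndSentences, with sentences left at its default.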

In addition, TaskModules can now define a class variable DOCUMENT_TYPE (default is None) to signal what document type they require. This value should only be accessed via the taskmodule.document_type property. By default, this just returns the value of DOCUMENT_TYPE, but it can be overridden to implement special variants, e.g. if we want to use sentence partitions, the taskmodule should return TextDocumentWith...AndSentences instead of just TextDocumentWith....
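A minimal sketch of this pattern follows. The classes are simplified stand-ins, and the names TextDocumentWithEntities, TextDocumentWithEntitiesAndSentences, MyNerTaskModule and use_sentence_partitions are hypothetical, chosen only to make the example concrete.

```python
from typing import Optional, Type


class Document:
    """Hypothetical document base class."""


class TextDocumentWithEntities(Document):
    """Hypothetical document type."""


class TextDocumentWithEntitiesAndSentences(TextDocumentWithEntities):
    """Hypothetical variant that also carries sentence partitions."""


class TaskModule:
    # Subclasses set this to signal which document type they require.
    DOCUMENT_TYPE: Optional[Type[Document]] = None

    @property
    def document_type(self) -> Optional[Type[Document]]:
        # Default behaviour: just return the class variable.
        return self.DOCUMENT_TYPE


class MyNerTaskModule(TaskModule):
    DOCUMENT_TYPE = TextDocumentWithEntities

    def __init__(self, use_sentence_partitions: bool = False) -> None:
        self.use_sentence_partitions = use_sentence_partitions

    @property
    def document_type(self) -> Optional[Type[Document]]:
        # Special variant: require sentences only if they are actually used.
        if self.use_sentence_partitions:
            return TextDocumentWithEntitiesAndSentences
        return self.DOCUMENT_TYPE
```

Callers always read taskmodule.document_type, so they get the right type regardless of how the taskmodule was configured.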

The combination of dataset.to_document_type() and taskmodule.document_type allows for auto-conversion in most cases. However, this is fully optional and therefore backwards compatible.

Usage in downstream projects (e.g. created from pytorch-ie-hydra-template):

- in train.py, evaluate.py and predict.py: call dataset = dataset.to_document_type(taskmodule.document_type) if taskmodule.document_type is defined and dataset.document_type is not already a subclass of it
- in addition, for dataset configs that require task-specific preprocessing (e.g. RE), call DatasetDict.to_document_type(...) via _convert_documents.yaml before these preprocessing steps
- when using generic dataset scripts (e.g. pie/brat), add converters via the document_converters parameter for load_dataset()
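The conditional conversion for train.py, evaluate.py and predict.py could look roughly like the following. FakeDataset and FakeTaskModule are hypothetical stand-ins that only expose the two members mentioned above, and maybe_convert is an illustrative helper name, not part of the library.

```python
from typing import Optional, Type


class Document:
    """Hypothetical document base class."""


class TextDocumentWithEntities(Document):
    """Hypothetical document type required by the taskmodule."""


class FakeDataset:
    """Stand-in exposing only document_type and to_document_type()."""

    def __init__(self, document_type: Type[Document]) -> None:
        self.document_type = document_type

    def to_document_type(self, document_type: Type[Document]) -> "FakeDataset":
        # The real method would apply the registered converters.
        return FakeDataset(document_type)


class FakeTaskModule:
    """Stand-in exposing only the document_type property value."""

    def __init__(self, document_type: Optional[Type[Document]] = None) -> None:
        self.document_type = document_type


def maybe_convert(dataset: FakeDataset, taskmodule: FakeTaskModule) -> FakeDataset:
    target = taskmodule.document_type
    if target is None:
        # The taskmodule does not declare a type: keep the old behaviour.
        return dataset
    if issubclass(dataset.document_type, target):
        # Already compatible, nothing to do.
        return dataset
    return dataset.to_document_type(target)


dataset = maybe_convert(FakeDataset(Document), FakeTaskModule(TextDocumentWithEntities))
```

Because both checks fall through to returning the dataset unchanged, projects whose taskmodules do not define a document type are unaffected.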

This also adds the following document types to be used as common reference points:

- TextDocumentWithEntitiesAndRelations
- TextDocumentWithLabeledEntitiesAndRelations
- TextDocumentWithLabeledEntitiesRelationsAndLabeledPartitions
- and more

Todo:

- register_converter (also in PieBuilder init) should work without a document_type and instead try to infer it from the return type annotation of the converter (if it is a callable)
- ~~convert_to() should accept a list of document_type and try them one by one (if a taskmodule works with multiple document types)~~ Edit: we added TaskModule.document_type that returns a single type
- PieDatasetBuilder should accept a document_converters parameter (instead of document_type and converter) that will update DOCUMENT_CONVERTERS before passing it to dataset creation (maybe convert string keys)
- convert_to() should be to_document_type() and should not accept a converter
- ~~swap arguments of issubclass in get_best_dataset_converter_with_types() because we "up-cast" to document_type afterwards (e.g. to add partitions later on)~~ EDIT: This is already the case.
- double-check naming:
  - convert_to() method EDIT: renamed to to_document_type()
  - target_document_type parameter for PieDatasetBuilder EDIT: removed
- add tests


ArneBinder added the enhancement New feature or request label Sep 8, 2023

ArneBinder mentioned this pull request Sep 8, 2023

add document converters (simple version) #333

Closed

ArneBinder force-pushed the document_converters branch from d333c1a to 5a0e15b Compare September 11, 2023 13:58

ArneBinder added 23 commits September 11, 2023 17:07

add document_converters

bb110e4

fix test and improve error message

4e6b884

fix naming

de8ee7f

fix documents.py

23d1989

improve convert_to()

40d7a8a

remove the document converters because they are not valid anymore

a5d8a5e

fix reset of document_converters

0c1be09

add target_document_type and document_converter to PieDatasetBuilder

670002a

fix register_document_converter()

87fa88e

allow a list for document_type and try to infer it from the converter if that is provided

ec52457

fix _infer_document_type_from_function_return() for strict=False

8086438

also allow derived document types when looking for registered converter

8eadeee

derive documents from each other

dde2f9f

rearrange document types

698e6a0

add DOCUMENT_TYPE and document_type property to TaskModule

92d7db1

just allow a single entry in document_type parameters

d955c72

PieDatasetBuilder accepts a document_converters parameter (instead of document_type and converter); rename convert_to() to to_document_type() and it accepts only a (mandatory) document_type parameter

6999fa8

fix IterableDataset.to_document_type()

397c01e

simplify dataset_to_document_type() and get_best_dataset_converter_with_types() to not accept a converter

8861746

infer document_type in (Iterable)Dataset.register_document_converter() instead of in DocumentDict version

e7f5066

fix DatasetDict.to_document_type()

989c80c

add log message

256c7a0

try to resolve the converter if it is a string

0efc8b8

ArneBinder force-pushed the document_converters branch from 5a0e15b to 0efc8b8 Compare September 11, 2023 15:08

ArneBinder added 3 commits September 11, 2023 17:16

add TextDocumentWithEntitiesAndRelations and TextDocumentWithEntitiesRelationsAndLabeledPartitions

52548bb

DatasetDict.to_document_type() does nothing if resolved_document_type == self.document_type

0fd2048

move test_get_pie_dataset_type() and remove import

17607b3

ArneBinder added 5 commits September 12, 2023 16:21

add tests for dataset builder

99adce0

add tests for (Iterable)Dataset.register_document_converter()

54246da

add tests for (Iterable)Dataset.to_document_type()

bebc5d9

add tests for DatasetDict.register_document_converter(); fix error message

642bbd1

add tests for DatasetDict.to_document_type()

d4fc8f5

ArneBinder merged commit a9b1014 into main Sep 12, 2023

ArneBinder deleted the document_converters branch September 12, 2023 16:47

ArneBinder mentioned this pull request Sep 13, 2023

use document converters ArneBinder/pytorch-ie-hydra-template-1#124

Closed
