add document converters #332

Merged
merged 31 commits into from
Sep 12, 2023

@ArneBinder ArneBinder commented Sep 8, 2023

This PR moves the main logic for document preparation, i.e. converting the documents of a loaded dataset to a format that a taskmodule understands, into the dataset builder scripts. As a result, this logic

  1. is easily shareable via the Hugging Face Hub, and
  2. allows for document auto-conversion.

With this PR, document converters can be added to dataset builders either via the class variable DOCUMENT_CONVERTERS of the builder or dynamically via the parameter document_converters when calling load_dataset(). In addition, Dataset, IterableDataset, and DatasetDict have these new methods:

  • register_document_converter(converter, document_type=None): register a new converter for a target document type. If the document type is not provided, we try to infer it from the converter's return type.
  • to_document_type(document_type): convert to the given document type by using the registered converters.
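
To illustrate the return-type inference mentioned for register_document_converter(), here is a minimal, self-contained sketch. It is a toy model and not the code from this PR: ConverterRegistry, infer_document_type, Document, TextDocumentWithEntities and add_entities are hypothetical names made up for the example, and reading the annotation via typing.get_type_hints() is an assumption about how such an inference could be done.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Type, get_type_hints


@dataclass
class Document:
    text: str


@dataclass
class TextDocumentWithEntities(Document):
    # Hypothetical target type, used only for this illustration.
    entities: tuple = ()


def infer_document_type(converter: Callable) -> Type:
    """Read the target document type from the converter's return annotation."""
    return_type = get_type_hints(converter).get("return")
    if return_type is None:
        raise TypeError("no return annotation found; pass document_type explicitly")
    return return_type


class ConverterRegistry:
    """Toy stand-in for the converter bookkeeping of a dataset."""

    def __init__(self) -> None:
        self.converters: Dict[Type, Callable] = {}

    def register_document_converter(
        self, converter: Callable, document_type: Optional[Type] = None
    ) -> None:
        # Fall back to the return annotation if no target type is given.
        if document_type is None:
            document_type = infer_document_type(converter)
        self.converters[document_type] = converter


def add_entities(doc: Document) -> TextDocumentWithEntities:
    return TextDocumentWithEntities(text=doc.text)


registry = ConverterRegistry()
registry.register_document_converter(add_entities)  # target type is inferred
```

A converter without a return annotation (e.g. a lambda) cannot be registered this way in the sketch, so the target type has to be passed explicitly in that case.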

If the converter is a function, it is applied via map() to all documents. If it is a dict, it is interpreted as a field name mapping and we perform a simple cast with it (if you just need to rename fields, this is much more efficient). If no converter is registered for the requested document type, but one is registered for a superclass of it, we use that one and cast to the requested document type afterwards. If no valid converter is found at all, we simply cast to the requested document type.
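
The lookup order described above can be sketched roughly as follows. This is a simplified toy model on plain dataclasses and lists, not the actual implementation: cast(), to_document_type() as a free function, and all document classes here are hypothetical, and picking the most specific registered superclass (longest MRO) is an assumption about the tie-breaking.

```python
from dataclasses import dataclass, fields


@dataclass
class RawDocument:
    text: str
    spans: tuple = ()


@dataclass
class TextDocument:
    text: str


@dataclass
class DocWithEntities(TextDocument):
    entities: tuple = ()


@dataclass
class DocWithEntitiesAndSentences(DocWithEntities):
    sentences: tuple = ()


def cast(doc, document_type, field_mapping=None):
    """Rebuild doc as document_type: rename fields, drop unknown ones."""
    field_mapping = field_mapping or {}
    target_names = {f.name for f in fields(document_type)}
    values = {}
    for f in fields(doc):
        name = field_mapping.get(f.name, f.name)
        if name in target_names:
            values[name] = getattr(doc, f.name)
    return document_type(**values)


def to_document_type(docs, converters, document_type):
    # 1. exact match, else 2. converter registered for a superclass
    if document_type in converters:
        registered_type = document_type
    else:
        candidates = [t for t in converters if issubclass(document_type, t)]
        # Assumption: prefer the most specific superclass.
        registered_type = max(candidates, key=lambda t: len(t.__mro__), default=None)
    # 3. no valid converter at all: plain cast
    if registered_type is None:
        return [cast(d, document_type) for d in docs]
    converter = converters[registered_type]
    if callable(converter):
        converted = [converter(d) for d in docs]  # function: applied per document
    else:
        converted = [cast(d, registered_type, converter) for d in docs]  # dict: rename
    if registered_type is not document_type:
        # "up-cast" to the requested type afterwards
        converted = [cast(d, document_type) for d in converted]
    return converted


converters = {DocWithEntities: {"spans": "entities"}}
raw = [RawDocument(text="Hello", spans=((0, 5),))]
result = to_document_type(raw, converters, DocWithEntitiesAndSentences)
```

Here only a field mapping for DocWithEntities is registered, so a request for DocWithEntitiesAndSentences first renames spans to entities and then casts to the requested subclass, leaving sentences empty.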

In addition, TaskModules can now define a class variable DOCUMENT_TYPE (default is None) to signal what document type they require. This value should only be accessed via the taskmodule.document_type property. By default, the property just returns the value of DOCUMENT_TYPE, but it can be overridden to implement special variants, e.g. if we want to use sentence partitions, the taskmodule should return TextDocumentWith...AndSentences instead of just TextDocumentWith....
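
A minimal sketch of this pattern, assuming a stripped-down TaskModule base class: the classes below are stand-ins written for this example, and EntityTaskModule, its partition_annotation argument and both document types are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Type


@dataclass
class TextDocumentWithEntities:
    text: str
    entities: tuple = ()


@dataclass
class TextDocumentWithEntitiesAndSentences(TextDocumentWithEntities):
    sentences: tuple = ()


class TaskModule:
    # Subclasses set this to signal the document type they require.
    DOCUMENT_TYPE: Optional[Type] = None

    @property
    def document_type(self) -> Optional[Type]:
        # Default behaviour: just expose the class variable.
        return self.DOCUMENT_TYPE


class EntityTaskModule(TaskModule):
    DOCUMENT_TYPE = TextDocumentWithEntities

    def __init__(self, partition_annotation: Optional[str] = None) -> None:
        self.partition_annotation = partition_annotation

    @property
    def document_type(self) -> Optional[Type]:
        # Special variant: require sentences if they are used as partitions.
        if self.partition_annotation == "sentences":
            return TextDocumentWithEntitiesAndSentences
        return self.DOCUMENT_TYPE
```

Going through the property rather than the class variable is what lets the required type depend on the taskmodule's configuration.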

The combination of dataset.to_document_type() and taskmodule.document_type allows for auto-conversion in most cases. However, this is fully optional and therefore backwards compatible.

Usage in downstream projects (e.g. created from pytorch-ie-hydra-template):

  • in train.py, evaluate.py and predict.py: call dataset = dataset.to_document_type(taskmodule.document_type) if taskmodule.document_type is defined and dataset.document_type is not already a subclass of it
  • in addition, for dataset configs that require task-specific preprocessing (e.g. RE), call DatasetDict.to_document_type(...) via _convert_documents.yaml before these preprocessing steps
  • when using generic dataset scripts (e.g. pie/brat), add converters via the document_converters parameter of load_dataset()
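
The guard from the first bullet could look roughly like this. The helper name convert_for_taskmodule and the ToyDataset / ToyTaskModule stubs are hypothetical and exist only to make the snippet runnable on its own; in a real project the dataset and taskmodule objects would come from the library.

```python
class Document:
    pass


class TextDocumentWithEntities(Document):
    pass


class ToyDataset:
    """Stub that only tracks its document type."""

    def __init__(self, document_type):
        self.document_type = document_type

    def to_document_type(self, document_type):
        return ToyDataset(document_type)


class ToyTaskModule:
    def __init__(self, document_type=None):
        self.document_type = document_type


def convert_for_taskmodule(dataset, taskmodule):
    required = taskmodule.document_type
    # The taskmodule does not declare a type: leave the dataset untouched.
    if required is None:
        return dataset
    # The dataset already provides (a subclass of) the required type.
    if issubclass(dataset.document_type, required):
        return dataset
    return dataset.to_document_type(required)


dataset = ToyDataset(Document)
dataset = convert_for_taskmodule(dataset, ToyTaskModule(TextDocumentWithEntities))
```

Because both early returns leave the dataset untouched, adding such a call to existing scripts keeps the conversion opt-in.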

This also adds the following document types to be used as common reference points:

  • TextDocumentWithEntitiesAndRelations
  • TextDocumentWithLabeledEntitiesAndRelations
  • TextDocumentWithLabeledEntitiesRelationsAndLabeledPartitions
  • and more

Todo:

  • register_converter (also in PieBuilder init) should work without a document_type and instead try to infer it from the return type annotation of the converter (if it is a callable)
  • convert_to() should accept a list of document types and try them one by one (if a taskmodule works with multiple document types) EDIT: we added Taskmodule.document_type that returns a single type
  • PieDatasetBuilder should accept a document_converters parameter (instead of document_type and converter) that will update DOCUMENT_CONVERTERS before passing it to dataset creation (maybe convert string keys)
  • convert_to() should be to_document_type() and should not accept a converter
  • swap arguments of issubclass in get_best_dataset_converter_with_types() because we "up-cast" to document_type afterwards (e.g. to add partitions later on) EDIT: This is already the case.
  • double-check naming:
    • convert_to() method EDIT: renamed to to_document_type()
    • target_document_type parameter for PieDatasetBuilder EDIT: removed
  • add tests

@ArneBinder ArneBinder added the enhancement New feature or request label Sep 8, 2023
… document_type and converter); rename convert_to() to to_document_type() and it accepts only a (mandatory) document_type parameter