add document converters #332
Merged
ArneBinder force-pushed the document_converters branch from d333c1a to 5a0e15b (September 11, 2023 13:58)
… if that is provided
… document_type and converter); rename convert_to() to to_document_type() and it accepts only a (mandatory) document_type parameter
…th_types() to not accept a converter
…) instead of in DocumentDict version
ArneBinder force-pushed the document_converters branch from 5a0e15b to 0efc8b8 (September 11, 2023 15:08)
…RelationsAndLabeledPartitions
… == self.document_type
This PR moves the main logic for document preparation, i.e. converting the documents of a loaded dataset to a format that is understood by a taskmodule, to the dataset builder scripts. As a result, that
With this PR, document converters can be added to dataset builders either via the class variable `DOCUMENT_CONVERTERS` of the builder or dynamically via the parameter `document_converters` when calling `load_dataset()`. In addition, `Dataset`, `IterableDataset`, and `DatasetDict` have these new methods:

- `register_document_converter(converter, document_type=None)`: register a new converter for a target document type. If the document type is not provided, the method tries to infer it from the converter's return type.
- `to_document_type(document_type)`: convert to the document type by using the registered converters.

If the converter is a function, it will be applied via `map()` to all documents. If it is a dict, it is interpreted as a field name mapping and we perform a simple cast with it (if you just need to rename fields, this is much more efficient). If no converter is registered for the requested document type, but one is registered for a superclass of it, we use that one and cast to the requested document type afterwards. If no valid converter is found at all, we simply cast to the requested document type.

In addition, `TaskModule`s can now define a class parameter `DOCUMENT_TYPE` (default is `None`) to signal what document type they require. This value can and should only be accessed via the `taskmodule.document_type` property. By default, this just returns the value of `DOCUMENT_TYPE`, but it can be used to implement special variants, e.g. if we want to use sentence partitions, the taskmodule should return `TextDocumentWith...AndSentences` instead of just `TextDocumentWith...`.

The combination of `dataset.to_document_type()` and `taskmodule.document_type` allows for auto-conversion in most cases. However, this is fully optional and therefore backwards compatible.

Usage in downstream projects (e.g. created from `pytorch-ie-hydra-template`):

- `dataset = dataset.to_document_type(taskmodule.document_type)` if `taskmodule.document_type` is defined and `dataset.document_type` is not already a subclass of it
- … `DatasetDict.to_document_type(...)` via `_convert_documents.yaml` before these preprocessing steps
- … `pie/brat`), add converters via the `document_converters` parameter for `load_dataset()`
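
The first usage pattern above (convert only if the taskmodule declares a document type and the dataset is not already a subclass of it) can be sketched in plain Python. This is a rough illustration, not the library's code: `TaskModuleSketch`, `EntityTaskModuleSketch`, `DatasetSketch`, `auto_convert`, the `use_sentences` flag, and both document classes are hypothetical stand-ins that only borrow the names `DOCUMENT_TYPE`, `document_type`, and `to_document_type()` from this PR.

```python
from typing import Optional, Type


class TextDocumentWithEntities:
    """Hypothetical stand-in for a document type."""


class TextDocumentWithEntitiesAndSentences(TextDocumentWithEntities):
    """Hypothetical variant that additionally carries sentence partitions."""


class TaskModuleSketch:
    # Class-level declaration of the required document type (None = no requirement).
    DOCUMENT_TYPE: Optional[Type] = None

    def __init__(self, use_sentences: bool = False) -> None:
        # Hypothetical option, used only to show a "special variant".
        self.use_sentences = use_sentences

    @property
    def document_type(self) -> Optional[Type]:
        # Default behaviour: just expose the class-level value.
        return self.DOCUMENT_TYPE


class EntityTaskModuleSketch(TaskModuleSketch):
    DOCUMENT_TYPE = TextDocumentWithEntities

    @property
    def document_type(self) -> Optional[Type]:
        # Special variant: require the sentence-aware type when configured to.
        if self.use_sentences:
            return TextDocumentWithEntitiesAndSentences
        return self.DOCUMENT_TYPE


class DatasetSketch:
    def __init__(self, document_type: Type) -> None:
        self.document_type = document_type

    def to_document_type(self, document_type: Type) -> "DatasetSketch":
        # The actual conversion is omitted; only the resulting type is tracked.
        return DatasetSketch(document_type)


def auto_convert(dataset: DatasetSketch, taskmodule: TaskModuleSketch) -> DatasetSketch:
    """Convert only when a type is requested and the dataset does not satisfy it."""
    target = taskmodule.document_type
    if target is not None and not issubclass(dataset.document_type, target):
        return dataset.to_document_type(target)
    return dataset
```

Because the check goes through the `document_type` property rather than `DOCUMENT_TYPE` directly, the sentence-aware variant is picked up without any change to the calling code.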
This also adds the following document types to be used as common reference points:
- `TextDocumentWithEntitiesAndRelations`
- `TextDocumentWithLabeledEntitiesAndRelations`
- `TextDocumentWithLabeledEntitiesRelationsAndLabeledPartitions`
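
Common reference types are useful because converters are looked up by target document type, with a fallback to a converter registered for a superclass. The sketch below illustrates that lookup order as described in this PR; it is not the library's implementation. `find_converter`, `conversion_plan`, and the three document classes are hypothetical, and preferring the most specific registered superclass is an assumption made here for the example.

```python
from typing import Callable, Dict, Optional, Tuple, Type, Union

# A converter is either a function (applied per document) or a field-name mapping.
Converter = Union[Callable, Dict[str, str]]


class DocWithEntities:
    """Hypothetical reference document type."""


class DocWithEntitiesAndPartitions(DocWithEntities):
    """Hypothetical subtype that adds partitions."""


class UnrelatedDoc:
    """Hypothetical type with no registered converter and no registered superclass."""


def find_converter(
    converters: Dict[Type, Converter], requested: Type
) -> Tuple[Optional[Converter], Optional[Type]]:
    """Return (converter, type it is registered for), or (None, None)."""
    # 1. Exact match on the requested type.
    if requested in converters:
        return converters[requested], requested
    # 2. Otherwise, consider converters registered for a superclass of the request.
    candidates = [t for t in converters if issubclass(requested, t)]
    if not candidates:
        return None, None
    # Assumption: prefer the most specific superclass (longest MRO).
    best = max(candidates, key=lambda t: len(t.__mro__))
    return converters[best], best


def conversion_plan(converters: Dict[Type, Converter], requested: Type) -> str:
    """Describe which steps a conversion to `requested` would take."""
    converter, registered = find_converter(converters, requested)
    if converter is None:
        return "cast"  # no valid converter: plain cast
    step = "rename fields" if isinstance(converter, dict) else "map"
    if registered is requested:
        return step
    return f"{step}, then cast"  # superclass converter, then up-cast
```

Note the argument order in `issubclass(requested, t)`: the requested type must be a subclass of the registered one, since the result is cast up to the requested type afterwards.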
Todo:

- `convert_to()` should accept a list of `document_type` and try them one by one (if a taskmodule works with multiple document types). Edit: we added `Taskmodule.document_type` that returns a single type.
- `PieDatasetBuilder` should accept a `document_converters` parameter (instead of `document_type` and `converter`) that will update `DOCUMENT_CONVERTERS` before passing it to dataset creation (maybe convert string keys)
- `convert_to()` should be `to_document_type()` and should not accept a converter
- swap arguments of `issubclass` in `get_best_dataset_converter_with_types()` because we "up-cast" to `document_type` afterwards (e.g. to add partitions later on). EDIT: This is already the case.
- `convert_to()` method. EDIT: renamed to `to_document_type()`
- `target_document_type` parameter for `PieDatasetBuilder`. EDIT: removed