Implement concatenate_dataset_dicts (#153)
* implement concatenate_dataset_dicts
* add tests
* wipe metadata from docs in `concatenate_datasets` + add metadata to test datasets
* add feature check in `test_to_document_type_function`
* Fix `map()` when no function used at all.
* remove features not declared in the target document type
* add parameter `clean_metadata` to `concatenate_datasets` and `concatenate_dataset_dicts`

---------

Co-authored-by: Arne Binder <[email protected]>
1 parent 8fc0cc9, commit 34fff5d
Showing 10 changed files with 149 additions and 24 deletions.
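The commit message names `concatenate_dataset_dicts` and a `clean_metadata` parameter, but the changed source files are not visible here. As a rough illustration only, the plain-Python sketch below models a dataset dict as a mapping from split name to a list of document dicts. The `_sketch` function names, their signatures, the handling of splits that exist in only some inputs, and the toy documents are all assumptions, not the library's actual API or data.

```python
from copy import deepcopy
from typing import Dict, List

Document = dict  # hypothetical stand-in for a real document object
DatasetDict = Dict[str, List[Document]]  # split name -> documents


def concatenate_datasets_sketch(
    datasets: List[List[Document]], clean_metadata: bool
) -> List[Document]:
    """Chain several document lists, optionally wiping each document's metadata."""
    result = []
    for dataset in datasets:
        for doc in dataset:
            doc = deepcopy(doc)  # never mutate the caller's documents
            if clean_metadata:
                # Assumption: metadata keys differ between source datasets,
                # so they are dropped to keep the merged documents uniform.
                doc["metadata"] = {}
            result.append(doc)
    return result


def concatenate_dataset_dicts_sketch(
    dataset_dicts: List[DatasetDict], clean_metadata: bool = True
) -> DatasetDict:
    """Concatenate per split; a split missing from one input is simply skipped there."""
    split_names: List[str] = []
    for dataset_dict in dataset_dicts:
        for name in dataset_dict:
            if name not in split_names:
                split_names.append(name)
    return {
        name: concatenate_datasets_sketch(
            [dd[name] for dd in dataset_dicts if name in dd], clean_metadata
        )
        for name in split_names
    }


# Toy inputs (not the real fixtures): two sources with different metadata keys.
first = {"train": [{"text": "a", "metadata": {"cancer_type": "prostate"}}]}
second = {
    "train": [{"text": "b", "metadata": {"entity_ids": ["1", "2"]}}],
    "test": [{"text": "c", "metadata": {}}],
}
merged = concatenate_dataset_dicts_sketch([first, second], clean_metadata=True)
```

Copying each document before clearing its metadata keeps the inputs reusable; whether the real implementation copies or mutates is not known from this page.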
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
{ | ||
"document_type": "pytorch_ie.documents.TextDocumentWithLabeledSpansAndBinaryRelations" | ||
} |
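The `document_type` value is a dotted path to a Python class. How the library resolves it is not shown in this diff; a generic way to turn such a string into a class is sketched below. The helper name `resolve_dotted_path` is hypothetical, and a standard-library class stands in for the `pytorch_ie` one so that the snippet runs without that package installed.

```python
import importlib
import json


def resolve_dotted_path(path: str) -> type:
    """Import 'package.module.ClassName' and return the named attribute."""
    module_name, _, attr_name = path.rpartition(".")
    if not module_name:
        raise ValueError(f"expected 'module.ClassName', got {path!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attr_name)


# Same shape as the metadata file above, but pointing at a stdlib class.
metadata = json.loads('{"document_type": "collections.OrderedDict"}')
document_cls = resolve_dotted_path(metadata["document_type"])
```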
3 changes: 3 additions & 0 deletions
tests/fixtures/dataset_dict/comagc_extract/train/documents.jsonl
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
{"text": "Thus, FGF6 is increased in PIN and prostate cancer and can promote the proliferation of the transformed prostatic epithelial cells via paracrine and autocrine mechanisms.", "id": "10945637.s12", "metadata": {"CCS": "normalTOcancer", "CGE": "increased", "IGE": "unchanged", "PT": "causality", "cancer_type": "prostate", "expression_change_keyword_1": {"name": "\nNone\n", "pos": null, "type": null}, "expression_change_keyword_2": {"name": "increased", "pos": [14, 22], "type": "Positive_regulation"}}, "labeled_spans": {"annotations": [{"start": 6, "end": 10, "label": "GENE", "score": 1.0, "_id": -4685428526827816387}, {"start": 35, "end": 50, "label": "CANCER", "score": 1.0, "_id": -611854743241672378}], "predictions": []}, "binary_relations": {"annotations": [{"head": -4685428526827816387, "tail": -611854743241672378, "label": "oncogene", "score": 1.0, "_id": -1790325547764256303}], "predictions": []}} | ||
{"text": "Isolation and characterization of the major form of human MUC18 cDNA gene and correlation of MUC18 over-expression in prostate cancer cell lines and tissues with malignant progression.", "id": "11722842.s0", "metadata": {"CCS": "normalTOcancer", "CGE": "increased", "IGE": "unchanged", "PT": "observation", "cancer_type": "prostate", "expression_change_keyword_1": {"name": "over-expression", "pos": [99, 113], "type": "Gene_expression"}, "expression_change_keyword_2": {"name": "over-expression", "pos": [99, 113], "type": "Positive_regulation"}}, "labeled_spans": {"annotations": [{"start": 93, "end": 98, "label": "GENE", "score": 1.0, "_id": -2017777239235151954}, {"start": 118, "end": 133, "label": "CANCER", "score": 1.0, "_id": 4129617449961559606}], "predictions": []}, "binary_relations": {"annotations": [{"head": -2017777239235151954, "tail": 4129617449961559606, "label": "biomarker", "score": 1.0, "_id": 7993340717186791454}], "predictions": []}} | ||
{"text": "We therefore conclude that MUC18 expression is increased during prostate cancer initiation (high grade PIN) and progression to carcinoma, and in metastatic cell lines and metastatic carcinoma.", "id": "11722842.s13", "metadata": {"CCS": "normalTOcancer", "CGE": "increased", "IGE": "unchanged", "PT": "observation", "cancer_type": "prostate", "expression_change_keyword_1": {"name": "expression", "pos": [33, 42], "type": "Gene_expression"}, "expression_change_keyword_2": {"name": "increased", "pos": [47, 55], "type": "Positive_regulation"}}, "labeled_spans": {"annotations": [{"start": 27, "end": 32, "label": "GENE", "score": 1.0, "_id": 5431679980839797458}, {"start": 64, "end": 79, "label": "CANCER", "score": 1.0, "_id": 1650882012654160466}], "predictions": []}, "binary_relations": {"annotations": [{"head": 5431679980839797458, "tail": 1650882012654160466, "label": "biomarker", "score": 1.0, "_id": -6073164971037079930}], "predictions": []}} |
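In these fixture records, each binary relation points to its argument spans through their `_id` values, and span offsets index into `text` with an exclusive `end`. The sketch below resolves the relation of the first record above into readable strings. It uses only the standard library; the helper name `resolve_relations` is hypothetical, and the record is abridged (its `metadata` field is emptied) to keep the example short.

```python
import json

# First record of the fixture above, with "metadata" emptied for brevity.
line = (
    '{"text": "Thus, FGF6 is increased in PIN and prostate cancer and can promote '
    'the proliferation of the transformed prostatic epithelial cells via paracrine '
    'and autocrine mechanisms.", "id": "10945637.s12", "metadata": {}, '
    '"labeled_spans": {"annotations": ['
    '{"start": 6, "end": 10, "label": "GENE", "score": 1.0, "_id": -4685428526827816387}, '
    '{"start": 35, "end": 50, "label": "CANCER", "score": 1.0, "_id": -611854743241672378}], '
    '"predictions": []}, '
    '"binary_relations": {"annotations": ['
    '{"head": -4685428526827816387, "tail": -611854743241672378, "label": "oncogene", '
    '"score": 1.0, "_id": -1790325547764256303}], "predictions": []}}'
)


def resolve_relations(doc: dict) -> list:
    """Return ((head_text, head_label), relation_label, (tail_text, tail_label)) tuples."""
    text = doc["text"]
    # Index spans by _id so relations can look up their arguments.
    spans = {span["_id"]: span for span in doc["labeled_spans"]["annotations"]}
    resolved = []
    for rel in doc["binary_relations"]["annotations"]:
        head, tail = spans[rel["head"]], spans[rel["tail"]]
        resolved.append(
            (
                (text[head["start"]:head["end"]], head["label"]),
                rel["label"],
                (text[tail["start"]:tail["end"]], tail["label"]),
            )
        )
    return resolved


relations = resolve_relations(json.loads(line))
```

A lookup that raises `KeyError` here would indicate a relation whose `head` or `tail` does not match any span `_id`, which makes this a convenient sanity check for serialized fixtures.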
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
{ | ||
"document_type": "pytorch_ie.documents.TextDocumentWithLabeledSpansAndBinaryRelations" | ||
} |
3 changes: 3 additions & 0 deletions
tests/fixtures/dataset_dict/tbga_extract/test/documents.jsonl
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
{"text": "In addition, the combined cancer genome expression metaanalysis datasets included PDE11A among the top 1% down-regulated genes in PCa.", "id": null, "metadata": {"entity_ids": ["50940", "C0006826"], "entity_names": ["PDE11A", "Malignant Neoplasms"]}, "labeled_spans": {"annotations": [{"start": 82, "end": 88, "label": "ENTITY", "score": 1.0, "_id": -924809712458378694}, {"start": 26, "end": 32, "label": "ENTITY", "score": 1.0, "_id": -8300559430683946006}], "predictions": []}, "binary_relations": {"annotations": [{"head": -924809712458378694, "tail": -8300559430683946006, "label": "NA", "score": 1.0, "_id": -1873235480272460116}], "predictions": []}} | ||
{"text": "We conclude that the CYGB gene is regulated by both promoter methylation and tumour hypoxia in HNSCC and that increased expression of this gene correlates with clincopathological measures of a tumour's biological aggression.", "id": null, "metadata": {"entity_ids": ["114757", "C0001807"], "entity_names": ["CYGB", "Aggressive behavior"]}, "labeled_spans": {"annotations": [{"start": 21, "end": 30, "label": "ENTITY", "score": 1.0, "_id": 4471756672664549063}, {"start": 213, "end": 223, "label": "ENTITY", "score": 1.0, "_id": -3820234498234956495}], "predictions": []}, "binary_relations": {"annotations": [{"head": 4471756672664549063, "tail": -3820234498234956495, "label": "NA", "score": 1.0, "_id": -1529179093863665121}], "predictions": []}} | ||
{"text": "Thus, the role of SIVA in tumorigenesis remains unclear.", "id": null, "metadata": {"entity_ids": ["10572", "C0007621"], "entity_names": ["SIVA1", "Neoplastic Cell Transformation"]}, "labeled_spans": {"annotations": [{"start": 18, "end": 22, "label": "ENTITY", "score": 1.0, "_id": 3174421471102386276}, {"start": 26, "end": 39, "label": "ENTITY", "score": 1.0, "_id": -6496953722761076655}], "predictions": []}, "binary_relations": {"annotations": [{"head": 3174421471102386276, "tail": -6496953722761076655, "label": "NA", "score": 1.0, "_id": 2920545352474864205}], "predictions": []}} |
3 changes: 3 additions & 0 deletions
tests/fixtures/dataset_dict/tbga_extract/train/documents.jsonl
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
{"text": "A monocyte chemoattractant protein-1 gene polymorphism is associated with occult ischemia in a high-risk asymptomatic population.", "id": null, "metadata": {"entity_ids": ["6347", "C0231221"], "entity_names": ["CCL2", "Asymptomatic"]}, "labeled_spans": {"annotations": [{"start": 2, "end": 36, "label": "ENTITY", "score": 1.0, "_id": 5426963144202911262}, {"start": 105, "end": 117, "label": "ENTITY", "score": 1.0, "_id": 8375553621315725498}], "predictions": []}, "binary_relations": {"annotations": [{"head": 5426963144202911262, "tail": 8375553621315725498, "label": "NA", "score": 1.0, "_id": 8597812253194613001}], "predictions": []}} | ||
{"text": "This study examined the effects of Her2 blockade on tumor angiogenesis, vascular architecture, and hypoxia in Her2(+) and Her2(-) MCF7 xenograft tumors.", "id": null, "metadata": {"entity_ids": ["2064", "C0242184"], "entity_names": ["ERBB2", "Hypoxia"]}, "labeled_spans": {"annotations": [{"start": 122, "end": 126, "label": "ENTITY", "score": 1.0, "_id": 8449701248948288217}, {"start": 99, "end": 106, "label": "ENTITY", "score": 1.0, "_id": -971867574717604855}], "predictions": []}, "binary_relations": {"annotations": [{"head": 8449701248948288217, "tail": -971867574717604855, "label": "NA", "score": 1.0, "_id": -2442696185288775855}], "predictions": []}} | ||
{"text": "Eleven deleterious variants, six nonsense and five missense, were identified in seven genes: four LCA-associated genes (CEP290, IQCB1, NMNAT1, and RPGRIP1), one gene responsible for syndromic LCA (ALMS1), and two IRDs-related genes (CTNNA1 and CYP4V2).", "id": null, "metadata": {"entity_ids": ["80184", "C2931258"], "entity_names": ["CEP290", "Amaurosis congenita of Leber, type 1"]}, "labeled_spans": {"annotations": [{"start": 120, "end": 126, "label": "ENTITY", "score": 1.0, "_id": 3602497405587057427}, {"start": 98, "end": 101, "label": "ENTITY", "score": 1.0, "_id": 2172619519622247379}], "predictions": []}, "binary_relations": {"annotations": [{"head": 3602497405587057427, "tail": 2172619519622247379, "label": "genomic_alterations", "score": 1.0, "_id": 8689688816868215711}], "predictions": []}} |