Merge pull request #97 from ArneBinder/add_visualization_for_AM_dataset_cards

Add visualization for AM dataset cards.
ArneBinder · Jan 19, 2024 · 8072a91
2 parents e8238db + c310e25
commit 8072a91
Showing 12 changed files with 81 additions and 58 deletions.
diff --git a/dataset_builders/pie/aae2/README.md b/dataset_builders/pie/aae2/README.md
@@ -8,7 +8,8 @@ Therefore, the `aae2` dataset as described here follows the data structure from
 
 Argument Annotated Essays Corpus (AAEC) ([Stab and Gurevych, 2017](https://aclanthology.org/J17-3005.pdf)) contains student essays. A stance for a controversial theme is expressed by a major claim component as well as claim components, and premise components justify or refute the claims. Attack and support labels are defined as relations. The span covers a statement, *which can stand in isolation as a complete sentence*, according to the AAEC annotation guidelines. All components are annotated with minimum boundaries of a clause or sentence excluding so-called "shell" language such as *On the other hand* and *Hence*. (Morio et al., 2022, p. 642)
 
-There is no premise that links to another premise or claim in a different paragraph. That means, an argumentation tree structure is complete within each paragraph. Therefore, it is possible to train a model on the full documents or just at the paragraph-level which is usually less memory-exhaustive (Eger et al., 2017, p. 16).
+In the original dataset, no premise links to another premise or claim in a different paragraph. That is, each argumentation tree is complete within a single paragraph. Therefore, it is possible to train a model on the full documents or just at the paragraph level, which usually requires less memory (Eger et al., 2017, p. 16).
+However, through our `DOCUMENT_CONVERTERS`, we build links between claims, creating a graph structure throughout an entire essay (see [Document Converters](#document-converters)).
 
 ### Supported Tasks and Leaderboards
 
@@ -50,7 +51,7 @@ assert isinstance(doc, builders.brat.BratDocumentWithMergedSpans)
 
 See further statistics in Stab & Gurevych (2017), p. 650, Table A.1.
 
-### Label Descriptions
+### Label Descriptions and Statistics
 
 #### Components
 
@@ -64,8 +65,6 @@ See further statistics in Stab & Gurevych (2017), p. 650, Table A.1.
 - `Claim` constitutes the central component of each argument. Each one has at least one premise and takes the stance attribute value "for" or "against" with regard to the major claim.
 - `Premise` gives a reason for the argument; it is linked either to a claim or to another premise.
 
-**Note that** relations between `MajorClaim` and  `Claim` were not annotated; however, each claim is annotated with an `Attribute` annotation with value `for` or `against` - which indicates the relation between itself and `MajorClaim`. In addition, when two non-related `Claim` 's appear in one paragraph, there is also no relations to one another.
-
 #### Relations
 
 | Relations           | Count | Percentage |
@@ -79,6 +78,12 @@ See further statistics in Stab & Gurevych (2017), p. 650, Table A.1.
 
 See further description in Stab & Gurevych 2017, p.627 and the [annotation guideline](https://github.com/ArneBinder/pie-datasets/blob/db94035602610cefca2b1678aa2fe4455c96155d/data/datasets/ArgumentAnnotatedEssays-2.0/guideline.pdf).
 
+**Note that** relations between `MajorClaim` and `Claim` were not annotated; however, each claim carries an `Attribute` annotation with value `for` or `against`, which indicates its relation to the `MajorClaim`. In addition, when two unrelated `Claim`s appear in one paragraph, no relation is annotated between them. An example document is shown below.
+
+#### Example
+
+![Example](img/sg17f2.png)
+
 ### Document Converters
 
 The dataset provides document converters for the following target document types:

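The note above says that a claim's `for`/`against` attribute stands in for an unannotated relation to the `MajorClaim`. The following is a minimal sketch of how such stance attributes could be turned into explicit relations. It is not the actual `DOCUMENT_CONVERTERS` code: the `Span` and `Relation` classes, the function name, and the `supports`/`attacks` label names are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int
    end: int
    label: str


@dataclass(frozen=True)
class Relation:
    head: Span
    tail: Span
    label: str


# Assumed mapping from stance attribute value to relation label (hypothetical names).
STANCE_TO_RELATION = {"for": "supports", "against": "attacks"}


def stance_relations(major_claims, claims_with_stance):
    """Create one relation from each claim to each major claim, labeled by stance."""
    relations = []
    for claim, stance in claims_with_stance:
        for major_claim in major_claims:
            relations.append(
                Relation(head=claim, tail=major_claim, label=STANCE_TO_RELATION[stance])
            )
    return relations


major_claim = Span(0, 10, "MajorClaim")
claim_for = Span(20, 30, "Claim")
claim_against = Span(40, 50, "Claim")
relations = stance_relations(
    [major_claim], [(claim_for, "for"), (claim_against, "against")]
)
```

With relations like these added, claims from different paragraphs become connected through the major claim, which is one way an essay-wide graph could arise.
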
diff --git a/dataset_builders/pie/aae2/img/sg17f2.png b/dataset_builders/pie/aae2/img/sg17f2.png
diff --git a/dataset_builders/pie/abstrct/README.md b/dataset_builders/pie/abstrct/README.md
@@ -41,19 +41,6 @@ doc = datasets["neoplasm_train"][0]
 assert isinstance(doc, builders.brat.BratDocumentWithMergedSpans)
 ```
 
-### Document Converters
-
-The dataset provides document converters for the following target document types:
-
-- `pytorch_ie.documents.TextDocumentWithLabeledSpansAndBinaryRelations`
-  - `LabeledSpans`, converted from `BratDocumentWithMergedSpans`'s `spans`
-    - labels: `MajorClaim`, `Claim`, `Premise`
-  - `BinraryRelations`, converted from `BratDocumentWithMergedSpans`'s `relations`
-    - labels:  `Support`, `Partial-Attack`, `Attack`
-
-See [here](https://github.com/ChristophAlt/pytorch-ie/blob/main/src/pytorch_ie/documents.py) for the document type
-definitions.
-
 ### Data Splits
 
 | Disease-based Split                                       |              `neoplasm` |           `glaucoma` |              `mixed` |
@@ -65,7 +52,7 @@ definitions.
 - `mixed_test` contains 20 abstracts on the following diseases: glaucoma, neoplasm, diabetes, hypertension, hepatitis.
 - 31 out of 40 abstracts in `mixed_test` overlap with abstracts in `neoplasm_test` and `glaucoma_test`.
 
-### Label Descriptions
+### Label Descriptions and Statistics
 
 In this section, we describe labels according to [Mayer et al. (2020)](https://ebooks.iospress.nl/publication/55129), as well as our label counts on 669 abstracts.
 
@@ -105,6 +92,23 @@ Morio et al. ([2022](https://aclanthology.org/2022.tacl-1.37.pdf); p. 642, Table
 
 (Mayer et al. 2020, p.2110)
 
+#### Examples
+
+![Examples](img/abstr-sam.png)
+
+### Document Converters
+
+The dataset provides document converters for the following target document types:
+
+- `pytorch_ie.documents.TextDocumentWithLabeledSpansAndBinaryRelations`
+  - `labeled_spans`: `LabeledSpan` annotations, converted from `BratDocumentWithMergedSpans`'s `spans`
+    - labels: `MajorClaim`, `Claim`, `Premise`
+  - `binary_relations`: `BinaryRelation` annotations, converted from `BratDocumentWithMergedSpans`'s `relations`
+    - labels:  `Support`, `Partial-Attack`, `Attack`
+
+See [here](https://github.com/ChristophAlt/pytorch-ie/blob/main/src/pytorch_ie/documents.py) for the document type
+definitions.
+
 ## Dataset Creation
 
 ### Curation Rationale

diff --git a/dataset_builders/pie/abstrct/img/abstr-sam.png b/dataset_builders/pie/abstrct/img/abstr-sam.png
diff --git a/dataset_builders/pie/cdcp/README.md b/dataset_builders/pie/cdcp/README.md
@@ -1,4 +1,4 @@
-# PIE Dataset Card for "CDCP"
+# PIE Dataset Card for "cdcp"
 
 This is a [PyTorch-IE](https://github.com/ChristophAlt/pytorch-ie) wrapper for the
 [CDCP Huggingface dataset loading script](https://huggingface.co/datasets/DFKI-SLT/cdcp).
@@ -24,6 +24,11 @@ See [here](https://github.com/ChristophAlt/pytorch-ie/blob/main/src/pytorch_ie/a
 The dataset provides document converters for the following target document types:
 
 - `pytorch_ie.documents.TextDocumentWithLabeledSpansAndBinaryRelations`
+  - `labeled_spans`: `LabeledSpan` annotations, converted from `CDCPDocument`'s `propositions`
+    - labels: `fact`, `policy`, `reference`, `testimony`, `value`
+    - if `propositions` contain leading and/or trailing whitespace, it is trimmed.
+  - `binary_relations`: `BinaryRelation` annotations, converted from `CDCPDocument`'s `relations`
+    - labels:  `reason`, `evidence`
 
 See [here](https://github.com/ChristophAlt/pytorch-ie/blob/main/src/pytorch_ie/documents.py) for the document type
 definitions.
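
The whitespace trimming described for `propositions` can be sketched as follows. This is an illustrative helper, not the converter's actual code; the function name `trim_span` and the sample text are assumptions. It shrinks a character span until neither end points at whitespace.

```python
def trim_span(text: str, start: int, end: int) -> tuple:
    """Shrink the half-open span [start, end) so text[start:end] has no outer whitespace."""
    while start < end and text[start].isspace():
        start += 1
    while end > start and text[end - 1].isspace():
        end -= 1
    return start, end


text = "State law should cap fees.  Banks charge too much. "
# The original span (26, 51) includes two leading spaces and one trailing space.
start, end = trim_span(text, 26, 51)
trimmed = text[start:end]
```

Here `trimmed` is the proposition text without surrounding whitespace, while the offsets still refer to the unchanged document text.
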
diff --git a/dataset_builders/pie/sciarg/README.md b/dataset_builders/pie/sciarg/README.md
@@ -42,41 +42,23 @@ from pie_datasets import load_dataset, builders
 # load default version
 datasets = load_dataset("pie/sciarg")
 doc = datasets["train"][0]
-assert isinstance(doc, builders.brat.BratDocument)
-
-# load version with merged span fragments
-dataset_merged_spans = load_dataset("pie/sciarg", name="merge_fragmented_spans")
-doc_merged_spans = dataset_merged_spans["train"][0]
-assert isinstance(doc_merged_spans, builders.brat.BratDocumentWithMergedSpans)
+assert isinstance(doc, builders.brat.BratDocumentWithMergedSpans)
 ```
 
-### Document Converters
-
-The dataset provides document converters for the following target document types:
-
-- `pytorch_ie.documents.TextDocumentWithLabeledSpansAndBinaryRelations`
-  - `LabeledSpans`, converted from `BratDocument`'s `spans`
-    - labels: `background_claim`, `own_claim`, `data`
-    - if `spans` contain whitespace at the beginning and/or the end, the whitespace are trimmed out.
-  - `BinraryRelations`, converted from `BratDocument`'s `relations`
-    - labels: `supports`, `contradicts`, `semantically_same`, `parts_of_same`
-    - if the `relations` label is `semantically_same` or `parts_of_same`, they are merged if they are the same arguments after sorting.
-- `pytorch_ie.documents.TextDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions`
-  - `LabeledSpans`, as above
-  - `BinaryRelations`, as above
-  - `LabeledPartitions`, partitioned `BratDocument`'s `text`, according to the paragraph, using regex.
-    - labels: `title`, `abstract`, `H1`
-
-See [here](https://github.com/ChristophAlt/pytorch-ie/blob/main/src/pytorch_ie/documents.py) for the document type
-definitions.
-
 ### Data Splits
 
 The dataset consists of a single `train` split that has 40 documents.
 
 For detailed statistics on the corpus, see Lauscher et al. ([2018](https://aclanthology.org/W18-5206/), p. 43), and the authors' [resource analysis](https://github.com/anlausch/sciarg_resource_analysis).
 
-### Label Descriptions
+### Label Descriptions and Statistics
+
+In this section, we report statistics computed on our version of the corpus. Note that there are currently discrepancies in label counts between:
+
+- the previous report in [Lauscher et al., 2018](https://aclanthology.org/W18-5206/) (p. 43), and
+- the counts reported here (labels counted in `BratDocument`s).
+
+A possible explanation is that [Lauscher et al., 2018](https://aclanthology.org/W18-5206/) report the number of actual argumentative components, whereas here discontinuous components are still split (marked with the `parts_of_same` helper relation) and are thus counted per fragment.
 
 #### Components
 
@@ -117,14 +99,33 @@ For detailed statistics on the corpus, see Lauscher et al. ([2018](<(https://acl
 
 (*Annotation Guidelines*, pp. 4-6)
 
-**Important note on label counts**:
+#### Examples
 
-There are currently discrepancies in label counts between
+![sample1](img/leaannof3.png)
 
-- previous report in [Lauscher et al., 2018](https://aclanthology.org/W18-5206/), p. 43),
-- current report above here (labels counted in `BratDocument`'s);
+Subset of relations in `A01`
 
-possibly since [Lauscher et al., 2018](https://aclanthology.org/W18-5206/) presents the numbers of the real argumentative components, whereas here discontinuous components are still split (marked with the `parts_of_same` helper relation) and, thus, count per fragment.
+![sample2](img/sciarg-sam.png)
+
+### Document Converters
+
+The dataset provides document converters for the following target document types:
+
+- `pytorch_ie.documents.TextDocumentWithLabeledSpansAndBinaryRelations`
+  - `labeled_spans`: `LabeledSpan` annotations, converted from `BratDocument`'s `spans`
+    - labels: `background_claim`, `own_claim`, `data`
+    - if `spans` contain leading and/or trailing whitespace, it is trimmed.
+  - `binary_relations`: `BinaryRelation` annotations, converted from `BratDocument`'s `relations`
+    - labels: `supports`, `contradicts`, `semantically_same`, `parts_of_same`
+    - relations labeled `semantically_same` or `parts_of_same` are merged if they have the same arguments after sorting.
+- `pytorch_ie.documents.TextDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions`
+  - `labeled_spans`, as above
+  - `binary_relations`, as above
+  - `labeled_partitions`: `LabeledSpan` annotations, created by splitting `BratDocument`'s `text` at each new paragraph in the `xml` format.
+    - labels: `title`, `abstract`, `H1`
+
+See [here](https://github.com/ChristophAlt/pytorch-ie/blob/main/src/pytorch_ie/documents.py) for the document type
+definitions.
 
 ## Dataset Creation
 

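One plausible reading of the merge rule for `semantically_same` and `parts_of_same` is that these labels are symmetric, so two relations with swapped arguments are duplicates. The sketch below illustrates that reading only; the function name and the tuple representation of relations are assumptions, not the converter's actual implementation.

```python
SYMMETRIC_LABELS = {"semantically_same", "parts_of_same"}


def merge_symmetric_relations(relations):
    """Deduplicate (head, tail, label) triples, ignoring argument order for symmetric labels.

    head and tail are (start, end) character offsets.
    """
    seen = set()
    merged = []
    for head, tail, label in relations:
        if label in SYMMETRIC_LABELS:
            # Sorting the arguments gives both directions the same key.
            head, tail = sorted((head, tail))
        key = (head, tail, label)
        if key not in seen:
            seen.add(key)
            merged.append(key)
    return merged


relations = [
    ((10, 20), (0, 5), "parts_of_same"),
    ((0, 5), (10, 20), "parts_of_same"),  # same pair, swapped arguments
    ((0, 5), (10, 20), "supports"),
    ((10, 20), (0, 5), "supports"),  # directed label: kept as a separate relation
]
merged = merge_symmetric_relations(relations)
```

Directed labels such as `supports` are left untouched, so only the two `parts_of_same` entries collapse into one.
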
diff --git a/dataset_builders/pie/sciarg/img/leaannof3.png b/dataset_builders/pie/sciarg/img/leaannof3.png
diff --git a/dataset_builders/pie/sciarg/img/sciarg-sam.png b/dataset_builders/pie/sciarg/img/sciarg-sam.png
diff --git a/dataset_builders/pie/scidtb_argmin/README.md b/dataset_builders/pie/scidtb_argmin/README.md
@@ -7,7 +7,7 @@ This is a [PyTorch-IE](https://github.com/ChristophAlt/pytorch-ie) wrapper for t
 
 The document type for this dataset is `SciDTBArgminDocument` which defines the following data fields:
 
-- `tokens` (Tuple of string)
+- `tokens` (tuple of string)
 - `id` (str, optional)
 - `metadata` (dictionary, optional)
 
@@ -23,6 +23,11 @@ See [here](https://github.com/ChristophAlt/pytorch-ie/blob/main/src/pytorch_ie/a
 The dataset provides document converters for the following target document types:
 
 - `pytorch_ie.documents.TextDocumentWithLabeledSpansAndBinaryRelations`
+  - `labeled_spans`: `LabeledSpan` annotations, converted from `SciDTBArgminDocument`'s `units`
+    - labels: `proposal`, `assertion`, `result`, `observation`, `means`, `description`
+    - the `tokens` are joined with whitespace to create the `text` for the `LabeledSpan`s
+  - `binary_relations`: `BinaryRelation` annotations, converted from `SciDTBArgminDocument`'s `relations`
+    - labels: `support`, `attack`, `additional`, `detail`, `sequence`
 
 See [here](https://github.com/ChristophAlt/pytorch-ie/blob/main/src/pytorch_ie/documents.py) for the document type
 definitions.
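
Joining `tokens` with whitespace means token-level units have to be mapped to character offsets in the resulting `text`. The following sketch shows one way to do that. The helper names are hypothetical, and addressing a unit by inclusive first and last token indices is an assumption about the unit format, not something the README states.

```python
def tokens_to_text_and_offsets(tokens):
    """Join tokens with single spaces and record each token's (start, end) offsets."""
    text = " ".join(tokens)
    offsets = []
    position = 0
    for token in tokens:
        offsets.append((position, position + len(token)))
        position += len(token) + 1  # +1 for the separating space
    return text, offsets


def unit_to_char_span(offsets, first_token, last_token):
    """Map a unit covering tokens first_token..last_token (inclusive) to a character span."""
    return offsets[first_token][0], offsets[last_token][1]


tokens = ("We", "propose", "a", "parser", ".")
text, offsets = tokens_to_text_and_offsets(tokens)
start, end = unit_to_char_span(offsets, 1, 3)
unit_text = text[start:end]
```
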
diff --git a/tests/dataset_builders/pie/test_argmicro.py b/tests/dataset_builders/pie/test_argmicro.py
@@ -27,6 +27,7 @@
 disable_caching()
 
 DATASET_NAME = "argmicro"
+BUILDER_CLASS = ArgMicro
 SPLIT_SIZES = {"train": 112}
 DATA_PATH = FIXTURES_ROOT / "dataset_builders" / "arg-microtexts-master.zip"
 HF_DATASET_PATH = ArgMicro.BASE_DATASET_PATH
@@ -46,7 +47,7 @@ def hf_dataset(dataset_variant):
 @pytest.fixture(scope="module")
 def generate_document_kwargs(hf_dataset, dataset_variant):
     ds = hf_dataset["train"]
-    return ArgMicro(config_name=dataset_variant)._generate_document_kwargs(ds)
+    return BUILDER_CLASS(config_name=dataset_variant)._generate_document_kwargs(ds)
 
 
 def test_hf_dataset(hf_dataset, dataset_variant, generate_document_kwargs):
@@ -135,7 +136,7 @@ def test_hf_example(hf_example, hf_dataset, dataset_variant, generate_document_k
 
 @pytest.fixture(scope="module")
 def generated_document(hf_dataset, dataset_variant, generate_document_kwargs):
-    return ArgMicro(config_name=dataset_variant)._generate_document(
+    return BUILDER_CLASS(config_name=dataset_variant)._generate_document(
         hf_dataset["train"][0], **generate_document_kwargs
     )
 

diff --git a/tests/dataset_builders/pie/test_cdcp.py b/tests/dataset_builders/pie/test_cdcp.py
@@ -30,6 +30,7 @@
 disable_caching()
 
 DATASET_NAME = "cdcp"
+BUILDER_CLASS = CDCP
 SPLIT_SIZES = {"train": 581, "test": 150}
 HF_DATASET_PATH = CDCP.BASE_DATASET_PATH
 PIE_DATASET_PATH = PIE_BASE_PATH / DATASET_NAME
@@ -101,12 +102,12 @@ def test_hf_example(hf_example, split):
 
 @pytest.fixture(scope="module")
 def generate_document_kwargs(hf_dataset, split):
-    return CDCP()._generate_document_kwargs(hf_dataset[split])
+    return BUILDER_CLASS()._generate_document_kwargs(hf_dataset[split])
 
 
 @pytest.fixture(scope="module")
 def generated_document(hf_example, generate_document_kwargs):
-    return CDCP()._generate_document(hf_example, **generate_document_kwargs)
+    return BUILDER_CLASS()._generate_document(hf_example, **generate_document_kwargs)
 
 
 def test_generated_document(generated_document, split):

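The test modules in this diff all apply the same refactoring: a module-level `BUILDER_CLASS` constant replaces direct references to the builder class inside the fixtures, so the fixture bodies become identical across datasets. A minimal sketch of the pattern, using a hypothetical `ToyBuilder` in place of a real builder and plain functions in place of pytest fixtures:

```python
class ToyBuilder:
    """Hypothetical stand-in for a dataset builder such as CDCP or SciDTBArgmin."""

    def _generate_document_kwargs(self, dataset):
        # Pretend the builder derives some shared arguments from the dataset.
        return {"n_examples": len(dataset)}

    def _generate_document(self, example, **kwargs):
        return {"text": example["text"], **kwargs}


# The only line that differs between test modules.
BUILDER_CLASS = ToyBuilder


def generate_document_kwargs(dataset):
    return BUILDER_CLASS()._generate_document_kwargs(dataset)


def generated_document(example, kwargs):
    return BUILDER_CLASS()._generate_document(example, **kwargs)


dataset = [{"text": "first"}, {"text": "second"}]
kwargs = generate_document_kwargs(dataset)
document = generated_document(dataset[0], kwargs)
```

Only the constant needs to change when the same helpers are copied to a test module for another dataset.
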
diff --git a/tests/dataset_builders/pie/test_scidtb_argmin.py b/tests/dataset_builders/pie/test_scidtb_argmin.py
@@ -26,6 +26,7 @@
 disable_caching()
 
 DATASET_NAME = "scidtb_argmin"
+BUILDER_CLASS = SciDTBArgmin
 SPLIT_SIZES = {"train": 60}
 HF_DATASET_PATH = SciDTBArgmin.BASE_DATASET_PATH
 PIE_DATASET_PATH = PIE_BASE_PATH / DATASET_NAME
@@ -55,12 +56,12 @@ def test_hf_example(hf_example):
 
 @pytest.fixture(scope="module")
 def generate_document_kwargs(hf_dataset):
-    return SciDTBArgmin()._generate_document_kwargs(hf_dataset["train"])
+    return BUILDER_CLASS()._generate_document_kwargs(hf_dataset["train"])
 
 
 @pytest.fixture(scope="module")
 def generated_document(hf_example, generate_document_kwargs):
-    return SciDTBArgmin()._generate_document(hf_example, **generate_document_kwargs)
+    return BUILDER_CLASS()._generate_document(hf_example, **generate_document_kwargs)
 
 
 def test_generated_document(generated_document):