Merge pull request #97 from ArneBinder/add_visualization_for_AM_dataset_cards

Add visualization for AM dataset cards.
ArneBinder authored Jan 19, 2024
2 parents e8238db + c310e25 commit 8072a91
Showing 12 changed files with 81 additions and 58 deletions.
13 changes: 9 additions & 4 deletions dataset_builders/pie/aae2/README.md
@@ -8,7 +8,8 @@ Therefore, the `aae2` dataset as described here follows the data structure from

Argument Annotated Essays Corpus (AAEC) ([Stab and Gurevych, 2017](https://aclanthology.org/J17-3005.pdf)) contains student essays. A stance for a controversial theme is expressed by a major claim component as well as claim components, and premise components justify or refute the claims. Attack and support labels are defined as relations. The span covers a statement, *which can stand in isolation as a complete sentence*, according to the AAEC annotation guidelines. All components are annotated with minimum boundaries of a clause or sentence excluding so-called "shell" language such as *On the other hand* and *Hence*. (Morio et al., 2022, p. 642)

There is no premise that links to another premise or claim in a different paragraph. That means, an argumentation tree structure is complete within each paragraph. Therefore, it is possible to train a model on the full documents or just at the paragraph-level which is usually less memory-exhaustive (Eger et al., 2017, p. 16).
In the original dataset, no premise links to another premise or claim in a different paragraph. That is, each argumentation tree is complete within its paragraph. It is therefore possible to train a model on the full documents or only at the paragraph level, which usually requires less memory (Eger et al., 2017, p. 16).
However, through our `DOCUMENT_CONVERTERS`, we build links between claims, creating a graph structure throughout an entire essay (see [Document Converters](#document-converters)).

### Supported Tasks and Leaderboards

@@ -50,7 +51,7 @@ assert isinstance(doc, builders.brat.BratDocumentWithMergedSpans)

See further statistics in Stab & Gurevych (2017), p. 650, Table A.1.

### Label Descriptions
### Label Descriptions and Statistics

#### Components

@@ -64,8 +65,6 @@ See further statistics in Stab & Gurevych (2017), p. 650, Table A.1.
- `Claim` constitutes the central component of each argument. Each one has at least one premise and takes the stance attribute value "for" or "against" with regard to the major claim.
- `Premise` gives the reasons for the argument; it is linked either to a claim or to another premise.

**Note that** relations between `MajorClaim` and `Claim` were not annotated; however, each claim is annotated with an `Attribute` annotation with value `for` or `against`, which indicates its relation to the `MajorClaim`. In addition, when two unrelated `Claim`s appear in one paragraph, there is no relation between them either.

#### Relations

| Relations | Count | Percentage |
@@ -79,6 +78,12 @@ See further statistics in Stab & Gurevych (2017), p. 650, Table A.1.

See further description in Stab & Gurevych 2017, p.627 and the [annotation guideline](https://github.com/ArneBinder/pie-datasets/blob/db94035602610cefca2b1678aa2fe4455c96155d/data/datasets/ArgumentAnnotatedEssays-2.0/guideline.pdf).

**Note that** relations between `MajorClaim` and `Claim` were not annotated; however, each claim is annotated with an `Attribute` annotation with value `for` or `against`, which indicates its relation to the `MajorClaim`. In addition, when two unrelated `Claim`s appear in one paragraph, there is no relation between them either. An example document is shown below.

#### Example

![Example](img/sg17f2.png)

### Document Converters

The dataset provides document converters for the following target document types:
Binary file added dataset_builders/pie/aae2/img/sg17f2.png
32 changes: 18 additions & 14 deletions dataset_builders/pie/abstrct/README.md
@@ -41,19 +41,6 @@ doc = datasets["neoplasm_train"][0]
assert isinstance(doc, builders.brat.BratDocumentWithMergedSpans)
```

### Document Converters

The dataset provides document converters for the following target document types:

- `pytorch_ie.documents.TextDocumentWithLabeledSpansAndBinaryRelations`
- `LabeledSpans`, converted from `BratDocumentWithMergedSpans`'s `spans`
- labels: `MajorClaim`, `Claim`, `Premise`
- `BinaryRelations`, converted from `BratDocumentWithMergedSpans`'s `relations`
- labels: `Support`, `Partial-Attack`, `Attack`

See [here](https://github.com/ChristophAlt/pytorch-ie/blob/main/src/pytorch_ie/documents.py) for the document type
definitions.

### Data Splits

| Disease-based Split | `neoplasm` | `glaucoma` | `mixed` |
@@ -65,7 +52,7 @@ definitions.
- `mixed_test` contains 20 abstracts on the following diseases: glaucoma, neoplasm, diabetes, hypertension, hepatitis.
- 31 out of 40 abstracts in `mixed_test` overlap with abstracts in `neoplasm_test` and `glaucoma_test`.

### Label Descriptions
### Label Descriptions and Statistics

In this section, we describe labels according to [Mayer et al. (2020)](https://ebooks.iospress.nl/publication/55129), as well as our label counts on 669 abstracts.

@@ -105,6 +92,23 @@ Morio et al. ([2022](https://aclanthology.org/2022.tacl-1.37.pdf); p. 642, Table

(Mayer et al. 2020, p.2110)

#### Examples

![Examples](img/abstr-sam.png)

### Document Converters

The dataset provides document converters for the following target document types:

- `pytorch_ie.documents.TextDocumentWithLabeledSpansAndBinaryRelations`
- `labeled_spans`: `LabeledSpan` annotations, converted from `BratDocumentWithMergedSpans`'s `spans`
- labels: `MajorClaim`, `Claim`, `Premise`
- `binary_relations`: `BinaryRelation` annotations, converted from `BratDocumentWithMergedSpans`'s `relations`
- labels: `Support`, `Partial-Attack`, `Attack`

See [here](https://github.com/ChristophAlt/pytorch-ie/blob/main/src/pytorch_ie/documents.py) for the document type
definitions.

## Dataset Creation

### Curation Rationale
Binary file added dataset_builders/pie/abstrct/img/abstr-sam.png
7 changes: 6 additions & 1 deletion dataset_builders/pie/cdcp/README.md
@@ -1,4 +1,4 @@
# PIE Dataset Card for "CDCP"
# PIE Dataset Card for "cdcp"

This is a [PyTorch-IE](https://github.com/ChristophAlt/pytorch-ie) wrapper for the
[CDCP Huggingface dataset loading script](https://huggingface.co/datasets/DFKI-SLT/cdcp).
@@ -24,6 +24,11 @@ See [here](https://github.com/ChristophAlt/pytorch-ie/blob/main/src/pytorch_ie/a
The dataset provides document converters for the following target document types:

- `pytorch_ie.documents.TextDocumentWithLabeledSpansAndBinaryRelations`
- `labeled_spans`: `LabeledSpan` annotations, converted from `CDCPDocument`'s `propositions`
- labels: `fact`, `policy`, `reference`, `testimony`, `value`
- if a proposition has whitespace at the beginning and/or the end, that whitespace is trimmed.
- `binary_relations`: `BinaryRelation` annotations, converted from `CDCPDocument`'s `relations`
- labels: `reason`, `evidence`

See [here](https://github.com/ChristophAlt/pytorch-ie/blob/main/src/pytorch_ie/documents.py) for the document type
definitions.
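
The whitespace trimming mentioned for `labeled_spans` can be pictured with a minimal sketch. This is not the converter's actual code: `trim_span` is a hypothetical helper, and the example text and offsets are invented for illustration. It only assumes that spans are stored as character offsets into the document text.

```python
from typing import Tuple


def trim_span(text: str, start: int, end: int) -> Tuple[int, int]:
    """Shrink the half-open span [start, end) until it has no leading or trailing whitespace.

    Hypothetical helper for illustration; not part of pie-datasets.
    """
    # Move the start offset right past any leading whitespace.
    while start < end and text[start].isspace():
        start += 1
    # Move the end offset left past any trailing whitespace.
    while end > start and text[end - 1].isspace():
        end -= 1
    return start, end


# Invented example: the second proposition starts with a space.
text = "I paid late. The fee was unfair."
start, end = trim_span(text, 12, 32)
print(repr(text[start:end]))
```

Adjusting the offsets rather than the string keeps the spans aligned with the unchanged document text.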
65 changes: 33 additions & 32 deletions dataset_builders/pie/sciarg/README.md
@@ -42,41 +42,23 @@ from pie_datasets import load_dataset, builders
# load default version
datasets = load_dataset("pie/sciarg")
doc = datasets["train"][0]
assert isinstance(doc, builders.brat.BratDocument)

# load version with merged span fragments
dataset_merged_spans = load_dataset("pie/sciarg", name="merge_fragmented_spans")
doc_merged_spans = dataset_merged_spans["train"][0]
assert isinstance(doc_merged_spans, builders.brat.BratDocumentWithMergedSpans)
assert isinstance(doc, builders.brat.BratDocumentWithMergedSpans)
```

### Document Converters

The dataset provides document converters for the following target document types:

- `pytorch_ie.documents.TextDocumentWithLabeledSpansAndBinaryRelations`
- `LabeledSpans`, converted from `BratDocument`'s `spans`
- labels: `background_claim`, `own_claim`, `data`
- if a span has whitespace at the beginning and/or the end, that whitespace is trimmed.
- `BinaryRelations`, converted from `BratDocument`'s `relations`
- labels: `supports`, `contradicts`, `semantically_same`, `parts_of_same`
- relations labeled `semantically_same` or `parts_of_same` are merged if they have the same arguments after sorting.
- `pytorch_ie.documents.TextDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions`
- `LabeledSpans`, as above
- `BinaryRelations`, as above
- `LabeledPartitions`, partitioned `BratDocument`'s `text`, according to the paragraph, using regex.
- labels: `title`, `abstract`, `H1`

See [here](https://github.com/ChristophAlt/pytorch-ie/blob/main/src/pytorch_ie/documents.py) for the document type
definitions.

### Data Splits

The dataset consists of a single `train` split that has 40 documents.

For detailed statistics on the corpus, see Lauscher et al. ([2018](https://aclanthology.org/W18-5206/), p. 43), and the authors' [resource analysis](https://github.com/anlausch/sciarg_resource_analysis).

### Label Descriptions
### Label Descriptions and Statistics

In this section, we report our own corpus statistics; however, there are currently discrepancies in label counts between our report and:

- the previous report in [Lauscher et al., 2018](https://aclanthology.org/W18-5206/) (p. 43),
- the current report (labels counted in `BratDocument`s).

A possible reason is that [Lauscher et al., 2018](https://aclanthology.org/W18-5206/) report the numbers of actual argumentative components, whereas here discontinuous components are still split (marked with the `parts_of_same` helper relation) and are thus counted per fragment.
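
To illustrate how counting per fragment can differ from counting whole components, here is a minimal sketch. It assumes that fragments connected by `parts_of_same` relations belong to one component; the span ids and the `count_components` helper are made up for this example and do not come from the dataset or the builder.

```python
def count_components(span_ids, parts_of_same):
    """Count components after grouping fragments linked by `parts_of_same`.

    span_ids: iterable of fragment ids (hypothetical ids, for illustration).
    parts_of_same: iterable of (id_a, id_b) pairs linking fragments of one component.
    """
    span_ids = list(span_ids)
    parent = {s: s for s in span_ids}

    def find(x):
        # Follow parent links to the group representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in parts_of_same:
        parent[find(a)] = find(b)

    return len({find(s) for s in span_ids})


# Invented example: T1, T2 and T3 are fragments of one discontinuous component.
fragments = ["T1", "T2", "T3", "T4"]
links = [("T1", "T2"), ("T2", "T3")]
print("per fragment:", len(fragments))
print("per component:", count_components(fragments, links))
```

Under this assumption, the per-fragment count is an upper bound on the number of components, which is in line with the explanation given above.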

#### Components

@@ -117,14 +99,33 @@ For detailed statistics on the corpus, see Lauscher et al. ([2018](<(https://acl

(*Annotation Guidelines*, pp. 4-6)

**Important note on label counts**:
#### Examples

There are currently discrepancies in label counts between
![sample1](img/leaannof3.png)

- previous report in [Lauscher et al., 2018](https://aclanthology.org/W18-5206/) (p. 43),
- current report above here (labels counted in `BratDocument`'s);
Subset of relations in `A01`

possibly since [Lauscher et al., 2018](https://aclanthology.org/W18-5206/) presents the numbers of the real argumentative components, whereas here discontinuous components are still split (marked with the `parts_of_same` helper relation) and, thus, count per fragment.
![sample2](img/sciarg-sam.png)

### Document Converters

The dataset provides document converters for the following target document types:

- `pytorch_ie.documents.TextDocumentWithLabeledSpansAndBinaryRelations`
- `labeled_spans`: `LabeledSpan` annotations, converted from `BratDocument`'s `spans`
- labels: `background_claim`, `own_claim`, `data`
- if a span has whitespace at the beginning and/or the end, that whitespace is trimmed.
- `binary_relations`: `BinaryRelation` annotations, converted from `BratDocument`'s `relations`
- labels: `supports`, `contradicts`, `semantically_same`, `parts_of_same`
- relations labeled `semantically_same` or `parts_of_same` are merged if they have the same arguments after sorting.
- `pytorch_ie.documents.TextDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions`
- `labeled_spans`, as above
- `binary_relations`, as above
- `labeled_partitions`: `LabeledSpan` annotations, created by splitting `BratDocument`'s `text` at each new paragraph in the `xml` format.
- labels: `title`, `abstract`, `H1`

See [here](https://github.com/ChristophAlt/pytorch-ie/blob/main/src/pytorch_ie/documents.py) for the document type
definitions.
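
The merging of `semantically_same` and `parts_of_same` relations can be sketched as follows. This is only an illustration of the idea of sorting the arguments of symmetric relations and dropping duplicates; `normalize_relations`, the tuple representation, and the example offsets are assumptions and not the converter's actual implementation.

```python
SYMMETRIC_LABELS = {"semantically_same", "parts_of_same"}


def normalize_relations(relations):
    """Merge symmetric relations that connect the same pair of spans.

    relations: iterable of (head, tail, label), where head and tail are
    (start, end) character spans. Hypothetical representation for illustration.
    """
    seen = set()
    result = []
    for head, tail, label in relations:
        if label in SYMMETRIC_LABELS:
            # Direction carries no meaning here, so put the arguments in a canonical order.
            head, tail = sorted((head, tail))
        key = (head, tail, label)
        # Keep only the first occurrence of each normalized relation.
        if key not in seen:
            seen.add(key)
            result.append(key)
    return result


# Invented example: the first two relations differ only in argument order.
relations = [
    ((10, 20), (0, 5), "semantically_same"),
    ((0, 5), (10, 20), "semantically_same"),
    ((10, 20), (0, 5), "supports"),
]
for relation in normalize_relations(relations):
    print(relation)
```

Directed labels such as `supports` keep their argument order in this sketch, since swapping head and tail would change their meaning. Note that the sketch also drops exact duplicates of directed relations, which the converter may or may not do.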

## Dataset Creation

Binary file added dataset_builders/pie/sciarg/img/leaannof3.png
Binary file added dataset_builders/pie/sciarg/img/sciarg-sam.png
7 changes: 6 additions & 1 deletion dataset_builders/pie/scidtb_argmin/README.md
@@ -7,7 +7,7 @@ This is a [PyTorch-IE](https://github.com/ChristophAlt/pytorch-ie) wrapper for t

The document type for this dataset is `SciDTBArgminDocument` which defines the following data fields:

- `tokens` (Tuple of string)
- `tokens` (tuple of string)
- `id` (str, optional)
- `metadata` (dictionary, optional)

@@ -23,6 +23,11 @@ See [here](https://github.com/ChristophAlt/pytorch-ie/blob/main/src/pytorch_ie/a
The dataset provides document converters for the following target document types:

- `pytorch_ie.documents.TextDocumentWithLabeledSpansAndBinaryRelations`
- `labeled_spans`: `LabeledSpan` annotations, converted from `SciDTBArgminDocument`'s `units`
- labels: `proposal`, `assertion`, `result`, `observation`, `means`, `description`
- the `tokens` tuple is joined with whitespace to create the `text` for the `LabeledSpan` annotations
- `binary_relations`: `BinaryRelation` annotations, converted from `SciDTBArgminDocument`'s `relations`
- labels: `support`, `attack`, `additional`, `detail`, `sequence`

See [here](https://github.com/ChristophAlt/pytorch-ie/blob/main/src/pytorch_ie/documents.py) for the document type
definitions.
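
Joining tokens into a text means that token-level units have to be mapped to character offsets. The sketch below shows one way to do this, assuming a single space as separator and half-open token ranges; the helper names and the example sentence are hypothetical and not taken from the builder.

```python
def join_tokens(tokens):
    """Join tokens with single spaces and record each token's character offsets."""
    offsets = []
    pos = 0
    for i, token in enumerate(tokens):
        if i > 0:
            pos += 1  # account for the separating space
        offsets.append((pos, pos + len(token)))
        pos += len(token)
    return " ".join(tokens), offsets


def token_span_to_char_span(offsets, start_token, end_token):
    """Map a half-open token range [start_token, end_token) to character offsets."""
    return offsets[start_token][0], offsets[end_token - 1][1]


# Invented example document.
tokens = ("We", "propose", "a", "parser", ".")
text, offsets = join_tokens(tokens)
start, end = token_span_to_char_span(offsets, 1, 4)
print(repr(text))
print(repr(text[start:end]))
```

Recording the offsets while joining avoids searching the text for each token afterwards, which would be ambiguous for repeated tokens.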
5 changes: 3 additions & 2 deletions tests/dataset_builders/pie/test_argmicro.py
@@ -27,6 +27,7 @@
disable_caching()

DATASET_NAME = "argmicro"
BUILDER_CLASS = ArgMicro
SPLIT_SIZES = {"train": 112}
DATA_PATH = FIXTURES_ROOT / "dataset_builders" / "arg-microtexts-master.zip"
HF_DATASET_PATH = ArgMicro.BASE_DATASET_PATH
@@ -46,7 +47,7 @@ def hf_dataset(dataset_variant):
@pytest.fixture(scope="module")
def generate_document_kwargs(hf_dataset, dataset_variant):
ds = hf_dataset["train"]
return ArgMicro(config_name=dataset_variant)._generate_document_kwargs(ds)
return BUILDER_CLASS(config_name=dataset_variant)._generate_document_kwargs(ds)


def test_hf_dataset(hf_dataset, dataset_variant, generate_document_kwargs):
@@ -135,7 +136,7 @@ def test_hf_example(hf_example, hf_dataset, dataset_variant, generate_document_k

@pytest.fixture(scope="module")
def generated_document(hf_dataset, dataset_variant, generate_document_kwargs):
return ArgMicro(config_name=dataset_variant)._generate_document(
return BUILDER_CLASS(config_name=dataset_variant)._generate_document(
hf_dataset["train"][0], **generate_document_kwargs
)

5 changes: 3 additions & 2 deletions tests/dataset_builders/pie/test_cdcp.py
@@ -30,6 +30,7 @@
disable_caching()

DATASET_NAME = "cdcp"
BUILDER_CLASS = CDCP
SPLIT_SIZES = {"train": 581, "test": 150}
HF_DATASET_PATH = CDCP.BASE_DATASET_PATH
PIE_DATASET_PATH = PIE_BASE_PATH / DATASET_NAME
@@ -101,12 +102,12 @@ def test_hf_example(hf_example, split):

@pytest.fixture(scope="module")
def generate_document_kwargs(hf_dataset, split):
return CDCP()._generate_document_kwargs(hf_dataset[split])
return BUILDER_CLASS()._generate_document_kwargs(hf_dataset[split])


@pytest.fixture(scope="module")
def generated_document(hf_example, generate_document_kwargs):
return CDCP()._generate_document(hf_example, **generate_document_kwargs)
return BUILDER_CLASS()._generate_document(hf_example, **generate_document_kwargs)


def test_generated_document(generated_document, split):
5 changes: 3 additions & 2 deletions tests/dataset_builders/pie/test_scidtb_argmin.py
@@ -26,6 +26,7 @@
disable_caching()

DATASET_NAME = "scidtb_argmin"
BUILDER_CLASS = SciDTBArgmin
SPLIT_SIZES = {"train": 60}
HF_DATASET_PATH = SciDTBArgmin.BASE_DATASET_PATH
PIE_DATASET_PATH = PIE_BASE_PATH / DATASET_NAME
@@ -55,12 +56,12 @@ def test_hf_example(hf_example):

@pytest.fixture(scope="module")
def generate_document_kwargs(hf_dataset):
return SciDTBArgmin()._generate_document_kwargs(hf_dataset["train"])
return BUILDER_CLASS()._generate_document_kwargs(hf_dataset["train"])


@pytest.fixture(scope="module")
def generated_document(hf_example, generate_document_kwargs):
return SciDTBArgmin()._generate_document(hf_example, **generate_document_kwargs)
return BUILDER_CLASS()._generate_document(hf_example, **generate_document_kwargs)


def test_generated_document(generated_document):