-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Merge pull request #61 from ArneBinder/add_sciarg_dataset
add `sciarg` dataset
- Loading branch information
Showing
12 changed files
with
3,237 additions
and
2 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,234 @@ | ||
# PIE Dataset Card for "sciarg" | ||
|
||
This is a [PyTorch-IE](https://github.com/ChristophAlt/pytorch-ie) wrapper for the SciArg dataset ([paper](https://aclanthology.org/W18-5206/) and [data repository](https://github.com/anlausch/sciarg_resource_analysis)). Since the SciArg dataset is published in the [BRAT standoff format](https://brat.nlplab.org/standoff.html), this dataset builder is based on the [PyTorch-IE brat dataset loading script](https://huggingface.co/datasets/pie/brat). | ||
|
||
Therefore, the `sciarg` dataset as described here follows the data structure from the [PIE brat dataset card](https://huggingface.co/datasets/pie/brat). | ||
|
||
### Dataset Summary | ||
|
||
The SciArg dataset is an extension of the Dr. Inventor corpus (Fisas et al., [2015](https://aclanthology.org/W15-1605.pdf), [2016](https://aclanthology.org/L16-1492.pdf)) with an annotation layer containing | ||
fine-grained argumentative components and relations, believing that argumentation needs to | ||
be studied in combination with other rhetorical aspects. It is the first publicly-available argument-annotated corpus of scientific publications (in English), which allows for joint analyses of argumentation and other | ||
rhetorical dimensions of scientific writing." ([Lauscher et al., 2018](<(https://aclanthology.org/W18-5206/)>), pp. 40-41) | ||
|
||
### Supported Tasks and Leaderboards | ||
|
||
- **Tasks**: Argumentation Mining, Component Identification, Relation Identification | ||
- **Leaderboard:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards) | ||
|
||
### Languages | ||
|
||
The language in the dataset is English (scientific academic publications on computer graphics). | ||
|
||
### Dataset Variants | ||
|
||
The `sciarg` dataset comes in a single version (`default`) with `BratDocumentWithMergedSpans` as document type. Note, | ||
that this in contrast to the base `brat` dataset, where the document type for the `default` variant is `BratDocument`. | ||
The reason is that the SciArg dataset was published with spans that are just fragmented by whitespace which seems | ||
to be because of the annotation tool used. In the `sciarg` dataset, we merge these fragments, so that the document type | ||
can be `BratDocumentWithMergedSpans` (this is easier to handle for most of the task modules). However, fragmented | ||
spans are conceptually also available in SciArg, but they are marked with the `parts_of_same` relation which are kept | ||
as they are in the `sciarg` (`default`) dataset. | ||
|
||
### Data Schema | ||
|
||
See [PIE-Brat Data Schema](https://huggingface.co/datasets/pie/brat#data-schema). | ||
|
||
### Usage | ||
|
||
```python | ||
from pie_datasets import load_dataset, builders | ||
|
||
# load default version | ||
datasets = load_dataset("pie/sciarg") | ||
doc = datasets["train"][0] | ||
assert isinstance(doc, builders.brat.BratDocument) | ||
|
||
# load version with merged span fragments | ||
dataset_merged_spans = load_dataset("pie/sciarg", name="merge_fragmented_spans") | ||
doc_merged_spans = dataset_merged_spans["train"][0] | ||
assert isinstance(doc_merged_spans, builders.brat.BratDocumentWithMergedSpans) | ||
``` | ||
|
||
### Document Converters | ||
|
||
The dataset provides document converters for the following target document types: | ||
|
||
- `pytorch_ie.documents.TextDocumentWithLabeledSpansAndBinaryRelations` | ||
- `LabeledSpans`, converted from `BratDocument`'s `spans` | ||
- labels: `background_claim`, `own_claim`, `data` | ||
- if `spans` contain whitespace at the beginning and/or the end, the whitespace are trimmed out. | ||
- `BinraryRelations`, converted from `BratDocument`'s `relations` | ||
- labels: `supports`, `contradicts`, `semantically_same`, `parts_of_same` | ||
- if the `relations` label is `semantically_same` or `parts_of_same`, they are merged if they are the same arguments after sorting. | ||
- `pytorch_ie.documents.TextDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions` | ||
- `LabeledSpans`, as above | ||
- `BinaryRelations`, as above | ||
- `LabeledPartitions`, partitioned `BratDocument`'s `text`, according to the paragraph, using regex. | ||
- labels: `title`, `abstract`, `H1` | ||
|
||
See [here](https://github.com/ChristophAlt/pytorch-ie/blob/main/src/pytorch_ie/documents.py) for the document type | ||
definitions. | ||
|
||
### Data Splits | ||
|
||
The dataset consists of a single `train` split that has 40 documents. | ||
|
||
For detailed statistics on the corpus, see Lauscher et al. ([2018](<(https://aclanthology.org/W18-5206/)>), p. 43), and the author's [resource analysis](https://github.com/anlausch/sciarg_resource_analysis). | ||
|
||
### Label Descriptions | ||
|
||
#### Components | ||
|
||
| Components | Count | Percentage | | ||
| ------------------ | ----: | ---------: | | ||
| `background_claim` | 3291 | 24.2 % | | ||
| `own_claim` | 6004 | 44.2 % | | ||
| `data` | 4297 | 31.6 % | | ||
|
||
- `own_claim` is an argumentative statement that closely relates to the authors’ own work. | ||
- `background_claim` an argumentative statement relating to the background of authors’ work, e.g., about related work or common practices. | ||
- `data` component represents a fact that serves as evidence for or against a claim. Note that references or (factual) examples can also serve as data. | ||
(Lauscher et al. 2018, p.41; following and simplified [Toulmin, 2003](https://www.cambridge.org/core/books/uses-of-argument/26CF801BC12004587B66778297D5567C)) | ||
|
||
#### Relations | ||
|
||
| Relations | Count | Percentage | | ||
| -------------------------- | ----: | ---------: | | ||
| support: `support` | 5791 | 74.0 % | | ||
| attack: `contradict` | 697 | 8.9 % | | ||
| other: `semantically_same` | 44 | 0.6 % | | ||
| other: `parts_of_same` | 1298 | 16.6 % | | ||
|
||
##### Argumentative relations | ||
|
||
- `support`: | ||
- if the assumed veracity of *b* increases with the veracity of *a* | ||
- "Usually, this relationship exists from data to claim, but in many cases a claim might support another claim. Other combinations are still possible." - (*Annotation Guidelines*, p. 3) | ||
- `contradict`: | ||
- if the assumed veracity of *b* decreases with the veracity of *a* | ||
- It is a **bi-directional**, i.e., symmetric relationship. | ||
|
||
##### Non-argumentative relations | ||
|
||
- `semantically_same`: between two mentions of effectively the same claim or data component. Can be seen as *argument coreference*, analogous to entity, and *event coreference*. This relation is considered symmetric (i.e., **bidirectional**) and non-argumentative. | ||
(Lauscher et al. 2018, p.41; following [Dung, 1995](https://www.sciencedirect.com/science/article/pii/000437029400041X?via%3Dihub)) | ||
- `parts_of_same`: when a single component is split up in several parts. It is **non-argumentative**, **bidirectional**, but also **intra-component** | ||
|
||
(*Annotation Guidelines*, pp. 4-6) | ||
|
||
**Important note on label counts**: | ||
|
||
There are currently discrepancies in label counts between | ||
|
||
- previous report in [Lauscher et al., 2018](https://aclanthology.org/W18-5206/), p. 43), | ||
- current report above here (labels counted in `BratDocument`'s); | ||
|
||
possibly since [Lauscher et al., 2018](https://aclanthology.org/W18-5206/) presents the numbers of the real argumentative components, whereas here discontinuous components are still split (marked with the `parts_of_same` helper relation) and, thus, count per fragment. | ||
|
||
## Dataset Creation | ||
|
||
### Curation Rationale | ||
|
||
"\[C\]omputational methods for analyzing scientific writing are becoming paramount...there is no publicly available corpus of scientific publications (in English), annotated with fine-grained argumentative structures. ...\[A\]rgumentative structure of scientific publications should not be studied in isolation, but rather in relation to other rhetorical aspects, such as the | ||
discourse structure. | ||
(Lauscher et al. 2018, p. 40) | ||
|
||
### Source Data | ||
|
||
#### Initial Data Collection and Normalization | ||
|
||
"\[W\]e randomly selected a set of 40 documents, available in PDF format, among a bigger collection provided by experts in the domain, who pre-selected a representative sample of articles in Computer Graphics. Articles were classified into four important subjects in this area: Skinning, Motion Capture, Fluid Simulation and Cloth Simulation. We included in the corpus 10 highly representative articles for each subject." (Fisas et al. 2015, p. 44) | ||
|
||
"The Corpus includes 10,789 sentences, with an average of 269.7 sentences per document." (p. 45) | ||
|
||
#### Who are the source language producers? | ||
|
||
It can be implied from the data source that the language producers were academics in computer graphics and related fields, possibly assisted by other human editors. | ||
|
||
### Annotations | ||
|
||
#### Annotation process | ||
|
||
"We trained the four annotators in a calibration phase, consisting of five iterations, in each of which all annotators annotated one publication. After each iteration we computed the inter-annotator agreement (IAA), discussed the disagreements, and, if needed, adjourned the [annotation guidelines](https://data.dws.informatik.uni-mannheim.de/sci-arg/annotation_guidelines.pdf)." | ||
|
||
The detailed evolution of IAA over the five calibration iterations is depicted in Lauscher et al. (2018), p. 42, Figure 1. | ||
|
||
The annotation were done using BRAT Rapid Annotation Tool ([Stenetorp et al., 2012](https://aclanthology.org/E12-2021/)). | ||
|
||
#### Who are the annotators? | ||
|
||
"We hired one expert (a researcher in computational linguistics) and three non-expert annotators (humanities and social sciences scholars)." (Lauscher et al. 2018, p. 42) | ||
|
||
### Personal and Sensitive Information | ||
|
||
\[More Information Needed\] | ||
|
||
## Considerations for Using the Data | ||
|
||
### Social Impact of Dataset | ||
|
||
"To support learning-based models for automated analysis of scientific publications, potentially leading to better understanding | ||
of the different rhetorical aspects of scientific language (which we dub *scitorics*)." (Lauscher et al. 2018, p. 40) | ||
|
||
"The resulting corpus... is, to the best of our knowledge, the first argument-annotated corpus of scientific publications in English, enables (1) computational analysis of argumentation in scientific writing and (2) integrated analysis of argumentation and other rhetorical aspects of scientific text." (Lauscher et al. 2018, p. 44) | ||
|
||
### Discussion of Biases | ||
|
||
"...not all claims are supported and secondly, claims can be supported by other claims. There are many more supports than contradicts relations." | ||
|
||
"While the background claims and own claims are on average of similar length (85 and 87 characters, respectively), they are much longer than data components (average of 25 characters)." | ||
|
||
"\[A\]nnotators identified an average of 141 connected component per publication...This indicates that either authors write very short argumentative chains or that our annotators had difficulties noticing long-range argumentative dependencies." | ||
|
||
(Lauscher et al. 2018, p.43) | ||
|
||
### Other Known Limitations | ||
|
||
"Expectedly, we observe higher agreements with more calibration. The agreement on argumentative relations is 23% lower than on the components, which we think is due to the high ambiguity of argumentation structures." | ||
|
||
"Additionally, disagreements in component identification are propagated to relations as well, since the agreement on a relation implies the agreement on annotated components at both ends of the relation." | ||
|
||
(Lauscher et al. 2018, p. 43) | ||
|
||
## Additional Information | ||
|
||
### Dataset Curators | ||
|
||
- **Repository:** [https://github.com/anlausch/ArguminSci](https://github.com/anlausch/ArguminSci) | ||
|
||
### Licensing Information | ||
|
||
[MIT License](https://github.com/anlausch/ArguminSci/blob/master/LICENSE) | ||
|
||
This research was partly funded by the German Research Foundation (DFG), grant number EC 477/5-1 (LOC-DB). | ||
|
||
### Citation Information | ||
|
||
``` | ||
@inproceedings{lauscher2018b, | ||
title = {An argument-annotated corpus of scientific publications}, | ||
booktitle = {Proceedings of the 5th Workshop on Mining Argumentation}, | ||
publisher = {Association for Computational Linguistics}, | ||
author = {Lauscher, Anne and Glava\v{s}, Goran and Ponzetto, Simone Paolo}, | ||
address = {Brussels, Belgium}, | ||
year = {2018}, | ||
pages = {40–46} | ||
} | ||
``` | ||
|
||
``` | ||
@inproceedings{lauscher2018a, | ||
title = {ArguminSci: A Tool for Analyzing Argumentation and Rhetorical Aspects in Scientific Writing}, | ||
booktitle = {Proceedings of the 5th Workshop on Mining Argumentation}, | ||
publisher = {Association for Computational Linguistics}, | ||
author = {Lauscher, Anne and Glava\v{s}, Goran and Eckert, Kai}, | ||
address = {Brussels, Belgium}, | ||
year = {2018}, | ||
pages = {22–28} | ||
} | ||
``` | ||
|
||
### Contributions | ||
|
||
Thanks to [@ArneBinder](https://github.com/ArneBinder) and [@idalr](https://github.com/idalr) for adding this dataset. |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
pie-datasets>=0.6.0,<0.9.0 | ||
pie-modules>=0.8.0,<0.9.0 |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,71 @@ | ||
from pie_modules.document.processing import ( | ||
RegexPartitioner, | ||
RelationArgumentSorter, | ||
TextSpanTrimmer, | ||
) | ||
from pytorch_ie.core import Document | ||
from pytorch_ie.documents import ( | ||
TextDocumentWithLabeledSpansAndBinaryRelations, | ||
TextDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions, | ||
) | ||
|
||
from pie_datasets.builders import BratBuilder, BratConfig | ||
from pie_datasets.builders.brat import BratDocumentWithMergedSpans | ||
from pie_datasets.document.processing import Caster, Pipeline | ||
|
||
URL = "http://data.dws.informatik.uni-mannheim.de/sci-arg/compiled_corpus.zip" | ||
SPLIT_PATHS = {"train": "compiled_corpus"} | ||
|
||
|
||
def get_common_pipeline_steps(target_document_type: type[Document]) -> dict: | ||
return dict( | ||
cast=Caster( | ||
document_type=target_document_type, | ||
field_mapping={"spans": "labeled_spans", "relations": "binary_relations"}, | ||
), | ||
trim_adus=TextSpanTrimmer(layer="labeled_spans"), | ||
sort_symmetric_relation_arguments=RelationArgumentSorter( | ||
relation_layer="binary_relations", | ||
label_whitelist=["parts_of_same", "semantically_same"], | ||
), | ||
) | ||
|
||
|
||
class SciArg(BratBuilder): | ||
BASE_DATASET_PATH = "DFKI-SLT/brat" | ||
BASE_DATASET_REVISION = "844de61e8a00dc6a93fc29dc185f6e617131fbf1" | ||
|
||
# Overwrite the default config to merge the span fragments. | ||
# The span fragments in SciArg come just from the new line splits, so we can merge them. | ||
# Actual span fragments are annotated via "parts_of_same" relations. | ||
BUILDER_CONFIGS = [ | ||
BratConfig(name=BratBuilder.DEFAULT_CONFIG_NAME, merge_fragmented_spans=True), | ||
] | ||
DOCUMENT_TYPES = { | ||
BratBuilder.DEFAULT_CONFIG_NAME: BratDocumentWithMergedSpans, | ||
} | ||
|
||
# we need to add None to the list of dataset variants to support the default dataset variant | ||
BASE_BUILDER_KWARGS_DICT = { | ||
dataset_variant: {"url": URL, "split_paths": SPLIT_PATHS} | ||
for dataset_variant in ["default", "merge_fragmented_spans", None] | ||
} | ||
|
||
DOCUMENT_CONVERTERS = { | ||
TextDocumentWithLabeledSpansAndBinaryRelations: Pipeline( | ||
**get_common_pipeline_steps(TextDocumentWithLabeledSpansAndBinaryRelations) | ||
), | ||
TextDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions: Pipeline( | ||
**get_common_pipeline_steps( | ||
TextDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions | ||
), | ||
add_partitions=RegexPartitioner( | ||
partition_layer_name="labeled_partitions", | ||
pattern="<([^>/]+)>.*</\\1>", | ||
label_group_id=1, | ||
label_whitelist=["Title", "Abstract", "H1"], | ||
skip_initial_partition=True, | ||
strip_whitespace=True, | ||
), | ||
), | ||
} |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.