Commit

update all pre-commit hooks via autoupdate
ArneBinder committed Jan 6, 2025
1 parent 9d8030d commit ed97c40
Showing 11 changed files with 64 additions and 66 deletions.
20 changes: 10 additions & 10 deletions .pre-commit-config.yaml
@@ -5,7 +5,7 @@ exclude: '^tests/fixtures/.*|^data/.*'

repos:
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v4.4.0
rev: v5.0.0
hooks:
# list of supported hooks: https://pre-commit.com/hooks.html
- id: trailing-whitespace
@@ -23,21 +23,21 @@ repos:

# python code formatting
- repo: https://github.com/psf/black
rev: 23.7.0
rev: 24.10.0
hooks:
- id: black
args: [--line-length, "99"]

# python import sorting
- repo: https://github.com/PyCQA/isort
rev: 5.12.0
rev: 5.13.2
hooks:
- id: isort
args: ["--profile", "black", "--filter-files"]

# python upgrading syntax to newer version
- repo: https://github.com/asottile/pyupgrade
rev: v3.9.0
rev: v3.19.1
hooks:
- id: pyupgrade
args: [--py38-plus]
@@ -46,17 +46,17 @@ repos:
- repo: https://github.com/myint/docformatter
# rev: v1.7.6
# as long as https://github.com/PyCQA/docformatter/pull/287 is not yet released
rev: 06907d0
rev: v1.7.5
hooks:
- id: docformatter
args: [--in-place, --wrap-summaries=99, --wrap-descriptions=99]

# python check (PEP8), programming errors and code complexity
- repo: https://github.com/PyCQA/flake8
rev: 6.0.0
rev: 7.1.1
hooks:
- id: flake8
args: ["--ignore", "E501,F401,F841,W503,E203", "--extend-select", "W504", "--exclude", "logs/*"]
args: ["--ignore", "E501,F401,F841,W503,E203,E704", "--extend-select", "W504", "--exclude", "logs/*"]

# python security linter
# - repo: https://github.com/PyCQA/bandit
@@ -68,7 +68,7 @@ repos:

# md formatting
- repo: https://github.com/executablebooks/mdformat
rev: 0.7.16
rev: 0.7.21
hooks:
- id: mdformat
args: ["--number"]
@@ -81,7 +81,7 @@ repos:

# word spelling linter
- repo: https://github.com/codespell-project/codespell
rev: v2.2.5
rev: v2.3.0
hooks:
- id: codespell
args:
@@ -92,7 +92,7 @@ repos:

# python static type checking
- repo: https://github.com/pre-commit/mirrors-mypy
rev: v1.4.1
rev: v1.14.1
hooks:
- id: mypy
files: src
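
The version bumps in this file can be summarized with a short script. This is a minimal sketch: the old/new `rev` pairs are copied from the hunks above, while the short repository names and the helper functions are illustrative only. It flags hooks whose first version component changed. Note that this is only a rough signal, since not every project follows semantic versioning (black, for example, uses a calendar-based scheme).

```python
# Old/new `rev:` pairs copied from the .pre-commit-config.yaml hunks above.
# The short repository names used as keys are chosen for this sketch.
REV_BUMPS = {
    "pre-commit-hooks": ("v4.4.0", "v5.0.0"),
    "black": ("23.7.0", "24.10.0"),
    "isort": ("5.12.0", "5.13.2"),
    "pyupgrade": ("v3.9.0", "v3.19.1"),
    "docformatter": ("06907d0", "v1.7.5"),
    "flake8": ("6.0.0", "7.1.1"),
    "mdformat": ("0.7.16", "0.7.21"),
    "codespell": ("v2.2.5", "v2.3.0"),
    "mirrors-mypy": ("v1.4.1", "v1.14.1"),
}


def leading_number(rev):
    """Return the first numeric component of a tag such as 'v4.4.0'.

    Returns None for revisions that do not look like dotted version tags,
    e.g. the bare commit hash previously pinned for docformatter.
    """
    head = rev.lstrip("v").split(".")[0]
    return int(head) if "." in rev and head.isdigit() else None


def first_component_changed(old, new):
    """True if both revisions are version tags and their first component differs."""
    a, b = leading_number(old), leading_number(new)
    return a is not None and b is not None and a != b


changed = sorted(
    name for name, (old, new) in REV_BUMPS.items() if first_component_changed(old, new)
)
print(changed)
```

Hooks reported by this sketch are the ones most likely to change formatting or linting behavior, which matches the follow-up edits to Python and Markdown files later in this commit.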
16 changes: 8 additions & 8 deletions dataset_builders/pie/aae2/README.md
@@ -53,11 +53,11 @@ See [PIE-Brat Data Schema](https://huggingface.co/datasets/pie/brat#data-schema)

### Data Splits

| Statistics | Train | Test |
| ---------------------------------------------------------------- | -------------------------: | -----------------------: |
| No. of document | 322 | 80 |
| Components <br/>- `MajorClaim`<br/>- `Claim`<br/>- `Premise` | <br/>598<br/>1202<br/>3023 | <br/>153<br/>304<br/>809 |
| Relations\*<br/>- `supports`<br/>- `attacks` | <br/>3820<br/>405 | <br/>1021<br/>92 |
| Statistics | Train | Test |
| ------------------------------------------------------------ | -------------------------: | -----------------------: |
| No. of document | 322 | 80 |
| Components <br/>- `MajorClaim`<br/>- `Claim`<br/>- `Premise` | <br/>598<br/>1202<br/>3023 | <br/>153<br/>304<br/>809 |
| Relations\*<br/>- `supports`<br/>- `attacks` | <br/>3820<br/>405 | <br/>1021<br/>92 |

\* included all relations between claims and premises and all claim attributions.

@@ -90,7 +90,7 @@ See further statistics in Stab & Gurevych (2017), p. 650, Table A.1.

See further description in Stab & Gurevych 2017, p.627 and the [annotation guideline](https://github.com/ArneBinder/pie-datasets/blob/db94035602610cefca2b1678aa2fe4455c96155d/data/datasets/ArgumentAnnotatedEssays-2.0/guideline.pdf).

**Note that** relations between `MajorClaim` and `Claim` were not annotated; however, each claim is annotated with an `Attribute` annotation with value `for` or `against` - which indicates the relation between itself and `MajorClaim`. In addition, when two non-related `Claim` 's appear in one paragraph, there is also no relations to one another. An example of a document is shown here below.
**Note that** relations between `MajorClaim` and `Claim` were not annotated; however, each claim is annotated with an `Attribute` annotation with value `for` or `against` - which indicates the relation between itself and `MajorClaim`. In addition, when two non-related `Claim` 's appear in one paragraph, there is also no relations to one another. An example of a document is shown here below.

#### Example

@@ -351,7 +351,7 @@ Three non-native speakers; one of the three being an expert annotator.

### Social Impact of Dataset

"\[Computational Argumentation\] have
"[Computational Argumentation] have
broad application potential in various areas such as legal decision support (Mochales-Palau and Moens 2009), information retrieval (Carstens and Toni 2015), policy making (Sardianos et al. 2015), and debating technologies (Levy et al. 2014; Rinott et al.
2015)." (p. 619)

@@ -366,7 +366,7 @@ The relations between claims and major claims are not explicitly annotated.
"The proportion of non-argumentative text amounts to 47,474 tokens (32.2%) and
1,631 sentences (22.9%). The number of sentences with several argument components
is 583, of which 302 include several components with different types (e.g., a claim followed by premise)...
\[T\]he identification of argument components requires the
[T]he identification of argument components requires the
separation of argumentative from non-argumentative text units and the recognition of
component boundaries at the token level...The proportion of paragraphs with unlinked
argument components (e.g., unsupported claims without incoming relations) is 421
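
The README note in this file says that relations between `MajorClaim` and `Claim` are not annotated, but that each claim carries an attribute with value `for` or `against`. A minimal sketch of how such implicit relations could be derived is shown below. The plain-tuple data layout, the function name, the mapping of `for`/`against` to `supports`/`attacks`, and the choice to link every claim to every major claim are assumptions made for illustration; they are not taken from the dataset builder.

```python
from typing import Dict, List, Tuple

# Assumed mapping from the claim attribute value to a relation label.
STANCE_TO_RELATION = {"for": "supports", "against": "attacks"}


def implicit_major_claim_relations(
    claim_stances: Dict[str, str], major_claims: List[str]
) -> List[Tuple[str, str, str]]:
    """Build (claim_id, relation_label, major_claim_id) triples.

    `claim_stances` maps a claim id to its stance attribute ("for" or "against").
    Every claim is linked to every major claim of the same document, which is
    an assumption of this sketch.
    """
    return [
        (claim_id, STANCE_TO_RELATION[stance], major_claim_id)
        for claim_id, stance in claim_stances.items()
        for major_claim_id in major_claims
    ]


# Hypothetical annotation ids, in the style of BRAT text-bound ids.
relations = implicit_major_claim_relations({"T3": "for", "T4": "against"}, ["T1", "T2"])
for relation in relations:
    print(relation)
```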
16 changes: 8 additions & 8 deletions dataset_builders/pie/abstrct/README.md
@@ -61,7 +61,7 @@ The dataset provides document converters for the following target document types
- `LabeledSpans`, converted from `BratDocumentWithMergedSpans`'s `spans`
- labels: `MajorClaim`, `Claim`, `Premise`
- `BinraryRelations`, converted from `BratDocumentWithMergedSpans`'s `relations`
- labels: `Support`, `Partial-Attack`, `Attack`
- labels: `Support`, `Partial-Attack`, `Attack`

See [here](https://github.com/ChristophAlt/pytorch-ie/blob/main/src/pytorch_ie/documents.py) for the document type definitions.

@@ -93,7 +93,7 @@ Morio et al. ([2022](https://aclanthology.org/2022.tacl-1.37.pdf); p. 642, Table

- `MajorClaim` are more general/concluding `claim`'s, which is supported by more specific claims
- `Claim` is a concluding statement made by the author about the outcome of the study. Claims only points to other claims.
- `Premise` (a.k.a. evidence) is an observation or measurement in the study, which supports or attacks another argument component, usually a `claim`. They are observed facts, and therefore credible without further justifications, as this is the ground truth the argumentation is based on.
- `Premise` (a.k.a. evidence) is an observation or measurement in the study, which supports or attacks another argument component, usually a `claim`. They are observed facts, and therefore credible without further justifications, as this is the ground truth the argumentation is based on.

(Mayer et al. 2020, p.2110)

@@ -354,7 +354,7 @@ python src/evaluate_documents.py dataset=abstrct_base metric=count_text_tokens

### Curation Rationale

"\[D\]espite its natural employment in healthcare applications, only few approaches have applied AM methods to this kind
"[D]espite its natural employment in healthcare applications, only few approaches have applied AM methods to this kind
of text, and their contribution is limited to the detection
of argument components, disregarding the more complex phase of
predicting the relations among them. In addition, no huge annotated
@@ -373,7 +373,7 @@ Extended from the previous dataset in [Mayer et al. 2018](https://webusers.i3s.u

#### Who are the source language producers?

\[More Information Needed\]
[More Information Needed]

### Annotations

@@ -405,7 +405,7 @@ Two annotators with background in computational linguistics. No information was

### Personal and Sensitive Information

\[More Information Needed\]
[More Information Needed]

## Considerations for Using the Data

@@ -426,17 +426,17 @@ scale." (p. 2114)

### Discussion of Biases

\[More Information Needed\]
[More Information Needed]

### Other Known Limitations

\[More Information Needed\]
[More Information Needed]

## Additional Information

### Dataset Curators

\[More Information Needed\]
[More Information Needed]

### Licensing Information

2 changes: 1 addition & 1 deletion dataset_builders/pie/cdcp/README.md
@@ -49,7 +49,7 @@ The dataset provides document converters for the following target document types
- labels: `fact`, `policy`, `reference`, `testimony`, `value`
- if `propositions` contain whitespace at the beginning and/or the end, the whitespace are trimmed out.
- `binary_relations`: `BinaryRelation` annotations, converted from `CDCPDocument`'s `relations`
- labels: `reason`, `evidence`
- labels: `reason`, `evidence`

See [here](https://github.com/ChristophAlt/pytorch-ie/blob/main/src/pytorch_ie/documents.py) for the document type
definitions.
@@ -447,9 +447,9 @@ def _generate_document_kwargs(self, dataset):
pos_tags_feature = dataset.features["sentences"][0]["pos_tags"].feature
return dict(
entity_labels=dataset.features["sentences"][0]["named_entities"].feature,
pos_tag_labels=pos_tags_feature
if isinstance(pos_tags_feature, datasets.ClassLabel)
else None,
pos_tag_labels=(
pos_tags_feature if isinstance(pos_tags_feature, datasets.ClassLabel) else None
),
)

def _generate_document(self, example, **document_kwargs):
10 changes: 5 additions & 5 deletions dataset_builders/pie/sciarg/README.md
@@ -155,7 +155,7 @@ possibly since [Lauscher et al., 2018](https://aclanthology.org/W18-5206/) prese

- `supports`:
- if the assumed veracity of *b* increases with the veracity of *a*
- "Usually, this relationship exists from data to claim, but in many cases a claim might support another claim. Other combinations are still possible." - (*Annotation Guidelines*, p. 3)
- "Usually, this relationship exists from data to claim, but in many cases a claim might support another claim. Other combinations are still possible." - (*Annotation Guidelines*, p. 3)
- `contradicts`:
- if the assumed veracity of *b* decreases with the veracity of *a*
- It is a **bi-directional**, i.e., symmetric relationship.
@@ -335,15 +335,15 @@ python src/evaluate_documents.py dataset=sciarg_base metric=count_text_tokens

### Curation Rationale

"\[C\]omputational methods for analyzing scientific writing are becoming paramount...there is no publicly available corpus of scientific publications (in English), annotated with fine-grained argumentative structures. ...\[A\]rgumentative structure of scientific publications should not be studied in isolation, but rather in relation to other rhetorical aspects, such as the
"[C]omputational methods for analyzing scientific writing are becoming paramount...there is no publicly available corpus of scientific publications (in English), annotated with fine-grained argumentative structures. ...[A]rgumentative structure of scientific publications should not be studied in isolation, but rather in relation to other rhetorical aspects, such as the
discourse structure.
(Lauscher et al. 2018, p. 40)

### Source Data

#### Initial Data Collection and Normalization

"\[W\]e randomly selected a set of 40 documents, available in PDF format, among a bigger collection provided by experts in the domain, who pre-selected a representative sample of articles in Computer Graphics. Articles were classified into four important subjects in this area: Skinning, Motion Capture, Fluid Simulation and Cloth Simulation. We included in the corpus 10 highly representative articles for each subject." (Fisas et al. 2015, p. 44)
"[W]e randomly selected a set of 40 documents, available in PDF format, among a bigger collection provided by experts in the domain, who pre-selected a representative sample of articles in Computer Graphics. Articles were classified into four important subjects in this area: Skinning, Motion Capture, Fluid Simulation and Cloth Simulation. We included in the corpus 10 highly representative articles for each subject." (Fisas et al. 2015, p. 44)

"The Corpus includes 10,789 sentences, with an average of 269.7 sentences per document." (p. 45)

@@ -367,7 +367,7 @@ The annotation were done using BRAT Rapid Annotation Tool ([Stenetorp et al., 20

### Personal and Sensitive Information

\[More Information Needed\]
[More Information Needed]

## Considerations for Using the Data

@@ -384,7 +384,7 @@ of the different rhetorical aspects of scientific language (which we dub *scitor

"While the background claims and own claims are on average of similar length (85 and 87 characters, respectively), they are much longer than data components (average of 25 characters)."

"\[A\]nnotators identified an average of 141 connected component per publication...This indicates that either authors write very short argumentative chains or that our annotators had difficulties noticing long-range argumentative dependencies."
"[A]nnotators identified an average of 141 connected component per publication...This indicates that either authors write very short argumentative chains or that our annotators had difficulties noticing long-range argumentative dependencies."

(Lauscher et al. 2018, p.43)
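
The sciarg README hunk earlier in this file describes `contradicts` as a bi-directional, i.e. symmetric, relation, while `supports` is directed. A minimal sketch of what that symmetry implies for downstream processing follows. Relations are represented here as plain `(head, label, tail)` tuples and the function name is hypothetical; this is not how the dataset builder stores annotations.

```python
from typing import List, Tuple

Relation = Tuple[str, str, str]  # (head, label, tail); layout assumed for this sketch


def add_reversed_contradicts(relations: List[Relation]) -> List[Relation]:
    """Return the relations plus the reverse of every `contradicts` relation.

    Directed labels such as `supports` are left untouched, and a reverse
    relation is only added if it is not already present.
    """
    result = list(relations)
    seen = set(relations)
    for head, label, tail in relations:
        reverse = (tail, label, head)
        if label == "contradicts" and reverse not in seen:
            result.append(reverse)
            seen.add(reverse)
    return result


example = [("a", "supports", "b"), ("b", "contradicts", "c")]
expanded = add_reversed_contradicts(example)
print(expanded)
```

Applying the function a second time leaves the list unchanged, because the reversed relation is then already present.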

18 changes: 6 additions & 12 deletions src/pie_datasets/core/builder.py
@@ -176,12 +176,10 @@ def _generate_example_kwargs(
return None # pragma: no cover

@overload # type: ignore
def _convert_dataset_single(self, dataset: datasets.IterableDataset) -> IterableDataset:
...
def _convert_dataset_single(self, dataset: datasets.IterableDataset) -> IterableDataset: ...

@overload # type: ignore
def _convert_dataset_single(self, dataset: datasets.Dataset) -> Dataset:
...
def _convert_dataset_single(self, dataset: datasets.Dataset) -> Dataset: ...

def _convert_dataset_single(
self, dataset: Union[datasets.Dataset, datasets.IterableDataset]
@@ -204,22 +202,18 @@ def _convert_dataset_single(
return result

@overload # type: ignore
def _convert_datasets(self, datasets: datasets.DatasetDict) -> datasets.DatasetDict:
...
def _convert_datasets(self, datasets: datasets.DatasetDict) -> datasets.DatasetDict: ...

@overload # type: ignore
def _convert_datasets(
self, datasets: datasets.IterableDatasetDict
) -> datasets.IterableDatasetDict:
...
) -> datasets.IterableDatasetDict: ...

@overload # type: ignore
def _convert_datasets(self, datasets: datasets.IterableDataset) -> IterableDataset:
...
def _convert_datasets(self, datasets: datasets.IterableDataset) -> IterableDataset: ...

@overload # type: ignore
def _convert_datasets(self, datasets: datasets.Dataset) -> Dataset:
...
def _convert_datasets(self, datasets: datasets.Dataset) -> Dataset: ...

def _convert_datasets(
self,
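
The hunks in this file collapse each `@overload` stub so that the `...` body sits on the same line as the `def`. This is the layout that newer black releases produce for stub-only bodies, and pycodestyle reports a statement on the same line as `def` as E704, which is presumably why E704 was added to the flake8 `--ignore` list in `.pre-commit-config.yaml`. A minimal, self-contained sketch of the pattern, using a hypothetical `double` function rather than the builder methods:

```python
from typing import List, Union, overload


@overload
def double(value: int) -> int: ...


@overload
def double(value: List[int]) -> List[int]: ...


def double(value: Union[int, List[int]]) -> Union[int, List[int]]:
    """Double an int, or every element of a list of ints."""
    if isinstance(value, list):
        return [2 * item for item in value]
    return 2 * value


print(double(3))
print(double([1, 2]))
```

The overload stubs exist only for type checkers; at runtime only the final, undecorated definition is called.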
26 changes: 14 additions & 12 deletions src/pie_datasets/core/dataset.py
@@ -179,17 +179,15 @@ def dataset_to_document_type(
dataset: "Dataset",
document_type: Type[Document],
**kwargs,
) -> "Dataset":
...
) -> "Dataset": ...


@overload
def dataset_to_document_type(
dataset: "IterableDataset",
document_type: Type[Document],
**kwargs,
) -> "IterableDataset":
...
) -> "IterableDataset": ...


def dataset_to_document_type(
@@ -383,9 +381,11 @@ def map(
result_document_type: Optional[Type[Document]] = None,
) -> "Dataset":
dataset = super().map(
function=decorate_convert_to_dict_of_lists(function)
if as_documents and function is not None
else function,
function=(
decorate_convert_to_dict_of_lists(function)
if as_documents and function is not None
else function
),
with_indices=with_indices,
with_rank=with_rank,
input_columns=input_columns,
@@ -588,11 +588,13 @@ def map( # type: ignore
**kwargs,
) -> "IterableDataset":
dataset_mapped = super().map(
function=decorate_convert_to_document_and_back(
function, document_type=self.document_type, batched=batched
)
if as_documents and function is not None
else function,
function=(
decorate_convert_to_document_and_back(
function, document_type=self.document_type, batched=batched
)
if as_documents and function is not None
else function
),
batched=batched,
**kwargs,
)
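
Both `map` hunks in this file wrap a multi-line conditional expression in parentheses when it is passed as a keyword argument. The change is purely cosmetic: the parentheses do not alter evaluation, they only make the extent of the expression visible. A minimal sketch under stated assumptions follows, where `uppercase_result` and `apply` are hypothetical stand-ins for `decorate_convert_to_dict_of_lists` and `super().map(...)`:

```python
from typing import Callable, Optional


def uppercase_result(func: Callable[[str], str]) -> Callable[[str], str]:
    """Hypothetical decorator standing in for the real conversion wrapper."""

    def wrapper(text: str) -> str:
        return func(text).upper()

    return wrapper


def apply(text: str, function: Optional[Callable[[str], str]] = None) -> str:
    """Hypothetical stand-in for the parent class's map()."""
    return text if function is None else function(text)


def run(
    text: str,
    function: Optional[Callable[[str], str]] = None,
    decorate: bool = True,
) -> str:
    return apply(
        text,
        # Parenthesized conditional expression, as in the reformatted hunks.
        function=(
            uppercase_result(function)
            if decorate and function is not None
            else function
        ),
    )


print(run("a", function=lambda t: t + "b"))
print(run("a", function=lambda t: t + "b", decorate=False))
print(run("a"))
```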
2 changes: 1 addition & 1 deletion tests/dataset_builders/common.py
@@ -33,7 +33,7 @@ def _deep_compare(
if re.match(excluded_path, path):
return

if type(obj) != type(obj_expected):
if type(obj) is not type(obj_expected):
raise AssertionError(f"{path}: {obj} != {obj_expected}")
if isinstance(obj, (list, tuple)):
if len(obj) != len(obj_expected):
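
The change from `type(obj) != type(obj_expected)` to `type(obj) is not type(obj_expected)` in this hunk keeps the exact-type check but expresses it as an identity comparison, which is the form pycodestyle's E721 check asks for. It was likely prompted by the flake8 update in this commit, though the commit itself does not say so. A minimal sketch of why an exact-type check differs from `isinstance`, using a hypothetical helper `same_type`:

```python
def same_type(a, b) -> bool:
    """Exact type comparison by identity; subclasses do not count as equal."""
    return type(a) is type(b)


# bool is a subclass of int, so isinstance accepts it ...
print(isinstance(True, int))
# ... but the exact type check distinguishes the two.
print(same_type(True, 1))
print(same_type(1, 2))
```

For a deep comparison in tests, the exact check is the stricter choice: a `bool` where an `int` is expected is treated as a mismatch.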
8 changes: 5 additions & 3 deletions tests/dataset_builders/pie/sciarg/test_sciarg.py
@@ -842,9 +842,11 @@ def test_tokenize_documents_all(converted_dataset, tokenizer, dataset_variant):
tokenizer=tokenizer,
return_overflowing_tokens=True,
result_document_type=TOKENIZED_DOCUMENT_TYPE_MAPPING[type(doc)],
partition_layer="labeled_partitions"
if isinstance(doc, TextDocumentWithLabeledPartitions)
else None,
partition_layer=(
"labeled_partitions"
if isinstance(doc, TextDocumentWithLabeledPartitions)
else None
),
strict_span_conversion=strict_span_conversion,
verbose=True,
)
6 changes: 3 additions & 3 deletions tests/unit/builder/test_brat_builder.py
@@ -107,9 +107,9 @@ def hf_example(request) -> dict:

def test_generate_document(builder, hf_example):
kwargs = dict()
generated_document: Union[
BratDocument, BratDocumentWithMergedSpans
] = builder._generate_document(example=hf_example, **kwargs)
generated_document: Union[BratDocument, BratDocumentWithMergedSpans] = (
builder._generate_document(example=hf_example, **kwargs)
)

if hf_example == HF_EXAMPLES[0]:
assert len(generated_document.relations) == 0