add sciarg dataset #61

Merged: 36 commits merged into main from add_sciarg_dataset on Dec 11, 2023
Conversation

ArneBinder
Owner

@ArneBinder ArneBinder commented Nov 22, 2023

In this PR, we add SciArg, an argumentation-mining dataset, as described in #10.

Note: This requires pie-dataset>=0.4.0 (in detail: #57, #58, #59, #60).

TODO:

  • complete tests
  • complete PIE dataset card
  • edit HF dataset card, if needed: see https://huggingface.co/datasets/DFKI-SLT/brat/commit/bb8c37d84ddf2da1e691d226c55fef48fd8149b5
  • maybe re-think/adjust how parts_of_same relations and the dataset variant name=merge_fragmented_spans play together @ArneBinder
    • overwrite configs with a single default one, but with parameter merge_fragmented_spans=true (and set DOCUMENT_CONVERTERS instead of overwriting document_converters)
    • [ ] add entry for ...With LabeledMultiSpan... to DOCUMENT_CONVERTERS (will be added in a follow-up PR)
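
To illustrate the interplay mentioned in the TODO above, here is a minimal sketch of how spans linked by parts_of_same relations could be collapsed into one covering span each. It is not the builder's actual implementation: the tuple-based span representation and the function name merge_parts_of_same are assumptions made for this example only.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical plain representation: (start, end, label). Not a PIE annotation type.
Span = Tuple[int, int, str]


def merge_parts_of_same(
    spans: List[Span], parts_of_same: List[Tuple[int, int]]
) -> List[Span]:
    """Merge spans connected via parts_of_same relations into covering spans.

    Relations are given as pairs of indices into `spans` (an assumption).
    """
    # Union-find over span indices, so chains of relations end up in one group.
    parent = list(range(len(spans)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for head, tail in parts_of_same:
        root_head, root_tail = find(head), find(tail)
        if root_head != root_tail:
            parent[root_tail] = root_head

    groups: Dict[int, List[Span]] = defaultdict(list)
    for idx, span in enumerate(spans):
        groups[find(idx)].append(span)

    merged: List[Span] = []
    for members in groups.values():
        labels = {label for _, _, label in members}
        if len(labels) != 1:
            raise ValueError(f"linked spans disagree on label: {sorted(labels)}")
        start = min(s for s, _, _ in members)
        end = max(e for _, e, _ in members)
        merged.append((start, end, labels.pop()))
    return sorted(merged)


if __name__ == "__main__":
    example_spans = [(0, 5, "own_claim"), (10, 15, "own_claim"), (20, 30, "data")]
    print(merge_parts_of_same(example_spans, parts_of_same=[(0, 1)]))
```

Note that merging this way also covers the text between the fragments, which is one reason a multi-span representation may be preferable for some use cases.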


codecov bot commented Nov 22, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (e8d556b) 94.57% compared to head (8011ece) 94.72%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #61      +/-   ##
==========================================
+ Coverage   94.57%   94.72%   +0.14%     
==========================================
  Files          18       19       +1     
  Lines        1272     1289      +17     
==========================================
+ Hits         1203     1221      +18     
+ Misses         69       68       -1     


@idalr idalr changed the title from "sdd sciarg dataset" to "add sciarg dataset" Nov 23, 2023
@ArneBinder ArneBinder mentioned this pull request Nov 25, 2023
4 tasks
@ArneBinder ArneBinder force-pushed the add_sciarg_dataset branch 4 times, most recently from f76c85b to 4cd40ce Compare November 27, 2023 11:07
@idalr idalr force-pushed the add_sciarg_dataset branch from 3039049 to 76fda93 Compare December 7, 2023 18:18
@idalr
Collaborator

idalr commented Dec 8, 2023

Regarding label counts, there are discrepancies between

  • what's reported in Lauscher et al. (2018, p. 43),
  • the previous report in pie-document-level (here; labels counted in TextDocumentWith...), and
  • the current HF dataset card (labels counted in BratDocuments).

These are possibly caused by differences in label assignment during the document extraction and/or conversion processes.
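
One way to locate such discrepancies is to count labels per source and list only the labels whose counts differ. The following is a rough sketch with toy inputs; the function names and label lists are made up for illustration and do not reflect the actual counts from any of the three sources.

```python
from collections import Counter
from typing import Dict, Iterable


def count_labels(labels: Iterable[str]) -> Counter:
    """Count how often each label occurs in one source."""
    return Counter(labels)


def label_count_differences(a: Counter, b: Counter) -> Dict[str, int]:
    """Return a[label] - b[label] for every label whose counts differ."""
    return {
        label: a[label] - b[label]
        for label in sorted(set(a) | set(b))
        if a[label] != b[label]
    }


if __name__ == "__main__":
    # Toy data only, not real SciArg statistics.
    from_brat_documents = count_labels(["own_claim", "own_claim", "data"])
    from_converted_documents = count_labels(["own_claim", "data", "data"])
    print(label_count_differences(from_brat_documents, from_converted_documents))
```

Running the same comparison once per document, instead of over the whole corpus, would narrow down where in extraction or conversion the label assignment diverges.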

Owner Author

@ArneBinder ArneBinder left a comment

The dataset cards got a bit mixed up, see below. In short: we do not have a HF sciarg dataset, so we cannot have a HF dataset card for it.

EDIT: Please also move the content of your comment above to a fitting location in the PIE dataset card.

dataset_builders/hf/sciarg/README.md (3 review comments, outdated and resolved)
dataset_builders/pie/sciarg/README.md (2 review comments, outdated and resolved)
@ArneBinder ArneBinder merged commit e911ec2 into main Dec 11, 2023
4 checks passed
@ArneBinder ArneBinder deleted the add_sciarg_dataset branch December 11, 2023 10:04