
Create dataset loader for XQuAD-R #593

Closed · SamuelCahyawijaya opened this issue Apr 1, 2024 · 6 comments · Fixed by #601
Labels: pr-ready (A PR that closes this issue is ready to be reviewed)

Comments

@SamuelCahyawijaya (Collaborator) commented Apr 1, 2024

Dataloader name: xquadr/xquadr.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?xquadr

Dataset: xquadr
Description: XQuAD-R is a retrieval version of the XQuAD dataset (a cross-lingual extractive QA dataset) that is part of the LAReQA benchmark. Like XQuAD, XQuAD-R is an 11-way parallel dataset, where each question (out of around 1200) appears in 11 different languages and has 11 parallel correct answers across the languages. It is designed to include parallel QA pairs across languages, allowing questions to be matched with answers from different languages. The span-tagging task in XQuAD is converted into a retrieval task by breaking up each contextual paragraph into sentences and treating each sentence as a possible target answer. There are around 1000 candidate answers in each language.
Subsets: -
Languages: tha, vie
Tasks: Text Retrieval
License: Creative Commons Attribution Share Alike 4.0 (cc-by-sa-4.0)
Homepage: https://github.com/google-research-datasets/lareqa
HF URL: -
Paper URL: https://aclanthology.org/2020.emnlp-main.477.pdf
@SamuelCahyawijaya converted this from a draft issue on Apr 1, 2024
@akhdanfadh (Collaborator) commented:

#self-assign

@akhdanfadh (Collaborator) commented:

From the homepage:

Note that files contained in this repository for XQuAD-R are simply the original XQuAD data annotated with sentence boundaries for each of the paragraphs, added as an additional field in the jsons.

The additional fields are given like this (taken from the English subset):

"sentence_breaks": [
  [
      0,    -> sentence's str start index
      165   -> sentence's str end index
  ],
  ...
],
"sentences": [
  "The Panthers defense gave up just 308 points, ranking sixth in the league, while also leading the NFL in interceptions with 24 and boasting four Pro Bowl selections.",  
  ├─> 1st sentence_break corresponds to 1st sentence
  ...
]
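To make the relationship concrete, here is a small sketch in Python; it assumes the spans are half-open [start, end) character offsets into context, which matches the examples but is an assumption rather than documented behavior:

# Sketch: each sentence_breaks span should slice the paragraph's context
# into the sentence at the same position in the sentences list.
# Assumes half-open [start, end) offsets.
def check_sentence_breaks(paragraph):
    context = paragraph["context"]
    for (start, end), sentence in zip(
        paragraph["sentence_breaks"], paragraph["sentences"]
    ):
        assert context[start:end] == sentence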

Given that the XQuAD dataloader is already implemented and this one focuses on retrieval, the task should only be Text Retrieval and the new dataloader should focus on those additional fields, IMO.

How should I approach the mapping to the pairs schema? I am also quite confused about why the text retrieval task uses that schema...

@holylovenia @SamuelCahyawijaya @sabilmakbar

@akhdanfadh added the question label (Further information is requested) on Apr 1, 2024
@holylovenia (Contributor) commented:

Given that the XQuAD dataloader is already implemented and this one focuses on retrieval, the task should only be Text Retrieval and the new dataloader should focus on those additional fields, IMO.

I agree with you, @akhdanfadh. I've updated the issue ticket and the datasheet accordingly.

How should I approach the mapping to the pairs schema? I am also quite confused about why the text retrieval task uses that schema...

Because the existing text retrieval datasets commonly have a pair of texts and a label determining whether the pair is positive or negative. However, XQuAD-R is a QA retrieval task, which is a bit different.
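For reference, a single instance under such a pairs schema might look like the following; the field names here are illustrative, not the actual schema definition:

# Hypothetical pairs-style instance: a (question, candidate sentence) pair
# plus a binary label marking whether the sentence answers the question.
pair_example = {
    "id": "56beb7953aeaaa14008c92ab_0",  # hypothetical composite id
    "text_1": "Who lost to the Broncos in the divisional round?",
    "text_2": "The Broncos defeated the Pittsburgh Steelers in the divisional "
              "round, 23–16, by scoring 11 points in the final three minutes "
              "of the game.",
    "label": 1,  # 1 = positive pair, 0 = negative pair
}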

Based on Section 3.1 of the paper:

Specifically, we break each contextual paragraph into sentences, and include all sentences across the dataset as candidate answers. A sentence is considered a correct answer to a question if it contains the target answer span for either that question or an equivalent question in another language (as identified by qas id).

And here is an example data instance from the dataset:

{
      "context": "The Broncos defeated the Pittsburgh Steelers in the divisional round, 23–16, by scoring 11 points in the final three minutes of the game. They then beat the defending Super Bowl XLIX champion New England Patriots in the AFC Championship Game, 20–18, by intercepting a pass on New England's 2-point conversion attempt with 17 seconds left on the clock. Despite Manning's problems with interceptions during the season, he didn't throw any in their two playoff games.",
      "qas": [
        {
          "answers": [
            {
              "answer_start": 25,
              "text": "Pittsburgh Steelers"
            }
          ],
          "id": "56beb7953aeaaa14008c92ab",
          "question": "Who lost to the Broncos in the divisional round?"
        },
        {
          "answers": [
            {
              "answer_start": 88,
              "text": "11"
            }
          ],
          "id": "56beb7953aeaaa14008c92ac",
          "question": "How many points did the Broncos score in the last three minutes of the game versus Pittsburgh?"
        },
        {
          "answers": [
            {
              "answer_start": 192,
              "text": "New England Patriots"
            }
          ],
          "id": "56beb7953aeaaa14008c92ad",
          "question": "Who won Super Bowl XLIX?"
        },
        {
          "answers": [
            {
              "answer_start": 243,
              "text": "20–18"
            }
          ],
          "id": "56beb7953aeaaa14008c92ae",
          "question": "What was the final score of the AFC Championship Game?"
        },
        {
          "answers": [
            {
              "answer_start": 322,
              "text": "17 seconds"
            }
          ],
          "id": "56beb7953aeaaa14008c92af",
          "question": "How much time remained on the clock when the Broncos made the interception that clinched the AFC Championship Game?"
        },
        {
          "answers": [
            {
              "answer_start": 4,
              "text": "Broncos"
            }
          ],
          "id": "56bf36b93aeaaa14008c9561",
          "question": "What team was the divisional round winner between the Broncos and Steelers?"
        },
        {
          "answers": [
            {
              "answer_start": 70,
              "text": "23–16"
            }
          ],
          "id": "56bf36b93aeaaa14008c9562",
          "question": "What was the final score of the game between the Broncos and Steelers?"
        },
        {
          "answers": [
            {
              "answer_start": 192,
              "text": "New England Patriots"
            }
          ],
          "id": "56bf36b93aeaaa14008c9563",
          "question": "Who won Super Bowl XLIX?"
        },
        {
          "answers": [
            {
              "answer_start": 322,
              "text": "17"
            }
          ],
          "id": "56bf36b93aeaaa14008c9564",
          "question": "How many seconds were left in the game when the Broncos intercepted the pass that won the game?"
        },
        {
          "answers": [
            {
              "answer_start": 360,
              "text": "Manning"
            }
          ],
          "id": "56bf36b93aeaaa14008c9565",
          "question": "During the Bronco's playoff games, who did not throw at all?"
        },
        {
          "answers": [
            {
              "answer_start": 25,
              "text": "Pittsburgh Steelers"
            }
          ],
          "id": "56d7018a0d65d214001982c2",
          "question": "Who did the Broncos beat in the divisional game?"
        },
        {
          "answers": [
            {
              "answer_start": 88,
              "text": "11"
            }
          ],
          "id": "56d7018a0d65d214001982c3",
          "question": "How many points did the Broncos score in the final three minutes of the Pittsburgh game?"
        },
        {
          "answers": [
            {
              "answer_start": 192,
              "text": "New England Patriots"
            }
          ],
          "id": "56d7018a0d65d214001982c5",
          "question": "Who did the Broncos defeat in the AFC Championship game?"
        },
        {
          "answers": [
            {
              "answer_start": 25,
              "text": "Pittsburgh Steelers"
            }
          ],
          "id": "56d99f99dc89441400fdb628",
          "question": "Who did the Broncos beat to win their division in 2015?"
        },
        {
          "answers": [
            {
              "answer_start": 192,
              "text": "New England Patriots"
            }
          ],
          "id": "56d99f99dc89441400fdb629",
          "question": "Who did the Broncos beat tp become the AFC champions?"
        },
        {
          "answers": [
            {
              "answer_start": 322,
              "text": "17"
            }
          ],
          "id": "56d99f99dc89441400fdb62c",
          "question": "How many seconds were left in the game when the Patriots failed their 2-point conversion?"
        }
      ],
      "sentence_breaks": [
        [
          0,
          137
        ],
        [
          138,
          351
        ],
        [
          352,
          464
        ]
      ],
      "sentences": [
        "The Broncos defeated the Pittsburgh Steelers in the divisional round, 23–16, by scoring 11 points in the final three minutes of the game.",
        "They then beat the defending Super Bowl XLIX champion New England Patriots in the AFC Championship Game, 20–18, by intercepting a pass on New England's 2-point conversion attempt with 17 seconds left on the clock.",
        "Despite Manning's problems with interceptions during the season, he didn't throw any in their two playoff games."
      ]
    }

Because of this, in XQuAD-R's dataloader, it seems that for each data instance:

  • context is context
  • question is taken from one of the questions
  • answer is taken from the sentence whose sentence_breaks span covers the corresponding question's answer_start
  • meta can cover the answer_start, text, answer_id, etc.

In this case, it looks like the qa schema would be more suitable?
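Something along these lines, for instance; the helper and the exact field names below are just for illustration, not the actual seacrowd qa schema:

# Sketch: build one qa-style example from a paragraph and one of its qas.
def to_qa_example(paragraph, qa):
    answer_start = qa["answers"][0]["answer_start"]
    # Pick the sentence whose [start, end) span covers answer_start.
    answer_sentence = next(
        sentence
        for (start, end), sentence in zip(
            paragraph["sentence_breaks"], paragraph["sentences"]
        )
        if start <= answer_start < end
    )
    return {
        "id": qa["id"],
        "question": qa["question"],
        "context": paragraph["context"],
        "answer": [answer_sentence],
        "meta": {
            "answer_start": answer_start,
            "answer_text": qa["answers"][0]["text"],
        },
    }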

CMIIW.

@akhdanfadh (Collaborator) commented Apr 2, 2024

@holylovenia

answer is taken from one of the sentences

In that case, instead of using text for the answer, we use the sentence, since this is about retrieval, right? And that will be the main difference from the xquad dataloader.

Aight, implementing this now by adding Tasks.QUESTION_ANSWERING_RETRIEVAL.

@akhdanfadh added the in-progress label (Assignee has given confirmation on progress and ETA) and removed the question label on Apr 2, 2024
@akhdanfadh added the pr-ready label (A PR that closes this issue is ready to be reviewed) and removed the in-progress label on Apr 2, 2024
@sabilmakbar (Collaborator) commented:

I think the answer field should be the actual answer provided in the text field, while the corresponding sentences that contain the answer can be put in meta, along with all the candidates, for the SEACrowd schema implementation. Wdyt?

@akhdanfadh (Collaborator) commented:

@sabilmakbar I think that would go against the main idea of the dataset, as it is more about the retrieval part. This dataset is also meant as a benchmark, not a source of ground-truth text. Quoting the GitHub homepage:

As part of the LAReQA benchmark, <...> We release XQuAD with sentence breaks in this repository for use as XQuAD-R. <...> Note that files contained in this repository for XQuAD-R are simply the original XQuAD data annotated with sentence boundaries for each of the paragraphs, added as an additional field in the jsons.

@holylovenia also quoted the relevant passage from the paper. Based on this, we can think of the dataset as a kind of "multiple choice" task where the choices are all the sentences (though strictly speaking it isn't one).

Specifically, we break each contextual paragraph into sentences, and include all sentences across the dataset as candidate answers. A sentence is considered a correct answer to a question if it contains the target answer span for either that question or an equivalent question in another language (as identified by qas id).

Overall, we can still put the ground-truth answer text in the meta fields, though, so no important data is being neglected here.
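For concreteness, the per-language candidate pool from the quoted paragraph could be built with something like this; it assumes the SQuAD-style {"data": [{"paragraphs": [...]}]} nesting that XQuAD uses, and the filename is a hypothetical local path:

import json

# Sketch: gather every sentence in one language file as a candidate answer
# (the paper reports ~1000 candidates per language).
with open("en.json", encoding="utf-8") as f:  # hypothetical local filename
    dataset = json.load(f)

candidates = [
    sentence
    for article in dataset["data"]
    for paragraph in article["paragraphs"]
    for sentence in paragraph["sentences"]
]
print(len(candidates))  # roughly 1000 candidates for this language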
