Research QA Pairs Pipeline
-
Question: Research and design a reliable and efficient pipeline for processing Question-Answer (Q&A) pairs. The objective is to:
- Identify tools and solutions suitable for implementing the pipeline, aligning with previous data extraction methods.
- Extract questions (Qs) and answers (As), perform matching, and prepare the data for training purposes.
- Provide brief descriptions of the selected solution(s), citing relevant reference sources such as papers or articles.
- Optionally, draft an architecture for the pipeline; a detailed architecture design is not mandatory, given that the pipeline may still evolve.
- Utilize tools under permissive licenses without reinventing existing solutions.
-
Results:
We are trying to enhance our pipeline so that we can extract question/answer pairs from the gathered data, which can then be used to fine-tune our LLM.
- The most promising existing option appears to be the lm-question-generation (lmqg) package. It uses T5-based QG models for automatic question and answer generation from input text. For now, we will use this option for our question-answer generation (QAG); see the first sketch after this list.
- Another option would be to use an existing LLM with a specific prompt to extract questions and answers. As in QA_enricher.py, one could design a prompt such as
f"Write a question for the following input: '{input}'"
where input corresponds to a text chunk of a predefined size taken from the gathered data. A possible problem is the model size overhead: an LLM such as gemma-2b has 2.51B parameters (gemma-2b), whereas the largest QAG model, t5-large, has only 738M parameters and t5-small only 60.5M. However, it might still be useful to try a QAG scheme with an existing LLM in order to improve our initial QA dataset; see the second sketch after this list.
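First sketch (for the lmqg option above): a minimal, hedged example of end-to-end QAG using the TransformersQG interface from the lmqg package. The checkpoint name lmqg/t5-base-squad-qag and the example context are assumptions and should be adapted to our data.

```python
from lmqg import TransformersQG

# End-to-end QAG model; the exact checkpoint is an assumption and can be
# swapped for a larger or smaller T5 variant depending on available resources.
model = TransformersQG(language="en", model="lmqg/t5-base-squad-qag")

# Placeholder context standing in for a chunk of our gathered data.
context = (
    "William Turner was an English painter who specialised in watercolour "
    "landscapes. He is often known as William Turner of Oxford."
)

# generate_qa returns a list of (question, answer) tuples for the context.
qa_pairs = model.generate_qa(context)
for question, answer in qa_pairs:
    print(f"Q: {question}\nA: {answer}")
```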
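Second sketch (for the prompt-based option above): chunk the gathered text and ask an instruction-following LLM to write one question per chunk via the Hugging Face transformers pipeline. The chunk size, the google/gemma-2b checkpoint, and the prompt wording (borrowed from the idea in QA_enricher.py) are assumptions, not a fixed design.

```python
from transformers import pipeline

# Assumed chunk size; should match whatever chunking the data-gathering step uses.
CHUNK_SIZE = 512

def chunk_text(text: str, size: int = CHUNK_SIZE) -> list[str]:
    """Split gathered text into fixed-size character chunks."""
    return [text[i:i + size] for i in range(0, len(text), size)]

# Any instruction-following model could be substituted here; gemma-2b is used
# only because it is mentioned above (it requires accepting the Gemma license
# on the Hugging Face Hub before the weights can be downloaded).
generator = pipeline("text-generation", model="google/gemma-2b")

def generate_question(chunk: str) -> str:
    """Ask the LLM to write a question for one text chunk."""
    prompt = f"Write a question for the following input: '{chunk}'"
    output = generator(prompt, max_new_tokens=64, do_sample=False)[0]["generated_text"]
    # The pipeline echoes the prompt, so keep only the newly generated part.
    return output[len(prompt):].strip()
```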
- Other resources that might be useful:
There are two formats for fine-tuning a Q&A LLM:
- Conversation format:
{"messages": [{"role": "system", "content": "You are..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
- Instruction format:
{"prompt": "...prompt text...", "completion": "...generated text..."}
We should research these two formats further and see which one works best for our project; however, converting between the two formats should be straightforward (a sketch follows below).
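As a minimal sketch of that conversion (instruction format to conversation format), assuming JSONL files and a placeholder system prompt:

```python
import json

# Placeholder system prompt; the real one should describe our assistant.
SYSTEM_PROMPT = "You are a helpful assistant."

def instruction_to_conversation(record: dict) -> dict:
    """Map an instruction-format record to the conversation format."""
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": record["prompt"]},
            {"role": "assistant", "content": record["completion"]},
        ]
    }

def convert_jsonl(src_path: str, dst_path: str) -> None:
    """Rewrite a JSONL dataset from instruction to conversation format."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            record = json.loads(line)
            dst.write(json.dumps(instruction_to_conversation(record)) + "\n")
```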
The lmqg package provides a straightforward and lightweight resource for generating both questions and answers from input text. It will serve as our first approach for QAG; later, LLMs could be used for this task as well, as they might provide better results, but they will be more resource-intensive. For augmenting the question/answer dataset, existing LLMs such as Gemma will be used, although lmqg might be feasible for this task as well. Which resource to use will be determined by trying and testing the different options.
-
LMQG: Please cite the following papers if you use any of the lmqg resources, and see the linked code if you need to reproduce the models.
Generative Language Models for Paragraph-Level Question Generation, EMNLP 2022 Main: the QG models (code to reproduce the experiments).
@inproceedings{ushio-etal-2022-generative,
    title = "{G}enerative {L}anguage {M}odels for {P}aragraph-{L}evel {Q}uestion {G}eneration",
    author = "Ushio, Asahi and Alva-Manchego, Fernando and Camacho-Collados, Jose",
    booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, U.A.E.",
    publisher = "Association for Computational Linguistics",
}
An Empirical Comparison of LM-based Question and Answer Generation Methods, ACL 2023 Findings: the QAG models (code to reproduce the experiments).
@inproceedings{ushio-etal-2023-an-empirical,
    title = "An Empirical Comparison of LM-based Question and Answer Generation Methods",
    author = "Ushio, Asahi and Alva-Manchego, Fernando and Camacho-Collados, Jose",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics: Findings",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
}
A Practical Toolkit for Multilingual Question and Answer Generation, ACL 2023 System Demonstration: the library and demo (code to reproduce the experiments).
@inproceedings{ushio-etal-2023-a-practical-toolkit,
    title = "A Practical Toolkit for Multilingual Question and Answer Generation",
    author = "Ushio, Asahi and Alva-Manchego, Fernando and Camacho-Collados, Jose",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics: System Demonstrations",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
}
- Link to Original Issue: Research Q&A Pairs Pipeline Issue #27
- Original Assignee: Daniel Schaeffer