RAG Benchmarking

To address the challenges of AI evaluation, pRAGyoga aims to create a comprehensive legal benchmarking dataset in India. This dataset will be used to test RAG (retrieval-augmented generation) systems on key metrics such as recall, accuracy, and possibly the cost-effectiveness of each RAG pipeline. Recall measures the system's ability to retrieve all relevant information, while accuracy assesses the correctness of the information retrieved.
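As an illustration of these two metrics, the minimal sketch below scores a single benchmark question, assuming each question stores the set of student-annotated "gold" chunk IDs and the set of chunk IDs a pipeline actually retrieved. The precision function is one common way to make retrieval accuracy concrete; all names and IDs are illustrative, not part of pRAGyoga's codebase.

```python
# Minimal sketch of the two retrieval metrics described above, assuming each
# benchmark item stores the annotated "gold" chunk IDs and the retrieved ones.

def recall(gold_chunks: set[str], retrieved_chunks: set[str]) -> float:
    """Fraction of annotated relevant chunks that the system retrieved."""
    if not gold_chunks:
        return 0.0
    return len(gold_chunks & retrieved_chunks) / len(gold_chunks)

def precision(gold_chunks: set[str], retrieved_chunks: set[str]) -> float:
    """Fraction of retrieved chunks that are actually relevant
    (one common way to make 'accuracy of retrieval' concrete)."""
    if not retrieved_chunks:
        return 0.0
    return len(gold_chunks & retrieved_chunks) / len(retrieved_chunks)

# Example: 2 of 3 annotated chunks retrieved; 2 of 5 retrieved chunks relevant.
gold = {"act_12_s3", "case_45_p7", "judg_9_p2"}
retrieved = {"act_12_s3", "case_45_p7", "act_99_s1", "case_10_p4", "judg_3_p8"}
print(recall(gold, retrieved))     # 0.666...
print(precision(gold, retrieved))  # 0.4
```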

The creation of this dataset will involve a diverse group of students from various disciplines, including law, social sciences, gender studies, data sciences, and humanities. This approach brings diverse and realistic questions that purely computational methods might overlook. Students can generate varied, real-world queries and provide detailed feedback, which enhances the system's reliability. Additionally, using a crowd-sourced method to create benchmarked data ensures that it reflects a wide range of perspectives, making it a strong standard for evaluating AI systems.

Implementation: The program has conducted pilots with a handful of students, and a scale-up across multiple institutions is planned. The students work through a diverse set of legal documents, including acts, case laws and judgements, and then follow these steps:

  • Using open-source annotation tools, the student volunteers generate questions from the provided documents and annotate the parts of the document containing the answers to their questions.
  • Students input their questions into the evaluation tool, which uses retrieval methods to pull relevant information from the documents in its knowledge base. Typically, the system retrieves five chunks of information for each query (a minimal retrieval sketch follows this list).
  • Students review the retrieved information to check for relevance. If the information is relevant, they highlight the relevant parts; if not, they add the necessary details in a free-text box.
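The sketch below illustrates the kind of retrieval step referenced in the list above, assuming a plain TF-IDF index over pre-chunked documents (via scikit-learn). The actual evaluation tool may use embeddings or a different retriever; the document snippets and function names here are placeholders.

```python
# A minimal sketch of top-k chunk retrieval over pre-chunked legal documents,
# using TF-IDF similarity as a stand-in for whatever retriever the tool uses.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

chunks = [
    "Section 3 of the Act defines the scope of application ...",
    "The court held that the appellant was entitled to relief ...",
    "The judgement clarifies the interpretation of Section 3 ...",
    # ... remaining chunks of the acts, case laws and judgements
]

vectorizer = TfidfVectorizer()
chunk_vectors = vectorizer.fit_transform(chunks)

def retrieve(query: str, k: int = 5) -> list[tuple[int, str]]:
    """Return the k chunks most similar to the student's question."""
    query_vector = vectorizer.transform([query])
    scores = cosine_similarity(query_vector, chunk_vectors)[0]
    top = scores.argsort()[::-1][:k]
    return [(int(i), chunks[i]) for i in top]

for idx, text in retrieve("What does Section 3 of the Act cover?"):
    print(idx, text[:60])
```

The retrieved chunks are then shown to the student, who marks each one as relevant or supplies the missing details in free text, as described in the last step above.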

Quality Control: Since individually vetting each student's annotations would be difficult as the program scales, the following measures were put in place (a sketch of the review workflow follows this list):

  • Peer-Review: Once the annotations are complete, different students review the work of their peers. They check if the questions make sense and if the answers match the questions. They can flag or edit the answers as needed.
  • Expert Review: Questions and answers flagged or edited by students undergo a final review by experts, who either confirm the annotations or make further edits to ensure accuracy.
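One way to track items as they move through these two review passes is a simple status field on each benchmark record, as in the sketch below. The field names and statuses are assumptions for illustration, not the program's actual schema.

```python
# Sketch of a benchmark record moving through peer review and expert review.
from dataclasses import dataclass, field
from enum import Enum

class ReviewStatus(Enum):
    PENDING = "pending"                    # annotated, awaiting peer review
    PEER_APPROVED = "peer_approved"        # passed peer review unchanged
    FLAGGED = "flagged"                    # flagged/edited by a peer -> expert review
    EXPERT_APPROVED = "expert_approved"    # confirmed or corrected by an expert

@dataclass
class BenchmarkItem:
    question: str
    answer_spans: list[str]                # annotated passages containing the answer
    annotator_id: str
    status: ReviewStatus = ReviewStatus.PENDING
    review_notes: list[str] = field(default_factory=list)

def peer_review(item: BenchmarkItem, ok: bool, note: str = "") -> None:
    """Peers either approve the item or flag it for expert review."""
    if ok:
        item.status = ReviewStatus.PEER_APPROVED
    else:
        item.status = ReviewStatus.FLAGGED
        item.review_notes.append(note)
```

Only items that end up flagged or edited need the more expensive expert pass, which keeps the expert workload manageable as the volume of annotations grows.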

Results of the pilot: Initial pilots showed that students could create a diverse and relevant set of questions and answers, compared with the more straightforward, pointed questions generated by automated evaluation frameworks. The exercise has been valuable in highlighting areas where the bot performs well and identifying aspects that need improvement.

Limitations: While this exercise is beneficial, it is time-consuming and depends heavily on the quality of student annotations. Comparing student-generated answers with Jugalbandi's responses relies on additional LLM evaluations, which may introduce some subjectivity, since one LLM is used to evaluate another LLM's performance.
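For illustration, the LLM-based comparison mentioned above could look roughly like the sketch below, where a judge model grades the system's answer against the student-annotated reference. The prompt, model name and grading scale are assumptions, not the program's actual evaluation setup.

```python
# Hedged sketch of an LLM-as-judge comparison between the student-annotated
# reference answer and the RAG system's answer (OpenAI Python SDK v1+).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge(question: str, gold_answer: str, system_answer: str) -> str:
    prompt = (
        "You are grading a legal question-answering system.\n"
        f"Question: {question}\n"
        f"Reference answer (student-annotated): {gold_answer}\n"
        f"System answer: {system_answer}\n"
        "Reply with one word: CORRECT, PARTIAL or INCORRECT, "
        "followed by one sentence of justification."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any capable judge model could be used
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content
```

Pinning the temperature to 0 and constraining the output format makes repeated judgements more consistent, but it does not remove the subjectivity noted above.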
