PythonSaga: Redefining the Benchmark to Evaluate Code Generating LLMs

Authors: Yadav, Ankit and Beniwal, Himanshu and Singh, Mayank

Abstract:

Driven by the surge in code generation using large language models (LLMs), numerous benchmarks have emerged to evaluate these LLMs' capabilities. We conducted a large-scale human evaluation of HumanEval and MBPP, two popular benchmarks for Python code generation, analyzing their diversity and difficulty. Our findings reveal a critical bias towards a limited set of programming concepts, with most other concepts neglected entirely. Furthermore, we uncover a worrying prevalence of easy tasks that can inflate estimates of model performance. To address these limitations, we propose a novel benchmark, PythonSaga, featuring 185 hand-crafted prompts in a balanced representation of 38 programming concepts across diverse difficulty levels. The robustness of our benchmark is demonstrated by the poor performance of existing Code-LLMs. The code and dataset are openly available to the NLP community at this URL.

Link: Read Paper

Labels: code generation, program synthesis, benchmark
