Authors: Yadav, Ankit; Beniwal, Himanshu; Singh, Mayank
Date accessioned: 2025-08-31
Date available: 2025-08-31
Date issued: 2024-01-01
ISBN: 9798891761681
DOI: 10.18653/v1/2024.findings-emnlp.996
Scopus ID: 2-s2.0-85216394331
URI: http://repository.iitgn.ac.in/handle/IITG2025/28469
Abstract: Driven by the surge in code generation using large language models (LLMs), numerous benchmarks have emerged to evaluate these LLMs' capabilities. We conducted a large-scale human evaluation of HumanEval and MBPP, two popular benchmarks for Python code generation, analyzing their diversity and difficulty. Our findings unveil a critical bias towards a limited set of programming concepts, neglecting most other concepts entirely. Furthermore, we uncover a worrying prevalence of easy tasks that can inflate model performance estimates. To address these limitations, we propose a novel benchmark, PythonSaga, featuring 185 hand-crafted prompts with a balanced representation of 38 programming concepts across diverse difficulty levels. The robustness of our benchmark is demonstrated by the poor performance of existing Code-LLMs. The code and dataset are openly available to the NLP community at https://github.com/PythonSaga/PythonSaga.
Title: PythonSaga: Redefining the Benchmark to Evaluate Code Generating LLMs
Type: Conference Paper
Pages: 17113-17126
Year: 2024