PythonSaga: Redefining the Benchmark to Evaluate Code Generating LLMs

Source
Findings of EMNLP 2024 (2024 Conference on Empirical Methods in Natural Language Processing)
Date Issued
2024-01-01
Author(s)
Yadav, Ankit
Beniwal, Himanshu
Singh, Mayank  
DOI
10.18653/v1/2024.findings-emnlp.996
Abstract
Driven by the surge in code generation using large language models (LLMs), numerous benchmarks have emerged to evaluate these LLMs' capabilities. We conducted a large-scale human evaluation of HumanEval and MBPP, two popular benchmarks for Python code generation, analyzing their diversity and difficulty. Our findings reveal a critical bias towards a limited set of programming concepts, while most other concepts are neglected entirely. Furthermore, we uncover a worrying prevalence of easy tasks that can inflate model performance estimates. To address these limitations, we propose a novel benchmark, PythonSaga, featuring 185 hand-crafted prompts with a balanced representation of 38 programming concepts across diverse difficulty levels. The robustness of our benchmark is demonstrated by the poor performance of existing Code-LLMs. The code and dataset are openly available to the NLP community at https://github.com/PythonSaga/PythonSaga.
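The snippet below is a minimal sketch, not the authors' evaluation code: it only illustrates how a Code-LLM could be scored on a set of hand-crafted prompts in a pass@1 style, with each generated program executed against unit tests. The task schema (prompt/tests fields), the run_candidate and pass_at_1 helpers, and the dummy_model stub are all hypothetical names introduced for illustration; the official benchmark data and evaluation scripts are in the repository linked above.

import os
import subprocess
import sys
import tempfile

def run_candidate(candidate_code: str, test_code: str, timeout: float = 10.0) -> bool:
    """Run one generated solution followed by its unit tests in a separate
    interpreter process; a zero exit status counts as a pass."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_code + "\n\n" + test_code)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path],
                              capture_output=True, timeout=timeout)
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(path)

def pass_at_1(tasks: dict, generate) -> float:
    """tasks: {task_id: {"prompt": str, "tests": str}} (hypothetical schema).
    generate: callable mapping a prompt string to one candidate program."""
    passed = sum(
        run_candidate(generate(task["prompt"]), task["tests"])
        for task in tasks.values()
    )
    return passed / max(len(tasks), 1)

if __name__ == "__main__":
    # Toy task plus a dummy "model" that returns a hard-coded solution,
    # standing in for a real Code-LLM generation call.
    toy_tasks = {
        "demo/0": {
            "prompt": "Write a function add(a, b) that returns a + b.",
            "tests": "assert add(2, 3) == 5\nassert add(-1, 1) == 0",
        }
    }
    dummy_model = lambda prompt: "def add(a, b):\n    return a + b"
    print(f"pass@1 = {pass_at_1(toy_tasks, dummy_model):.2f}")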
URI
https://d8.irins.org/handle/IITG2025/28469