Evaluation of the Code Generated By Large Language Models: The State of the Art

Research output: Chapter in Book/Conference proceeding › Conference contribution › peer-review

Abstract

The rapid development of Large Language Models (LLMs), such as ChatGPT and DeepSeek, has revolutionized software development, particularly in the domain of automated code generation. These models, built on architectures such as the Transformer, have demonstrated remarkable capabilities in generating human-like text and source code, significantly enhancing developer productivity and reducing development time. However, the widespread adoption of LLMs for code generation raises concerns about the reliability, quality, and potential risks of the generated code. This article surveys and analyzes the state of the art in evaluating LLM-generated code, summarizing research findings and application areas. It highlights the challenges of distinguishing machine-generated from human-written code, as well as the potential for LLMs to introduce security vulnerabilities and maintainability issues. We discuss the implications of these findings for both researchers and practitioners, emphasizing the need for continued research on the evaluation of LLM-generated code. Finally, we identify gaps in the literature and propose future research directions, such as the development of more robust benchmarks and improved evaluation metrics. By offering a thorough overview of the current landscape, this paper serves as a valuable resource for researchers and practitioners interested in the code-generation capabilities and limitations of LLMs. We also highlight the importance of ongoing evaluation and refinement of these models to ensure their safe and effective integration into software-development practices.
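The abstract refers to benchmarks and evaluation metrics for LLM-generated code. One widely used functional-correctness metric in this area (not necessarily the one emphasized in this paper) is pass@k: the probability that at least one of k sampled completions for a problem passes its unit tests. A minimal Python sketch of the standard unbiased estimator, with sample counts chosen purely for illustration:

    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased pass@k estimator: probability that at least one of k
        completions drawn from n generated samples (of which c pass the
        unit tests) is correct."""
        if n - c < k:
            return 1.0  # every possible draw of k contains a passing sample
        return 1.0 - comb(n - c, k) / comb(n, k)

    # Illustrative numbers only: 200 samples per problem, 37 pass the tests.
    print(pass_at_k(n=200, c=37, k=1))   # 0.185
    print(pass_at_k(n=200, c=37, k=10))  # ~0.88

The estimator averages over all problems in a benchmark; higher n gives a lower-variance estimate of pass@k for a fixed k.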

Original language: English
Title of host publication: Proceedings - 2025 IEEE 49th Annual Computers, Software, and Applications Conference, COMPSAC 2025
Editors: Hossain Shahriar, Kazi Shafiul Alam, Hiroyuki Ohsaki, Stelvio Cimato, Miriam Capretz, Shamem Ahmed, Sheikh Iqbal Ahamed, AKM Jahangir Alam Majumder, Munirul Haque, Tomoki Yoshihisa, Alfredo Cuzzocrea, Michiharu Takemoto, Nazmus Sakib, Marwa Elsayed
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 440-449
Number of pages: 10
ISBN (Electronic): 9798331574345
DOIs
Publication status: Published - 2025
Event: 49th IEEE Annual Computers, Software, and Applications Conference, COMPSAC 2025 - Toronto, Canada
Duration: 8 Jul 2025 - 11 Jul 2025

Publication series

Name: Proceedings - 2025 IEEE 49th Annual Computers, Software, and Applications Conference, COMPSAC 2025

Conference

Conference: 49th IEEE Annual Computers, Software, and Applications Conference, COMPSAC 2025
Country/Territory: Canada
City: Toronto
Period: 8/07/25 - 11/07/25

Free Keywords

  • Artificial intelligence
  • code error
  • code evaluation
  • code generation
  • large language model

ASJC Scopus subject areas

  • Computational Mathematics
  • Artificial Intelligence
  • Computer Networks and Communications
  • Computer Science Applications
  • Software
  • Media Technology
