TY - GEN
T1 - Evaluation of the Code Generated By Large Language Models
T2 - 49th IEEE Annual Computers, Software, and Applications Conference, COMPSAC 2025
AU - Ying, Zhihao
AU - Towey, Dave
AU - Zhang, Yifan
N1 - Publisher Copyright:
© 2025 IEEE.
PY - 2025
Y1 - 2025
N2 - The rapid development of Large Language Models (LLMs), such as ChatGPT and DeepSeek, has revolutionized software development, particularly in the domain of automated code generation. These models, built on architectures like the Transformer, have demonstrated remarkable capabilities in generating human-like text and source code, significantly enhancing developer productivity and reducing development time. However, the widespread adoption of LLMs for code generation raises concerns regarding the reliability, quality, and potential risks associated with the generated code. This article illustrates and analyzes the state of the art in evaluating LLM-generated code, summarizing research findings and application areas. This paper highlights the challenges in distinguishing between machine-generated and human-written code, as well as the potential for LLMs to introduce security vulnerabilities and maintainability issues. We discuss the implications of these findings for both researchers and practitioners, emphasizing the need for continued research in the evaluation of LLM-generated code. Finally, we identify gaps in the literature and propose future research directions, such as the development of more robust benchmarks and improved evaluation metrics. By providing a thorough overview of the current landscape, this paper provides a valuable resource for researchers and practitioners interested in LLMs' code-generation capabilities and limitations. We also highlight the importance of ongoing evaluation and refinement of these models to ensure their safe and effective integration into software-development practices.
AB - The rapid development of Large Language Models (LLMs), such as ChatGPT and DeepSeek, has revolutionized software development, particularly in the domain of automated code generation. These models, built on architectures like the Transformer, have demonstrated remarkable capabilities in generating human-like text and source code, significantly enhancing developer productivity and reducing development time. However, the widespread adoption of LLMs for code generation raises concerns regarding the reliability, quality, and potential risks associated with the generated code. This article illustrates and analyzes the state of the art in evaluating LLM-generated code, summarizing research findings and application areas. This paper highlights the challenges in distinguishing between machine-generated and human-written code, as well as the potential for LLMs to introduce security vulnerabilities and maintainability issues. We discuss the implications of these findings for both researchers and practitioners, emphasizing the need for continued research in the evaluation of LLM-generated code. Finally, we identify gaps in the literature and propose future research directions, such as the development of more robust benchmarks and improved evaluation metrics. By providing a thorough overview of the current landscape, this paper provides a valuable resource for researchers and practitioners interested in LLMs' code-generation capabilities and limitations. We also highlight the importance of ongoing evaluation and refinement of these models to ensure their safe and effective integration into software-development practices.
KW - Artificial intelligence
KW - code error
KW - code evaluation
KW - code generation
KW - large language model
UR - https://www.scopus.com/pages/publications/105016153409
U2 - 10.1109/COMPSAC65507.2025.00064
DO - 10.1109/COMPSAC65507.2025.00064
M3 - Conference contribution
AN - SCOPUS:105016153409
T3 - Proceedings - 2025 IEEE 49th Annual Computers, Software, and Applications Conference, COMPSAC 2025
SP - 440
EP - 449
BT - Proceedings - 2025 IEEE 49th Annual Computers, Software, and Applications Conference, COMPSAC 2025
A2 - Shahriar, Hossain
A2 - Alam, Kazi Shafiul
A2 - Ohsaki, Hiroyuki
A2 - Cimato, Stelvio
A2 - Capretz, Miriam
A2 - Ahmed, Shamem
A2 - Ahamed, Sheikh Iqbal
A2 - Majumder, AKM Jahangir Alam
A2 - Haque, Munirul
A2 - Yoshihisa, Tomoki
A2 - Cuzzocrea, Alfredo
A2 - Takemoto, Michiharu
A2 - Sakib, Nazmus
A2 - Elsayed, Marwa
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 8 July 2025 through 11 July 2025
ER -