A Novel Vulnerability-Detection Method Based on the Semantic Features of Source Code and the LLVM Intermediate Representation

Jinfu Chen; Jiapeng Zhou; Wei Lin; Dave Towey; Saihua Cai; Haibo Chen; Jingyi Chen; Yemin Yin

doi:10.1002/smr.70026

School of Computer Science

Research output: Journal Publication › Article › peer-review

Abstract

With the increasingly frequent attacks on software systems, software security is an issue that must be addressed. Within software security, automated detection of software vulnerabilities is an important subject. Most existing vulnerability detectors rely on the features of a single code type (e.g., source code or intermediate representation [IR]), which may lead to both the global features of the code slices and the memory operation information not being captured or considered. In particular, vulnerability detection based on source-code features cannot usually include some macro or type definition content. In this paper, we propose a vulnerability-detection method that combines the semantic features of source code and the low level virtual machine (LLVM) IR. Our proposed approach starts by slicing (C/C++) source files using improved slicing techniques to cover more comprehensive code information. It then extracts semantic information from the LLVM IR based on the executable source code. This can enrich the features fed to the artificial neural network (ANN) model for learning. We conducted an experimental evaluation using a publicly-available dataset of 11,381 C/C++ programs. The experimental results show the vulnerability-detection accuracy of our proposed method to reach over 96% for code slices generated according to four different slicing criteria. This outperforms most other compared detection methods.

Original language	English
Article number	e70026
Journal	Journal of Software: Evolution and Process
Volume	37
Issue number	5
DOIs	https://doi.org/10.1002/smr.70026
Publication status	Published - May 2025

Keywords

deep learning
intermediate representation
program representation
vulnerability detection

ASJC Scopus subject areas

Software

Access to Document

10.1002/smr.70026

Cite this

@article{04a15832b17a4f5dabd1f37ca6f4efdc,

title = "A Novel Vulnerability-Detection Method Based on the Semantic Features of Source Code and the LLVM Intermediate Representation",

abstract = "With the increasingly frequent attacks on software systems, software security is an issue that must be addressed. Within software security, automated detection of software vulnerabilities is an important subject. Most existing vulnerability detectors rely on the features of a single code type (e.g., source code or intermediate representation [IR]), which may lead to both the global features of the code slices and the memory operation information not being captured or considered. In particular, vulnerability detection based on source-code features cannot usually include some macro or type definition content. In this paper, we propose a vulnerability-detection method that combines the semantic features of source code and the low level virtual machine (LLVM) IR. Our proposed approach starts by slicing (C/C++) source files using improved slicing techniques to cover more comprehensive code information. It then extracts semantic information from the LLVM IR based on the executable source code. This can enrich the features fed to the artificial neural network (ANN) model for learning. We conducted an experimental evaluation using a publicly-available dataset of 11,381 C/C++ programs. The experimental results show the vulnerability-detection accuracy of our proposed method to reach over 96% for code slices generated according to four different slicing criteria. This outperforms most other compared detection methods.",

keywords = "deep learning, intermediate representation, program representation, vulnerability detection",

author = "Jinfu Chen and Jiapeng Zhou and Wei Lin and Dave Towey and Saihua Cai and Haibo Chen and Jingyi Chen and Yemin Yin",

note = "Publisher Copyright: {\textcopyright} 2025 John Wiley \& Sons Ltd.",

year = "2025",

month = may,

doi = "10.1002/smr.70026",

language = "English",

volume = "37",

journal = "Journal of Software: Evolution and Process",

issn = "1532-060X",

publisher = "John Wiley and Sons Ltd",

number = "5",

}

TY - JOUR

T1 - A Novel Vulnerability-Detection Method Based on the Semantic Features of Source Code and the LLVM Intermediate Representation

AU - Chen, Jinfu

AU - Zhou, Jiapeng

AU - Lin, Wei

AU - Towey, Dave

AU - Cai, Saihua

AU - Chen, Haibo

AU - Chen, Jingyi

AU - Yin, Yemin

PY - 2025/5

Y1 - 2025/5

N2 - With the increasingly frequent attacks on software systems, software security is an issue that must be addressed. Within software security, automated detection of software vulnerabilities is an important subject. Most existing vulnerability detectors rely on the features of a single code type (e.g., source code or intermediate representation [IR]), which may lead to both the global features of the code slices and the memory operation information not being captured or considered. In particular, vulnerability detection based on source-code features cannot usually include some macro or type definition content. In this paper, we propose a vulnerability-detection method that combines the semantic features of source code and the low level virtual machine (LLVM) IR. Our proposed approach starts by slicing (C/C++) source files using improved slicing techniques to cover more comprehensive code information. It then extracts semantic information from the LLVM IR based on the executable source code. This can enrich the features fed to the artificial neural network (ANN) model for learning. We conducted an experimental evaluation using a publicly-available dataset of 11,381 C/C++ programs. The experimental results show the vulnerability-detection accuracy of our proposed method to reach over 96% for code slices generated according to four different slicing criteria. This outperforms most other compared detection methods.

KW - deep learning

KW - intermediate representation

KW - program representation

KW - vulnerability detection

UR - http://www.scopus.com/inward/record.url?scp=105004194125&partnerID=8YFLogxK

U2 - 10.1002/smr.70026

DO - 10.1002/smr.70026

M3 - Article

AN - SCOPUS:105004194125

SN - 1532-060X

VL - 37

JO - Journal of Software: Evolution and Process

JF - Journal of Software: Evolution and Process

IS - 5

M1 - e70026

ER -