Machine Learning for Chinese Corporate Fraud Prediction: Segmented Models Based on Optimal Training Windows

Chang Chuan Goh; Yue Yang; Anthony Bellotti; Xiuping Hua

doi:10.3390/info16050397

Research output: Journal Publication › Article › peer-review

Abstract

We propose a comprehensive and practical framework for Chinese corporate fraud prediction that incorporates classifiers, class imbalance, population drift, segmented models, and model evaluation using machine learning algorithms. Based on a three-stage experiment, we first find that the random forest classifier has the best performance in predicting corporate fraud among 17 machine learning models. We then implement the sliding time window approach to handle population drift; the optimal training window found demonstrates the existence of population drift in fraud detection and the need to address it for improved model performance. Using the best machine learning model and the optimal training window, we build a general model and segmented models to compare fraud types and industries based on their respective predictive performance via four evaluation metrics and their top features using SHAP. The results indicate that segmented models have better predictive performance than the general model for fraud types with low fraud rates and are as good as the general model for most industries when controlling for training set size. The dissimilarities between the top feature sets of the general and segmented models suggest that segmented models are useful in providing a better understanding of fraud occurrence.

Original language	English
Article number	397
Journal	Information (Switzerland)
Volume	16
Issue number	5
DOIs	https://doi.org/10.3390/info16050397
Publication status	Published - May 2025

Keywords

corporate fraud
fraud type
industry
machine learning
population drift
segmented model

ASJC Scopus subject areas

Information Systems

Cite this

@article{3d57da76b1bf4bac9bd5f95a10d6a863,

title = "Machine Learning for Chinese Corporate Fraud Prediction: Segmented Models Based on Optimal Training Windows",

abstract = "We propose a comprehensive and practical framework for Chinese corporate fraud prediction which incorporates classifiers, class imbalance, population drift, segmented models, and model evaluation using machine learning algorithms. Based on a three-stage experiment, we first find that the random forest classifier has the best performance in predicting corporate fraud among 17 machine learning models. We then implement the sliding time window approach to handle population drift, and the optimal training window found demonstrates the existence of population drift in fraud detection and the need to address it for improved model performance. Using the best machine learning model and optimal training window, we build general model and segmented models to compare fraud types and industries based on their respective predictive performance via four evaluation metrics and top features using SHAP. The results indicate that segmented models have a better predictive performance than the general model for fraud types with low fraud rates and are as good as the general model for most industries when controlling for training set size. The dissimilarities between the top features set of the general and segmented models suggest that segmented models are useful in providing a better understanding of fraud occurrence.",

keywords = "corporate fraud, fraud type, industry, machine learning, population drift, segmented model",

author = "Goh, {Chang Chuan} and Yue Yang and Anthony Bellotti and Xiuping Hua",

note = "Publisher Copyright: {\textcopyright} 2025 by the authors.",

year = "2025",

month = may,

doi = "10.3390/info16050397",

language = "English",

volume = "16",

journal = "Information (Switzerland)",

issn = "2078-2489",

number = "5",

}

TY - JOUR

T1 - Machine Learning for Chinese Corporate Fraud Prediction

T2 - Segmented Models Based on Optimal Training Windows

AU - Goh, Chang Chuan

AU - Yang, Yue

AU - Bellotti, Anthony

AU - Hua, Xiuping

PY - 2025/5

Y1 - 2025/5

AB - We propose a comprehensive and practical framework for Chinese corporate fraud prediction which incorporates classifiers, class imbalance, population drift, segmented models, and model evaluation using machine learning algorithms. Based on a three-stage experiment, we first find that the random forest classifier has the best performance in predicting corporate fraud among 17 machine learning models. We then implement the sliding time window approach to handle population drift, and the optimal training window found demonstrates the existence of population drift in fraud detection and the need to address it for improved model performance. Using the best machine learning model and optimal training window, we build general model and segmented models to compare fraud types and industries based on their respective predictive performance via four evaluation metrics and top features using SHAP. The results indicate that segmented models have a better predictive performance than the general model for fraud types with low fraud rates and are as good as the general model for most industries when controlling for training set size. The dissimilarities between the top features set of the general and segmented models suggest that segmented models are useful in providing a better understanding of fraud occurrence.

KW - corporate fraud

KW - fraud type

KW - industry

KW - machine learning

KW - population drift

KW - segmented model

UR - http://www.scopus.com/inward/record.url?scp=105006590938&partnerID=8YFLogxK

U2 - 10.3390/info16050397

DO - 10.3390/info16050397

M3 - Article

AN - SCOPUS:105006590938

SN - 2078-2489

VL - 16

JO - Information (Switzerland)

JF - Information (Switzerland)

IS - 5

M1 - 397

ER -