Machine Learning for Chinese Corporate Fraud Prediction: Segmented Models Based on Optimal Training Windows

Chang Chuan Goh, Yue Yang, Anthony Bellotti, Xiuping Hua

Research output: Journal Publication, Article, peer-review

Abstract

We propose a comprehensive and practical framework for Chinese corporate fraud prediction that incorporates classifiers, class imbalance, population drift, segmented models, and model evaluation using machine learning algorithms. Based on a three-stage experiment, we first find that the random forest classifier performs best at predicting corporate fraud among 17 machine learning models. We then implement the sliding time window approach to handle population drift; the optimal training window found demonstrates that population drift exists in fraud detection and that addressing it improves model performance. Using the best machine learning model and the optimal training window, we build a general model and segmented models to compare fraud types and industries, based on their predictive performance under four evaluation metrics and on their top features as identified with SHAP. The results indicate that segmented models have better predictive performance than the general model for fraud types with low fraud rates and are as good as the general model for most industries when controlling for training set size. The dissimilarities between the top feature sets of the general and segmented models suggest that segmented models are useful in providing a better understanding of fraud occurrence.
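The sliding time window idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact protocol, so everything here is an assumption made for illustration: the function names `sliding_windows` and `mean_score_by_window`, the use of calendar years as the time unit, the one-year-ahead test set, and the `score` callback, which stands in for "fit a classifier on the training years and return a metric on the test year".

```python
from typing import Callable, Dict, Iterator, List, Sequence, Tuple


def sliding_windows(years: Sequence[int], window: int) -> Iterator[Tuple[List[int], int]]:
    """Yield (training_years, test_year) pairs for a fixed-length sliding window.

    Each test year is preceded by exactly `window` consecutive training years.
    """
    ordered = sorted(set(years))
    if window < 1 or window >= len(ordered):
        raise ValueError("window must be between 1 and len(years) - 1")
    for i in range(window, len(ordered)):
        yield ordered[i - window:i], ordered[i]


def mean_score_by_window(
    years: Sequence[int],
    candidate_windows: Sequence[int],
    score: Callable[[List[int], int], float],
) -> Dict[int, float]:
    """Average a user-supplied score over the same test years for each window length.

    Test years start after the largest candidate window, so every window
    length is evaluated on an identical set of test years.
    """
    ordered = sorted(set(years))
    start = max(candidate_windows)
    if min(candidate_windows) < 1 or start >= len(ordered):
        raise ValueError("candidate windows must fit inside the available years")
    results = {}
    for w in candidate_windows:
        scores = [score(ordered[i - w:i], ordered[i]) for i in range(start, len(ordered))]
        results[w] = sum(scores) / len(scores)
    return results


# Hypothetical stand-in for "train on these years, evaluate on the test year".
# It simply peaks at a three-year window; a real study would fit a model here.
def toy_score(train_years: List[int], test_year: int) -> float:
    return 1.0 - 0.1 * abs(len(train_years) - 3)


scores = mean_score_by_window(range(2008, 2020), [1, 3, 5], toy_score)
optimal_window = max(scores, key=scores.get)
```

If the best score occurs at an intermediate window length rather than at the longest one, older data is hurting rather than helping, which is the kind of evidence for population drift the abstract describes. Evaluating every window length on the same test years is a design choice of this sketch, made so that differences in score reflect the window length and not the test period.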

Original language: English
Article number: 397
Journal: Information (Switzerland)
Volume: 16
Issue number: 5
Publication status: Published - May 2025

Keywords

  • corporate fraud
  • fraud type
  • industry
  • machine learning
  • population drift
  • segmented model

ASJC Scopus subject areas

  • Information Systems
