Abstract
We propose a comprehensive and practical framework for Chinese corporate fraud prediction that incorporates classifiers, class imbalance, population drift, segmented models, and model evaluation using machine learning algorithms. In a three-stage experiment, we first find that the random forest classifier performs best in predicting corporate fraud among 17 machine learning models. We then implement the sliding time window approach to handle population drift; the optimal training window we find demonstrates the existence of population drift in fraud detection and the need to address it for improved model performance. Using the best machine learning model and the optimal training window, we build a general model and segmented models to compare fraud types and industries based on their respective predictive performance, measured with four evaluation metrics, and their top features, identified using SHAP. The results indicate that segmented models have better predictive performance than the general model for fraud types with low fraud rates and are as good as the general model for most industries when controlling for training set size. The dissimilarities between the top feature sets of the general and segmented models suggest that segmented models are useful in providing a better understanding of fraud occurrence.
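To make the sliding time window idea concrete, the following is a minimal sketch of how training and test periods could be paired. It is not the authors' implementation: the function name `sliding_window_splits`, the sample years, and the window length of 3 are all assumptions chosen for illustration, since the abstract does not state them.

```python
from typing import Iterator, List, Tuple


def sliding_window_splits(
    years: List[int], train_window: int
) -> Iterator[Tuple[List[int], int]]:
    """Yield (training_years, test_year) pairs.

    Each test year is paired with the `train_window` years immediately
    before it, so older observations drop out as the window slides.
    """
    ordered = sorted(set(years))
    for i in range(train_window, len(ordered)):
        yield ordered[i - train_window:i], ordered[i]


# Hypothetical sample period and window length, for illustration only.
sample_years = list(range(2010, 2016))
splits = list(sliding_window_splits(sample_years, train_window=3))
for train_years, test_year in splits:
    print(train_years, "->", test_year)
```

Under this sketch, an optimal window would be chosen by repeating the splits for several candidate values of `train_window`, fitting the classifier on each training window, and comparing an evaluation metric on the corresponding test years.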
Original language | English |
---|---|
Article number | 397 |
Journal | Information (Switzerland) |
Volume | 16 |
Issue number | 5 |
Publication status | Published - May 2025 |
Keywords
- corporate fraud
- fraud type
- industry
- machine learning
- population drift
- segmented model
ASJC Scopus subject areas
- Information Systems