Addressing Class Imbalance in Loan Default Prediction: A Cluster-Based Synthesis Approach

Yue Yang; Boon Giin Lee; Anthony Graham Bellotti; Qinglin Mao

Abstract

Loan default prediction is essentially a binary classification problem. However, most publicly available loan-level datasets are highly imbalanced, with defaults accounting for less than 1% of all loans. Class imbalance can heavily bias a model toward the majority class, and this bias is not clearly reflected in common performance evaluation metrics. An important means of addressing this bias is to balance the default and non-default classes by synthesizing additional default records. This study proposes integrating clustering with oversampling, building on state-of-the-art methods such as SMOTE and GAN, to preprocess the dataset. The method is tested on the US Freddie Mac single-family loan-level dataset. The dataset is first divided into several clusters using hierarchical clustering, to address the “Simpson’s paradox” issue. Then SMOTE, GAN, and other oversampling methods are applied to synthesize minority data until the number of default records matches the number of non-default records in each cluster. Additionally, this study explores the impact of preprocessing techniques on model performance. Classifiers such as decision tree, multilayer perceptron, CatBoost, and XGBoost are trained on the augmented dataset, and ensemble learning is used to improve their predictive performance. The experimental results indicate that default prediction using cluster-based data as training and testing samples achieved better performance than default prediction using the whole dataset as training and testing sets. The proposed method shows significant improvements in predicting the minority class (default), which addresses the generalizability limitation of existing oversampling methods.
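The cluster-then-oversample idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation, which the abstract does not give. It assumes single-linkage agglomerative clustering (the abstract does not state a linkage) and a simplified SMOTE-style interpolation between minority neighbours rather than a full SMOTE or GAN. The function names (`agglomerative_clusters`, `smote_like`, `balance_per_cluster`) and the toy data are hypothetical, chosen only for illustration.

```python
import random
from math import dist


def agglomerative_clusters(points, n_clusters):
    """Naive single-linkage agglomerative clustering (assumed linkage).

    Returns a list of clusters, each a list of indices into `points`.
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters.pop(b))
    return clusters


def smote_like(minority, n_new, k, rng):
    """Simplified SMOTE-style synthesis: interpolate between a minority
    point and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        # Nothing to interpolate with, so fall back to duplication.
        return [tuple(minority[0]) for _ in range(n_new)]
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: dist(base, p))[:k]
        other = rng.choice(neighbours)
        t = rng.random()
        synthetic.append(tuple(b + t * (o - b) for b, o in zip(base, other)))
    return synthetic


def balance_per_cluster(X, y, n_clusters=2, k=3, seed=0):
    """Oversample class 1 (default) inside each cluster until it matches
    the count of class 0 (non-default) in that cluster."""
    rng = random.Random(seed)
    X_out, y_out = list(X), list(y)
    for members in agglomerative_clusters(X, n_clusters):
        minority = [X[i] for i in members if y[i] == 1]
        n_majority = sum(1 for i in members if y[i] == 0)
        deficit = n_majority - len(minority)
        # A cluster with no defaults at all is left unchanged in this sketch.
        if minority and deficit > 0:
            X_out.extend(smote_like(minority, deficit, k, rng))
            y_out.extend([1] * deficit)
    return X_out, y_out


# Toy data (invented): two well-separated groups, label 1 = default.
X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.3, 0.3), (0.2, 0.4), (0.4, 0.2),
     (5.0, 5.0), (5.1, 5.2), (5.2, 5.1), (5.3, 5.3), (5.2, 5.4)]
y = [0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1]

X_bal, y_bal = balance_per_cluster(X, y, n_clusters=2)
print(y_bal.count(0), y_bal.count(1))
```

Because synthesis happens inside each cluster, every synthetic default is interpolated only from defaults in the same cluster. In practice a library implementation of hierarchical clustering and SMOTE would replace these naive routines, and the oversampling would be applied to training data only.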

Original language	English
Pages	1
Number of pages	1
Publication status	Published - 1 Sept 2023
Event	Credit Scoring and Credit Control Conference XVIII - Duration: 30 Aug 2023 → … https://www.crc.business-school.ed.ac.uk/conferences

Conference

Conference	Credit Scoring and Credit Control Conference XVIII
Period	30/08/23 → …
Internet address	https://www.crc.business-school.ed.ac.uk/conferences

Keywords

finance
oversampling
fraud detection

Cite this

@conference{4ae610d1f70c42f9a0118c54585bd30f,

title = "Addressing Class Imbalance in Loan Default Prediction: A Cluster-Based Synthesis Approach",

abstract = "Loan default prediction is essentially a binary classification problem. However, most publicly available loan-level datasets are highly imbalanced, with defaults accounting for less than 1\% of all loans. Class imbalance can heavily bias a model toward the majority class, and this bias is not clearly reflected in common performance evaluation metrics. An important means of addressing this bias is to balance the default and non-default classes by synthesizing additional default records. This study proposes integrating clustering with oversampling, building on state-of-the-art methods such as SMOTE and GAN, to preprocess the dataset. The method is tested on the US Freddie Mac single-family loan-level dataset. The dataset is first divided into several clusters using hierarchical clustering, to address the “Simpson{\textquoteright}s paradox” issue. Then SMOTE, GAN, and other oversampling methods are applied to synthesize minority data until the number of default records matches the number of non-default records in each cluster. Additionally, this study explores the impact of preprocessing techniques on model performance. Classifiers such as decision tree, multilayer perceptron, CatBoost, and XGBoost are trained on the augmented dataset, and ensemble learning is used to improve their predictive performance. The experimental results indicate that default prediction using cluster-based data as training and testing samples achieved better performance than default prediction using the whole dataset as training and testing sets. The proposed method shows significant improvements in predicting the minority class (default), which addresses the generalizability limitation of existing oversampling methods.",

keywords = "finance, oversampling, fraud detection",

author = "Yue Yang and Lee, {Boon Giin} and Bellotti, {Anthony Graham} and Qinglin Mao",

year = "2023",

month = sep,

day = "1",

language = "English",

pages = "1",

note = "Credit Scoring and Credit Control Conference XVIII ; Conference date: 30-08-2023",

url = "https://www.crc.business-school.ed.ac.uk/conferences",

}

TY - CONF

T1 - Addressing Class Imbalance in Loan Default Prediction: A Cluster-Based Synthesis Approach

AU - Yang, Yue

AU - Lee, Boon Giin

AU - Bellotti, Anthony Graham

AU - Mao, Qinglin

PY - 2023/9/1

Y1 - 2023/9/1

N2 - Loan default prediction is essentially a binary classification problem. However, most publicly available loan-level datasets are highly imbalanced, with defaults accounting for less than 1% of all loans. Class imbalance can heavily bias a model toward the majority class, and this bias is not clearly reflected in common performance evaluation metrics. An important means of addressing this bias is to balance the default and non-default classes by synthesizing additional default records. This study proposes integrating clustering with oversampling, building on state-of-the-art methods such as SMOTE and GAN, to preprocess the dataset. The method is tested on the US Freddie Mac single-family loan-level dataset. The dataset is first divided into several clusters using hierarchical clustering, to address the “Simpson’s paradox” issue. Then SMOTE, GAN, and other oversampling methods are applied to synthesize minority data until the number of default records matches the number of non-default records in each cluster. Additionally, this study explores the impact of preprocessing techniques on model performance. Classifiers such as decision tree, multilayer perceptron, CatBoost, and XGBoost are trained on the augmented dataset, and ensemble learning is used to improve their predictive performance. The experimental results indicate that default prediction using cluster-based data as training and testing samples achieved better performance than default prediction using the whole dataset as training and testing sets. The proposed method shows significant improvements in predicting the minority class (default), which addresses the generalizability limitation of existing oversampling methods.

KW - finance

KW - oversampling

KW - fraud detection

UR - https://edinburghuni.eventsair.com/credit-scoring-and-credit-control-conference-xviii

M3 - Abstract

SP - 1

T2 - Credit Scoring and Credit Control Conference XVIII

Y2 - 30 August 2023

ER -
