ISSUES USING LOGISTIC REGRESSION WITH CLASS IMBALANCE, WITH A CASE STUDY FROM CREDIT RISK MODELLING

Yazhe Li, Tony Bellotti, Niall Adams

Research output: Journal Publication › Article › peer-review

7 Citations (Scopus)

Abstract

The class imbalance problem arises in two-class classification problems when the less frequent (minority) class is observed much less often than the majority class. This characteristic is endemic in many problems, such as default modelling and fraud detection. Recent work by Owen [19] has shown that, in a theoretical context related to infinite imbalance, logistic regression behaves in such a way that all data in the rare class can be replaced by their mean vector to achieve the same coefficient estimates. We build on Owen’s results to show that the phenomenon remains true for both weighted and penalized likelihood methods. Such results suggest that problems may occur if there is structure within the rare class that is not captured by the mean vector. We demonstrate this problem and suggest a relabelling solution based on clustering the minority class. In a simulation and a real mortgage dataset, we show that logistic regression is not able to provide the best out-of-sample predictive performance and that an approach that is able to model underlying structure in the minority class is often superior.
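The relabelling idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say which clustering algorithm or how many clusters the authors use, so everything below is an assumption made for illustration: plain k-means, the choice of k, the convention that the minority class is labelled 1, and the hypothetical helper names `kmeans` and `relabel_minority`. The sketch only produces the new labels (0 for the majority class, 1..k for the minority clusters).

```python
import random
from typing import List, Sequence

Point = Sequence[float]


def _sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _mean(points: List[Point]) -> List[float]:
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def kmeans(points: List[Point], k: int, n_iter: int = 50, seed: int = 0) -> List[int]:
    """Plain Lloyd's k-means (an assumed choice); returns a cluster index per point."""
    rng = random.Random(seed)
    centres = [list(p) for p in rng.sample(list(points), k)]
    assign = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centre.
        assign = [min(range(k), key=lambda j: _sq_dist(p, centres[j])) for p in points]
        # Update step: each centre moves to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:  # keep the old centre if a cluster becomes empty
                centres[j] = _mean(members)
    return assign


def relabel_minority(X: List[Point], y: List[int], k: int, seed: int = 0) -> List[int]:
    """Return new labels: 0 for majority rows, 1..k for minority clusters.

    Assumes the minority class is coded as 1 in y (hypothetical convention).
    """
    minority_idx = [i for i, label in enumerate(y) if label == 1]
    clusters = kmeans([X[i] for i in minority_idx], k, seed=seed)
    new_y = [0] * len(y)
    for i, c in zip(minority_idx, clusters):
        new_y[i] = c + 1
    return new_y


# Toy data: the minority class forms two separated groups whose mean vector
# sits in the middle of the majority class, so the mean alone hides the structure.
X = [(-5.0, 0.0), (-5.1, 0.1), (-4.9, -0.1),
     (5.0, 0.0), (5.1, 0.1), (4.9, -0.1),
     (0.0, 0.0), (0.2, -0.1), (-0.1, 0.3), (0.1, 0.1)]
y = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
new_y = relabel_minority(X, y, k=2)
```

One possible follow-up, again an assumption rather than the authors' stated procedure, is to fit a multi-class model to `new_y` and recover the minority-class probability by summing the predicted probabilities of the minority sub-classes.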

Original language: English
Pages (from-to): 389-417
Number of pages: 29
Journal: Foundations of Data Science
Volume: 1
Issue number: 4
Publication status: Published - Dec 2019
Externally published: Yes

Keywords

  • Class imbalance
  • logistic regression
  • relabeling

ASJC Scopus subject areas

  • Computational Theory and Mathematics
  • Analysis
  • Statistics and Probability
  • Applied Mathematics
