Abstract
The class imbalance problem arises in two-class classification when one class (the minority class) is observed far less frequently than the other (the majority class). This characteristic is endemic in problems such as default modeling and fraud detection. Recent work by Owen [19] has shown that, in a theoretical context of infinite imbalance, logistic regression behaves in such a way that all data in the rare class can be replaced by their mean vector without changing the coefficient estimates. We build on Owen's results to show that this phenomenon remains true for both weighted and penalized likelihood methods. These results suggest that problems may occur if there is structure within the rare class that is not captured by its mean vector. We demonstrate this problem and suggest a relabelling solution based on clustering the minority class. On both simulated data and a real mortgage dataset, we show that logistic regression does not provide the best out-of-sample predictive performance, and that an approach able to model the underlying structure of the minority class is often superior.
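The relabelling idea described in the abstract — splitting the minority class into clusters and treating each cluster as its own label, so that minority structure invisible to the mean vector is preserved — can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the synthetic two-blob minority class, the tiny k-means helper, and all parameter choices are assumptions made for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical imbalanced data: majority class 0, plus a minority class whose
# points come from two separate sub-populations -- structure that a single
# mean vector cannot capture.
X_maj = rng.normal(0.0, 1.0, size=(1000, 2))
X_min = np.vstack([rng.normal([4.0, 4.0], 0.3, size=(25, 2)),
                   rng.normal([-4.0, 4.0], 0.3, size=(25, 2))])

def kmeans(X, k, iters=50):
    """Minimal k-means used only to split the minority class into sub-classes.
    Deterministic init: k points spread evenly through the data."""
    centers = X[np.linspace(0, len(X) - 1, k).astype(int)]
    for _ in range(iters):
        # Distance of every point to every center; assign to the nearest.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        centers = np.array([X[assign == j].mean(axis=0) for j in range(k)])
    return assign

# Relabel: minority points get labels 1..k while the majority keeps label 0,
# turning the two-class problem into a (k+1)-class one. A multiclass model
# fitted to (X, y) can then represent each minority cluster separately.
k = 2
sub = kmeans(X_min, k)
y = np.concatenate([np.zeros(len(X_maj), dtype=int), 1 + sub])
X = np.vstack([X_maj, X_min])
```

After relabelling, predicted minority probability is recovered by summing the predicted probabilities of the k minority sub-classes.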
Original language | English |
---|---|
Pages (from-to) | 389-417 |
Number of pages | 29 |
Journal | Foundations of Data Science |
Volume | 1 |
Issue number | 4 |
DOIs | |
Publication status | Published - Dec 2019 |
Externally published | Yes |
Keywords
- Class imbalance
- logistic regression
- relabeling
ASJC Scopus subject areas
- Analysis
- Statistics and Probability
- Computational Theory and Mathematics
- Applied Mathematics