ISSUES USING LOGISTIC REGRESSION WITH CLASS IMBALANCE, WITH A CASE STUDY FROM CREDIT RISK MODELLING

Yazhe Li; Tony Bellotti; Niall Adams

doi:10.3934/fods.2019016

Research output: Journal Publication › Article › peer-review

11 Citations (Scopus)

Abstract

The class imbalance problem arises in two-class classification problems, when the less frequent (minority) class is observed much less than the majority class. This characteristic is endemic in many problems such as modeling default or fraud detection. Recent work by Owen [19] has shown that, in a theoretical context related to infinite imbalance, logistic regression behaves in such a way that all data in the rare class can be replaced by their mean vector to achieve the same coefficient estimates. We build on Owen’s results to show the phenomenon remains true for both weighted and penalized likelihood methods. Such results suggest that problems may occur if there is structure within the rare class that is not captured by the mean vector. We demonstrate this problem and suggest a relabelling solution based on clustering the minority class. In a simulation and a real mortgage dataset, we show that logistic regression is not able to provide the best out-of-sample predictive performance and that an approach that is able to model underlying structure in the minority class is often superior.
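The relabelling idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it is a toy one-dimensional example in plain Python, under the assumptions that the minority class has two modes placed symmetrically around the majority class, that the number of clusters (two) is known, and that a simple batch gradient descent is an acceptable stand-in for a proper logistic regression fit. All function names, sample sizes, and distribution parameters below are hypothetical choices made for illustration.

```python
import math
import random


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fit_softmax(xs, labels, n_classes, lr=0.1, epochs=300):
    """Multinomial logistic regression on one scalar feature (batch gradient descent)."""
    w, b = [0.0] * n_classes, [0.0] * n_classes
    n = len(xs)
    for _ in range(epochs):
        gw, gb = [0.0] * n_classes, [0.0] * n_classes
        for x, y in zip(xs, labels):
            p = softmax([w[k] * x + b[k] for k in range(n_classes)])
            for k in range(n_classes):
                err = p[k] - (1.0 if k == y else 0.0)
                gw[k] += err * x
                gb[k] += err
        for k in range(n_classes):
            w[k] -= lr * gw[k] / n
            b[k] -= lr * gb[k] / n
    return w, b


def minority_score(x, w, b):
    """Predicted probability of not belonging to the majority class (class 0)."""
    return 1.0 - softmax([wk * x + bk for wk, bk in zip(w, b)])[0]


def two_means_1d(points, iters=20):
    """Tiny 1-D k-means with k=2, started from the smallest and largest point."""
    c0, c1 = min(points), max(points)
    assign = []
    for _ in range(iters):
        assign = [0 if abs(p - c0) <= abs(p - c1) else 1 for p in points]
        g0 = [p for p, a in zip(points, assign) if a == 0]
        g1 = [p for p, a in zip(points, assign) if a == 1]
        c0, c1 = sum(g0) / len(g0), sum(g1) / len(g1)
    return assign


def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def sample(rng, n_major, n_minor_per_mode):
    """Majority around 0; minority split into two modes at -4 and +4 (assumed)."""
    xs = [rng.gauss(0.0, 1.0) for _ in range(n_major)]
    ys = [0] * n_major
    for centre in (-4.0, 4.0):
        xs += [rng.gauss(centre, 0.5) for _ in range(n_minor_per_mode)]
        ys += [1] * n_minor_per_mode
    return xs, ys


def run_demo(seed=0):
    rng = random.Random(seed)
    x_train, y_train = sample(rng, 500, 10)
    x_test, y_test = sample(rng, 500, 10)

    # Plain binary logistic regression (softmax with two classes).
    w, b = fit_softmax(x_train, y_train, n_classes=2)
    auc_plain = auc([minority_score(x, w, b) for x in x_test], y_test)

    # Relabel: split the minority class into two clusters, giving three classes.
    minority = [x for x, y in zip(x_train, y_train) if y == 1]
    clusters = iter(two_means_1d(minority))
    relabelled = [0 if y == 0 else 1 + next(clusters) for y in y_train]
    w3, b3 = fit_softmax(x_train, relabelled, n_classes=3)

    # Score = total probability over the minority sub-classes.
    auc_relabelled = auc([minority_score(x, w3, b3) for x in x_test], y_test)
    return auc_plain, auc_relabelled


auc_plain, auc_relabelled = run_demo()
print("AUC, plain logistic regression:", round(auc_plain, 3))
print("AUC, relabelled minority class:", round(auc_relabelled, 3))
```

In this constructed setting the minority mean coincides with the majority centre, so a single linear boundary has little to work with and the plain model's out-of-sample AUC should sit close to chance. After relabelling, each minority cluster gets its own coefficients, and summing the sub-class probabilities gives a score that rises on both sides of the majority class. Real data will rarely be this clean, and the number of clusters would need to be chosen rather than assumed.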

Original language	English
Pages (from-to)	389-417
Number of pages	29
Journal	Foundations of Data Science
Volume	1
Issue number	4
DOIs	https://doi.org/10.3934/fods.2019016
Publication status	Published - Dec 2019
Externally published	Yes

Keywords

Class imbalance
logistic regression
relabeling

ASJC Scopus subject areas

Analysis
Statistics and Probability
Computational Theory and Mathematics
Applied Mathematics

Cite this

@article{86d5b156e1504c82b7f851866826ade9,
  title = "ISSUES USING LOGISTIC REGRESSION WITH CLASS IMBALANCE, WITH A CASE STUDY FROM CREDIT RISK MODELLING",
  abstract = "The class imbalance problem arises in two-class classification problems, when the less frequent (minority) class is observed much less than the majority class. This characteristic is endemic in many problems such as modeling default or fraud detection. Recent work by Owen [19] has shown that, in a theoretical context related to infinite imbalance, logistic regression behaves in such a way that all data in the rare class can be replaced by their mean vector to achieve the same coefficient estimates. We build on Owen{\textquoteright}s results to show the phenomenon remains true for both weighted and penalized likelihood methods. Such results suggest that problems may occur if there is structure within the rare class that is not captured by the mean vector. We demonstrate this problem and suggest a relabelling solution based on clustering the minority class. In a simulation and a real mortgage dataset, we show that logistic regression is not able to provide the best out-of-sample predictive performance and that an approach that is able to model underlying structure in the minority class is often superior.",
  keywords = "Class imbalance, logistic regression, relabeling",
  author = "Yazhe Li and Tony Bellotti and Niall Adams",
  note = "Publisher Copyright: {\textcopyright} American Institute of Mathematical Sciences.",
  year = "2019",
  month = dec,
  doi = "10.3934/fods.2019016",
  language = "English",
  volume = "1",
  pages = "389--417",
  journal = "Foundations of Data Science",
  issn = "2639-8001",
  publisher = "American Institute of Mathematical Sciences",
  number = "4",
}

TY  - JOUR
T1  - ISSUES USING LOGISTIC REGRESSION WITH CLASS IMBALANCE, WITH A CASE STUDY FROM CREDIT RISK MODELLING
AU  - Li, Yazhe
AU  - Bellotti, Tony
AU  - Adams, Niall
N1  - Publisher Copyright: © American Institute of Mathematical Sciences.
PY  - 2019/12
Y1  - 2019/12
N2  - The class imbalance problem arises in two-class classification problems, when the less frequent (minority) class is observed much less than the majority class. This characteristic is endemic in many problems such as modeling default or fraud detection. Recent work by Owen [19] has shown that, in a theoretical context related to infinite imbalance, logistic regression behaves in such a way that all data in the rare class can be replaced by their mean vector to achieve the same coefficient estimates. We build on Owen’s results to show the phenomenon remains true for both weighted and penalized likelihood methods. Such results suggest that problems may occur if there is structure within the rare class that is not captured by the mean vector. We demonstrate this problem and suggest a relabelling solution based on clustering the minority class. In a simulation and a real mortgage dataset, we show that logistic regression is not able to provide the best out-of-sample predictive performance and that an approach that is able to model underlying structure in the minority class is often superior.
AB  - The class imbalance problem arises in two-class classification problems, when the less frequent (minority) class is observed much less than the majority class. This characteristic is endemic in many problems such as modeling default or fraud detection. Recent work by Owen [19] has shown that, in a theoretical context related to infinite imbalance, logistic regression behaves in such a way that all data in the rare class can be replaced by their mean vector to achieve the same coefficient estimates. We build on Owen’s results to show the phenomenon remains true for both weighted and penalized likelihood methods. Such results suggest that problems may occur if there is structure within the rare class that is not captured by the mean vector. We demonstrate this problem and suggest a relabelling solution based on clustering the minority class. In a simulation and a real mortgage dataset, we show that logistic regression is not able to provide the best out-of-sample predictive performance and that an approach that is able to model underlying structure in the minority class is often superior.
KW  - Class imbalance
KW  - logistic regression
KW  - relabeling
UR  - http://www.scopus.com/inward/record.url?scp=85104446570&partnerID=8YFLogxK
U2  - 10.3934/fods.2019016
DO  - 10.3934/fods.2019016
M3  - Article
AN  - SCOPUS:85104446570
SN  - 2639-8001
VL  - 1
SP  - 389
EP  - 417
JO  - Foundations of Data Science
JF  - Foundations of Data Science
IS  - 4
ER  -
