@inproceedings{8b2a18a87c5b4ba2bb2fa5fe33f5fe2f,
  title     = {Understanding Feature Importance of Prediction Models Based on Lung Cancer Primary Care Data},
  abstract  = {Machine learning (ML) models in healthcare are increasing but the lack of interpretability of these models results in them not being suitable for use in clinical practice. In the medical field, it is vital to clarify to clinicians and patients the rationale behind a model's high probability prediction for a specific disease in an individual patient. This transparency fosters trust, facilitates informed decision-making, and empowers both clinicians and patients to understand the underlying factors driving the model's output. This paper aims to incorporate explainability to ML models such as Random Forest (RF), eXtreme Gradient Boosting (XGBoost) and Multilayer Perceptron (MLP) for using with Clinical Practice Research Datalink (CPRD) data and interpret them in terms of feature importance to identify the top most features when distinguishing between lung cancer and non-lung cancer cases. The SHapley Additive exPlanations (SHAP) method has been used in this work to interpret the models. We use SHAP to gain insights into explaining individual predictions as well as interpreting them globally. The feature importance from SHAP is compared with the default feature importance of the models to identify any discrepancies between the results. Based on experimental findings, it has been found that the default feature importance from the tree-based models and SHAP is consistent with features 'age' and 'smoking status' which serve as the top features for predicting lung cancer among patients. Additionally, this work pinpoints that feature importance for a single patient may vary leading to a varied prediction depending on the employed model. Finally, the work concludes that individual-level explanation of feature importance is crucial in mission-critical applications like healthcare to better understand personal health and lifestyle factors in the early prediction of diseases that may lead to terminal illness.},
  keywords  = {CPRD, SHAP, feature importance, interpretability, lung cancer, machine learning},
  author    = {Rai, Teena and Shen, Yuan and He, Jun and Mahmud, Mufti and Brown, David J. and Kaur, Jaspreet and O'Dowd, Emma and Baldwin, David R. and Hubbard, Richard},
  note      = {Publisher Copyright: {\textcopyright} 2024 IEEE.; 2024 International Joint Conference on Neural Networks, IJCNN 2024 ; Conference date: 30-06-2024 Through 05-07-2024},
  year      = {2024},
  doi       = {10.1109/IJCNN60899.2024.10650819},
  language  = {English},
  series    = {Proceedings of the International Joint Conference on Neural Networks},
  publisher = {Institute of Electrical and Electronics Engineers Inc.},
  booktitle = {2024 International Joint Conference on Neural Networks, {IJCNN} 2024 - Proceedings},
  address   = {United States},
}