Lung cancer is one of the deadliest cancers worldwide, and its mortality rate is high when it is diagnosed late. Early detection is therefore crucial for improving survival, and lung cancer screening is one of the most important intervention tools. However, screening is cost-effective only if we can accurately select the sub-population at highest risk of lung cancer. We hypothesise that this selection can be done cost-effectively using clinical data (e.g. demographic, lifestyle and comorbidity variables) rather than low-dose CT. This work uses clinical data extracted from the Clinical Practice Research Datalink (CPRD). The goal is to test whether this approach can achieve comparable or even better selection performance than an alternative approach that uses clinical data from lung cancer screening trials, as adopted in previous work. In this paper, we adapt the logistic regression model to perform classification and feature selection jointly. The model is implemented in an ‘ensemble learning’ manner to deal with the severe ‘class imbalance’ problem. The resulting sensitivity and specificity are slightly better than those reported in that previous work. We also identify a comorbidity factor, COPD, and a smoking-related factor, smk-status, as the two most discriminative features.
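To make the modelling idea concrete, the following is a minimal sketch of one way an ensemble of sparse logistic regression models could handle class imbalance. It is not the paper's implementation. It assumes an L1 penalty as the feature selection mechanism and random undersampling of controls as the balancing strategy, neither of which is stated above. All function names, hyperparameters and the synthetic data are hypothetical.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(w, t):
    # Proximal operator of the L1 penalty: shrinks w towards zero by t.
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0


def fit_l1_logistic(X, y, lam=0.01, lr=0.5, n_iter=200):
    """Fit L1-penalised logistic regression by proximal gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            grad_b += err / n
            for j in range(d):
                grad_w[j] += err * xi[j] / n
        b -= lr * grad_b  # the intercept is not penalised
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad_w)]
    return w, b


def fit_balanced_ensemble(X, y, n_models=5, seed=0, **fit_kwargs):
    """Train one model per balanced subset: all cases plus an
    equal-sized random sample of controls (assumes cases are the minority)."""
    rng = random.Random(seed)
    pos = [i for i, yi in enumerate(y) if yi == 1]
    neg = [i for i, yi in enumerate(y) if yi == 0]
    models = []
    for _ in range(n_models):
        idx = pos + rng.sample(neg, len(pos))
        models.append(
            fit_l1_logistic([X[i] for i in idx], [y[i] for i in idx], **fit_kwargs)
        )
    return models


def predict_proba(models, x):
    # Average the predicted probabilities of the ensemble members.
    probs = [sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) for w, b in models]
    return sum(probs) / len(probs)


def feature_importance(models):
    # Mean absolute coefficient per feature across the ensemble.
    d = len(models[0][0])
    return [sum(abs(w[j]) for w, _ in models) / len(models) for j in range(d)]


def make_toy_data(n=800, seed=1):
    # Synthetic data: only features 0 and 1 carry signal, and the negative
    # intercept makes the positive class rare.
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        x = [rng.gauss(0, 1) for _ in range(4)]
        p = sigmoid(2.0 * x[0] + 1.5 * x[1] - 3.5)
        X.append(x)
        y.append(1 if rng.random() < p else 0)
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    models = fit_balanced_ensemble(X, y)
    print("positives:", sum(y), "of", len(y))
    print("feature importance:", feature_importance(models))
```

In this sketch, ranking features by their mean absolute coefficient across the ensemble plays the role of the joint feature selection step. Models trained on balanced subsets output probabilities that overstate the true prevalence, so the decision threshold would need to be chosen on held-out data with the original class ratio.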