An Empirical Study of Several Information Theoretic Based Feature Extraction Methods for Classifying High Dimensional Low Sample Size Data

Research output: Journal Publication › Article › peer-review

3 Citations (Scopus)

Abstract

A high dimensional low sample size (HDLSS) dataset typically contains many features but a limited number of samples. Such datasets are commonly found in domains such as microarray analysis and medical imaging. When the sample size is small, the population probability density function (PDF) of an HDLSS dataset may not be well represented, which makes it difficult to apply feature selection or feature extraction methods for HDLSS data classification. In this paper, we explore the possibility of designing feature selection and feature extraction methods for HDLSS data classification by making loose assumptions about the underlying PDF of an HDLSS dataset. Specifically, we propose to leverage Correlation Explanation (CorEx), a recent unsupervised probabilistic graphical model that assumes a (hierarchical) hidden structure, to generate subsets of features that are conditionally independent. We benchmark the proposed method against frequently cited information-theoretic feature extraction and feature selection methods, including Conditional Infomax Feature Extraction (CIFE), Maximum Relevance Minimum Redundancy (MRMR), Maximization of Mutual Information (MMI), Infomax Independent Component Analysis (Infomax ICA), and Kernel Entropy Component Analysis (KECA). The HDLSS datasets used in this study are the breast cancer datasets of Gravier et al. and West et al., the colon cancer dataset of Alon et al., the leukemia dataset of Golub et al., and the Gisette dataset used by Guyon et al. Experimental results demonstrate that the proposed method shows some improvement in classification performance over MMI and Infomax ICA, and is competitive with MRMR and CIFE.
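
To make one of the baselines named in the abstract concrete, the following is a minimal sketch of greedy MRMR-style feature selection on discrete features. It is an illustration only and is not the authors' implementation or their CorEx-based method. It assumes the common "relevance minus mean redundancy" form of the criterion and a simple plug-in (count-based) mutual information estimate; the function names `mutual_information` and `mrmr_select` and the toy data are hypothetical.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in nats for two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # Sum over observed (x, y) pairs of p(x, y) * log(p(x, y) / (p(x) * p(y))).
    return sum(
        (c / n) * log(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def mrmr_select(columns, labels, k):
    """Greedily pick k feature indices by relevance minus mean redundancy.

    `columns` is a list of feature columns (one list of discrete values per
    feature), `labels` is the list of class labels.
    """
    relevance = [mutual_information(col, labels) for col in columns]
    selected = []
    remaining = list(range(len(columns)))

    def score(j):
        # Relevance to the label, penalized by average MI with chosen features.
        if not selected:
            return relevance[j]
        redundancy = sum(
            mutual_information(columns[j], columns[s]) for s in selected
        ) / len(selected)
        return relevance[j] - redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data (assumed for illustration): 8 samples, 4 binary features.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
columns = [
    [0, 0, 0, 1, 1, 1, 1, 1],  # f0: label with one flipped sample
    [0, 0, 0, 1, 1, 1, 1, 1],  # f1: exact duplicate of f0 (fully redundant)
    [1, 0, 0, 0, 1, 1, 1, 0],  # f2: weaker, but errs on different samples
    [0, 1, 0, 1, 0, 1, 0, 1],  # f3: independent of the label
]

print(mrmr_select(columns, labels, 2))  # prints [0, 2]
```

On this toy data the duplicate feature f1 is as relevant as f0 but is skipped in the second step because its redundancy with f0 outweighs its relevance, so the weaker but less redundant f2 is chosen instead. With very few samples, count-based mutual information estimates like this one become unreliable, which is the difficulty the abstract describes for HDLSS data.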

Original language: English
Article number: 9424570
Pages (from-to): 69157-69172
Number of pages: 16
Journal: IEEE Access
Volume: 9
Publication status: Published - 2021
Externally published: Yes

UN SDGs

This output contributes to the following UN Sustainable Development Goals (SDGs)

  1. SDG 3 - Good Health and Well-being

Free Keywords

  • correlation explanation
  • feature extraction
  • High dimensional low sample size (HDLSS) datasets
  • information theory

ASJC Scopus subject areas

  • General Computer Science
  • General Materials Science
  • General Engineering
