Abstract
This dissertation presents our research on single cell classification using single cell transcriptomics (SCT) data and a purely supervised machine learning (ML) method, the artificial neural network (ANN).
SCT sequencing technology can accurately capture the instantaneous gene expression of every single cell. The 10x SCT technology has realized SCT profiling in a high-throughput and cost-efficient manner. It can produce over 10^9 transcripts from over 10^5 individual cells with ~33,000 gene features when profiling a targeted sample in a single study. However, the classification of single cells with SCT data faces several challenges: the lack of supervised ML methods for single cell classification; the lack of reference datasets for SCT gene expression profiles; the lack of a specific cell ontology for single cell classification; and the characteristics of SCT data, namely large data size, high dimensionality, sparsity (a large proportion of zero counts), and the presence of biological and technical variables. The unsupervised ML methods currently in use generalize poorly and rely on manual inspection for annotation.
To address these needs and challenges, and considering the generalization capability of ANN and its suitability for large, high-dimensional, sparse, and highly variable SCT data, we hypothesized that single cell classification can be done with the supervised ML method ANN and SCT data. We selected peripheral blood mononuclear cells (PBMC) as the SCT data sample for this study. PBMC are conventionally used as a predictive health indicator and comprise five main cell types that are naturally isolated. Accurate classification of SCT data for these five cell types can support early disease diagnosis and accurate blood testing based on SCT analysis.
We prepared 56 standardized reference datasets for PBMC SCT classification and described a multi-dimensional cell ontology with over 163 dimensions for single cell classification, using PBMC as an example.
The initial study demonstrated the proof of concept that single cell classification can be realized with the supervised ML method ANN and standardized SCT data, with an overall accuracy of 89.4%. In follow-up work, we used holdout internal cross-validation, external validation, and added-data validation, together with a cyclical incremental learning method and newly collected independent SCT datasets from four sources, to investigate the baseline for highly accurate PBMC SCT classification. The overall accuracy of the 4-class classification was 93.0%, and the 5-class classification achieved 94.6%. The classification results were analyzed with the PBMC SCT cell ontology and basic statistics. B cells, monocytes, and T cells had classification accuracies greater than 95%. Due to similarities between NK cells and T cell subsets, the classification accuracy of NK cells remained at roughly 75%. The accuracy for dendritic cells was limited by the small proportion of dendritic cells in the training sets.
Building on these results, we studied the effect of various SCT data processing protocols on single cell classification. The findings indicated that datasets from samples with minimal processing protocols (PBMC separation only) helped in the identification of SCT gene expression patterns.
Further, we explored the vulnerability of ANN-SCT-PBMC classifiers using 17 non-representative datasets from five different confounding factor groups and 17 rounds of cyclical four-supersets-swapping external validation experiments. The results revealed that, when trained with sufficient reference datasets, the ANN-SCT-PBMC model was robust and could tolerate a small number of non-representative instances hidden in the training set. Once trained on purified, high-quality reference data, the model can recognize and assess the representativeness of SCT data. Variables that affected model vulnerability included the proportions of reference and non-representative datasets, the distribution of classes in training and testing sets, the similarity of gene expression between cell types and subtypes, and the characteristics of the non-representative datasets.
This research offers a solution to the current “eleven grand challenges” of SCT data analysis. It demonstrates that a purely supervised ML method, ANN, is a viable option for classifying cell types from single cell expression data, with generalization capability and robustness on future datasets. It also reveals that highly accurate single cell classification requires sufficient reference SCT data, generated with precise and strict protocols and labeled with a complete and detailed multi-dimensional cell ontology. Such classification can contribute to future developments in predictive health and hematology.
Date of Award: 15 Jul 2023
Supervisors: Vladimir Brusic (Supervisor), Heshan Du (Supervisor) & Huan Jin (Supervisor)
- single cell classification
- single cell transcriptomics (SCT) data
- supervised machine learning (ML)
- artificial neural network (ANN)
- peripheral blood mononuclear cells (PBMC)
- multi-dimensional cell ontology
- proof of concept
- incremental learning
- model vulnerability
- data representativeness
- model robustness