Title:
Deep Generative Models in Omics Data Augmentation for Disease Classification
Document type:
Printed text
Authors:
Boutheyna Khenouche, Author; Imene Madoui; Abderrahim Lakehal, Thesis supervisor
Publisher:
Setif:UFA
Year of publication:
2024
Extent:
1 vol. (115 leaves)
Format:
29 cm
Languages:
English (eng)
Categories:
Theses & Dissertations: Computer Science
Keywords:
Computer Science
Decimal index:
004 - Computer Science
Abstract:
The intersection of bioinformatics and machine learning has opened new possibilities
in cancer research, particularly in the analysis and classification of omics
data, which are frequently limited and diverse. This thesis explores
the application of deep generative models, specifically Generative Adversarial Networks
(GANs), to augment omics data, including both gene and protein expression
data, for improved disease classification. Cancer, one of the most prevalent
and complex diseases, presents significant challenges that require advanced analytical
approaches. Traditional machine learning methods often struggle with high-dimensional
and limited datasets, whereas deep learning offers a promising alternative
because of its ability to learn intricate patterns. Our approach involves collecting
and preprocessing gene expression data from 11,070 samples and protein expression
data from 7,790 samples, generating synthetic data with GANs, and training
classifiers such as Convolutional Neural Networks (CNNs) and Multi-Layer Perceptrons
(MLPs) on the augmented datasets to predict overall survival (OS) and
progression-free interval (PFI). We validate the data generated by our GAN through
biological classification of genes and proteins and through comprehensive data visualization
techniques.
Contents note:
Table of contents
1 Bioinformatics, Machine Learning and Deep Learning
1.1 Introduction
1.2 Bioinformatics
1.2.1 Origin of Bioinformatics
1.2.2 Definition
1.2.3 Bioinformatics Subfields
1.2.4 Bioinformatics Data Types and Databases
1.2.5 Bioinformatics Application Domain
1.2.6 Benefits of Bioinformatics in Healthcare
1.3 Machine Learning
1.3.1 Definition
1.3.2 Machine Learning Techniques
1.4 Deep Learning
1.4.1 Definition
1.4.2 Neural Network in Deep Learning
1.5 Generative Models in Machine Learning and Deep Learning
1.5.1 Generative Adversarial Networks (GANs)
1.5.2 Variational AutoEncoders (VAEs)
1.6 Machine Learning vs. Deep Learning
1.7 Application Domain of ML and DL in Health Care
1.8 Conclusion
2 Disease classification and data augmentation: Literature review
2.1 Introduction
2.2 Omics data, Machine learning and medical research
2.2.1 Omics data: Definition and types
2.2.2 Overview on omics data and machine learning
2.3 Related work in Generative model for data augmentation
2.4 Generative models for spatial data augmentation
2.5 Generative model for omics data augmentation
2.6 Generative adversarial network for omics data augmentation in cancer classification
2.7 Conclusion
3 Deep Generative models for omics data augmentation: Cancer case study
3.1 Introduction
3.2 Analyzing Gene and Protein Data: From Collection to Insight
3.3 Proposed Architecture for Data Augmentation and Classification
3.4 Data Collection
3.5 Data Preprocessing
3.5.1 Gene Expression Data
3.5.2 Protein Expression Data Pre-processing
3.6 Dimensionality Reduction for Gene Expression Data
3.7 Generative adversarial network for data augmentation
3.7.1 Data preparation
3.7.2 Model elaboration
3.7.3 Training and evaluation
3.8 Convolutional Neural Networks
3.8.1 Data Preparation
3.8.2 Model Definition and Building
3.8.3 Model Training and Evaluation
3.9 Biological Classification of Genes and Proteins
3.9.1 Objective and Methodology
3.10 Multi-Layer Perceptron Implementation
3.10.1 Data Preparation
3.10.2 Data Splitting
3.10.3 Model Building and Predictions for OS
3.10.4 Model Building and Predictions for PFI
3.10.5 Synthetic Data
3.11 Conclusion
4 Implementation and evaluation
4.1 Introduction
4.2 Experimental tools and Packages
4.2.1 Python
4.3 Evaluation Metrics for Classification Models
4.3.1 Precision
4.3.2 Recall
4.3.3 Accuracy
4.3.4 F1-Score
4.3.5 Matthews Correlation Coefficient (MCC)
4.4 Datasets
4.5 Implementation and Dataset Evaluation
4.5.1 Pre-processing: Gene Expression
4.5.2 Pre-processing: Protein Expression
4.5.3 Dimensionality Reduction: Gene Expression
4.5.4 Implement Generative Adversarial Networks (GANs)
4.5.5 Implement Convolutional Neural Networks
4.6 Implement Multi-Layer Perceptron
4.7 Biological Classification of Genes and Proteins
4.7.1 Tools and Techniques
4.8 MLP Model for Distinguishing Upregulated and Downregulated Genes/Proteins
4.9 Evaluating Prediction Accuracy and Updating Gene and Protein Expression Data
4.10 Data Visualisation and Interpretation
4.12 Comparison of CNN and MLP Performance on Augmented Gene Expression Data
4.12.1 Gene Expression Data
4.12.2 Protein Expression Data
4.13 Conclusion
.1 First Appendix
.1.1 Pre-processing: Gene Expression
.1.2 Pre-processing: Protein Expression
.2 Second Appendix
.2.1 Implement Generative Adversarial Networks (GANs)
.3 Third Appendix
.3.1 Implementing Predictions and Labeling for Generated Data
Call number:
MAI/0926
Deep Generative Models in Omics Data Augmentation for Disease Classification [printed text] / Boutheyna Khenouche, Author; Imene Madoui; Abderrahim Lakehal, Thesis supervisor. - [S.l.]: Setif:UFA, 2024. - 1 vol. (115 leaves); 29 cm. Languages: English (eng)