Title : |
FEATURES EXTRACTION FROM BIOLOGICAL DATA “CHOOSING ADEQUATE NON-LINEAR METHODS” |
Document type : |
printed text |
Authors : |
Imadeddine Zeghouda, Author ; Mohamed Abderrachid Louail ; Mekroud, Noureddine, Thesis supervisor |
Publisher : |
Setif:UFA |
Publication year : |
2024 |
Extent : |
1 vol. (79 leaves) |
Format : |
29 cm |
Languages : |
English (eng) |
Categories : |
Theses & Dissertations:Computer Science
|
Keywords : |
Bioinformatics
Cancer Classification
Gene Expression
RNA sequences
Dimensionality Reduction
Kernel PCA
SVM
Centroid Meta-Feature |
Decimal classification : |
004 - Computer Science |
Abstract : |
Cancer, a widespread and formidable disease, claims millions of lives globally, making it one
of the most feared afflictions worldwide. Early detection significantly improves treatment
outcomes and survival rates. RNA sequence analysis is a crucial method for early cancer
detection. To address the challenges posed by the high dimensionality of RNA-seq datasets in
cancer classification, the proposed approach uses Kernel PCA (KPCA) and centroid meta-features as
feature extraction methods to reduce the size of the feature space. To evaluate the impact of these
reduction techniques on classification, we use two popular classifiers, namely KNN
and soft-margin SVM. The goal is to classify different types of cancer based on tumor RNA
sequencing (RNA-Seq) gene expression data. Specifically, we investigate kidney renal clear cell
carcinoma (KIRC), breast invasive carcinoma (BRCA), lung squamous cell carcinoma (LUSC),
lung adenocarcinoma (LUAD), and uterine corpus endometrial carcinoma (UCEC).
The results show that KPCA followed by a soft-margin SVM classifier achieved the highest
overall testing accuracy, 97.84%. With centroid meta-features, the best results were also
obtained with the SVM classifier, at an accuracy of 94.97%. We used the most promising
combination, KPCA-SVM, to identify the top genes contributing to the understanding of the
molecular basis of various cancers, facilitating the discovery of potential biomarkers for
early diagnosis. |
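The abstract names centroid meta-features without defining them. A minimal sketch follows, under the assumption (not confirmed by this record) that each sample is re-described by its Euclidean distance to every class centroid, so that the feature count drops from the number of genes to the number of cancer types. The function names and the toy data are hypothetical and are not taken from the thesis.

```python
import math
from collections import defaultdict


def class_centroids(X, y):
    """Return the mean feature vector of each class (fit on training data only)."""
    sums, counts = {}, defaultdict(int)
    for row, label in zip(X, y):
        if label not in sums:
            sums[label] = [0.0] * len(row)
        sums[label] = [s + v for s, v in zip(sums[label], row)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}


def centroid_meta_features(X, centroids, labels):
    """Map each sample to its Euclidean distances to the class centroids.

    Assumed definition of a centroid meta-feature: one distance per class.
    """
    return [[math.dist(row, centroids[label]) for label in labels] for row in X]


# Toy data (hypothetical): 4 samples, 3 "genes", 2 tumor types.
X_train = [[1.0, 0.0, 0.0], [3.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 6.0, 0.0]]
y_train = ["KIRC", "KIRC", "BRCA", "BRCA"]

labels = sorted(set(y_train))                  # fixed column order: BRCA, KIRC
centroids = class_centroids(X_train, y_train)
meta = centroid_meta_features(X_train, centroids, labels)
```

In a pipeline of the kind the abstract describes, the reduced matrix `meta` would then be passed to a classifier such as KNN or a soft-margin SVM in place of the original gene-level features.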
Contents note : |
Table of contents
General introduction
CHAPTER 1 : Bioinformatics
1. Introduction
2. Basic biological concepts
2.1 Bioinformatics definition
2.2 DNA
2.3 RNA
2.4 RNA-seq
2.5 Gene
2.6 Cancer gene
2.7 Gene expression
3. Objectives of Bio-informatics
4. History of bioinformatics
5. Bioinformatics fields of applications
6. Contributions of bioinformatics & Machine learning to cancer research
7. Cancer research and RNA-databases
8. Conclusion
CHAPTER 2 : Machine learning
1. Introduction
2. Machine learning principles
2.1 Definition
2.2 Machine learning process
2.3 Machine learning categories
2.3.1 Supervised learning
2.3.2 unsupervised learning
2.3.3 reinforcement learning
2.3.4 semi-supervised learning
2.4 Why is dimensionality reduction important in machine learning and predictive modeling
2.4.1 approach of dimension reduction
3 Machine learning algorithms
3.1 Supervised algorithms
3.1.1 classification
3.1.2 regression
4 Graphical overview of machine learning algorithms
5 conclusion
CHAPTER 3 : used methods and tools & proposed approach
Part one : methods of features extraction
1. Importance of dimension reduction in machine learning
2. Presentation of the two dimension reduction techniques
2.1 Kernel PCA
2.1.1 The mathematical foundation of kernel PCA
2.1.2 advantages and application fields
2.1.3 Comparison with traditional PCA
2.2 Metafeatures
2.2.1 presenting metafeatures
2.2.2 Different metafeatures methods
3 Efficient Support Vector Machine for rendered linear data
3.1 Linear svm
3.2 Application and advantages of linear svm
3.3 Advantages of soft margin over rigid margin
4 personalization of K-nearest neighbor distances
4.1 distance & metrics
4.2 influence of Choosing adequate distances in KNN performances
Part two : details of proposed approach
1 State of the art and critics
2 Contribution and advantages of proposed approach
3 Kernel functions used for proposed reduction via PCA
4 distances used for KNN classification proposed
5 pseudo algorithm
6 conclusion
CHAPTER 4 : Implementation & Interpretation of results
1. Introduction
2. Working environment
2.1 Kaggle as Hardware environment
2.2 Development tools
3. Tumor gene expression dataset
4. Implementation of proposed approach
4.1 loading the dataset
4.2 standard Preprocessing
4.2.1 splitting the dataset
4.2.2 normalization
4.3 implementation of dimension reduction of proposed algorithm
4.3.1 Feature extraction via Centroid metafeatures
4.3.2 Feature extraction via KPCA
5. Cumulative explained variance
6. Implementation of proposed classifier
6.1 Classification
6.2 Hyper parameters tuning
7 Evaluation Results & Performance Metrics
7.1 Confusion matrix
7.2 Learning curves
7.3 Rate of dimension reduction for each algorithm
7.4 Execution time
8 Biological Interpretation and Discussion of Results
8.1 biological interpretation of extracted principal components
9 Conclusion |
Call number : |
MAI/0898 |