Ferhat Abbas University Sétif 1, Faculty of Sciences
Author details
Author: Abdelouahab Moussaoui
Documents available by this author



Advancing Medical Image Analysis: Integrating U-Net, SegFormer, and SAM Models for Enhanced Semantic Segmentation / Taranim Attallah
Title: Advancing Medical Image Analysis: Integrating U-Net, SegFormer, and SAM Models for Enhanced Semantic Segmentation
Document type: electronic document
Authors: Taranim Attallah, author; Mohamed Fadhel Mansouri, author; Abdelouahab Moussaoui, thesis supervisor
Publisher: Sétif: UFA1
Year of publication: 2024
Extent: 1 vol. (60 f.)
Format: 29 cm
Language: English (eng)
Categories: Theses & Dissertations: Computer Science
Keywords: Colorectal Cancer; Artificial Intelligence in Healthcare; Deep Learning; Medical Image Segmentation
Decimal index: 004 Computer Science
Abstract:
This study presents an innovative approach to colorectal cancer detection based on the CoNIC Challenge dataset. We developed an ensemble model that uses the U-Net, SegFormer, and SAM architectures for segmentation and classification. Data preprocessing and augmentation techniques are employed to enhance the model's generalization and robustness. Comparative analysis with traditional deep learning models demonstrates the performance of the proposed model in terms of precision, recall, and F1-score, achieving a precision of 98.15% and an accuracy of 96.69%. Furthermore, the model exhibits efficient execution performance, making it suitable for real-world clinical applications. This research contributes to the advancement of medical diagnostics by providing a promising solution for colorectal cancer detection.
Contents:
Abstract i
Résumé ii
Table of contents ix
List of figures x
List of tables xi
Abbreviations xii
Introduction 1
1 Theoretical Background 3
1.1 Introduction 3
1.2 Machine Learning 4
1.2.1 Machine Learning definition 4
1.2.2 Supervised learning 5
1.2.3 Unsupervised learning 5
1.2.4 Reinforcement learning 7
1.3 Deep learning 7
1.3.1 Definition 7
1.3.2 Deep learning architectures 9
1.3.3 U-Net: Convolutional Networks for Biomedical Image Segmentation 11
1.3.4 SegFormer: Transformer-based Segmentation Model 13
1.3.5 SAM (Self-Attention Mechanism) in Image Analysis 14
1.4 Algorithm and Model Evaluation Strategies 16
1.4.1 Introduction to Evaluation Strategies 16
1.4.2 Train-Test Split 16
1.4.3 Cross-validation 16
1.4.4 Holdout Validation 17
1.4.5 Evaluation Metrics 17
1.5 Evaluating Deep Learning vs. Traditional Machine Learning Approaches 17
1.5.1 Data Complexity and Representation 17
1.5.2 Model Complexity and Structure 18
1.5.3 Evaluation Metrics 18
1.5.4 Interpretability 18
1.5.5 Training time and computing capabilities 19
1.6 Preprocessing 19
1.6.1 Data cleaning 20
1.6.2 Data integration 20
1.6.3 Data transformation 20
1.6.4 Data reduction 20
1.6.5 Data Discretization 20
1.7 Conclusion 21
2 Artificial Intelligence in Colorectal Cancer 22
2.1 Introduction 22
2.2 Colorectal Cancer 22
2.2.1 Anatomy and Physiology of the Colon and Rectum 22
2.2.2 Epidemiology of Colorectal Cancer 24
2.2.3 Current Diagnostic Challenges 25
2.3 AI in Colorectal Cancer 26
2.3.1 Machine learning in Colorectal Cancer 26
2.3.2 Deep learning in Colorectal Cancer 26
2.3.3 Challenges and limitations 27
2.4 Histopathology Image Generation 28
2.4.1 Histopathological Semantic Segmentation 29
2.5 Related works 29
2.6 Conclusion 33
3 Experiments and Results 34
3.1 Introduction 34
3.2 Methodology 34
3.2.1 Dataset and Preprocessing Contributions 34
3.2.2 Tools and Frameworks 36
3.3 Experimental Results and Model Evaluation 40
3.3.1 The Proposed Approach and Model Architecture 40
3.3.2 Training Dynamics 42
3.3.3 Results and Analysis 42
3.4 Discussion 44
3.5 Challenges 45
3.5.1 Dataset Complexity 45
3.5.2 Computational Constraints 46
3.5.3 Model Integration 46
3.6 Future Works and Perspectives 46
3.7 Conclusion 47
Call number: MAI/0953
Copies (1)
Barcode: MAI/0953; Call number: MAI/0953; Medium: dissertation; Location: Science Library; Section: English; Availability: Available
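The record above describes an ensemble of U-Net, SegFormer, and SAM for segmentation. One common way to combine segmentation outputs, shown here as an illustrative stdlib-only sketch rather than the thesis's actual fusion method, is a pixel-wise majority vote over the binary masks each model produces:

```python
# Pixel-wise majority vote over binary segmentation masks.
# Illustrative sketch only; the thesis's actual ensemble strategy may differ.
def majority_vote_masks(masks):
    """Combine equally sized binary masks (lists of lists of 0/1).

    A pixel is foreground when more than half of the models mark it."""
    n = len(masks)
    rows, cols = len(masks[0]), len(masks[0][0])
    return [
        [1 if sum(m[r][c] for m in masks) * 2 > n else 0 for c in range(cols)]
        for r in range(rows)
    ]
```

With three model masks, a pixel survives into the fused mask when at least two models agree on it, which tends to suppress single-model false positives.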
Analyse de sentiment dans les réseaux sociaux, nouvelle stratégie à base d'ensemble de classifieurs / Keddad, Walid
Title: Analyse de sentiment dans les réseaux sociaux, nouvelle stratégie à base d'ensemble de classifieurs
Document type: printed text
Authors: Keddad, Walid; Abdelouahab Moussaoui, thesis supervisor
Publisher: Setif: UFA
Year of publication: 2017
Extent: 1 vol. (64 f.)
Format: 29 cm
Language: French (fre)
Categories: Theses & Dissertations: Computer Science
Keywords: Data Engineering; Web Technologies; Social Networks; Sentiment Analysis
Decimal index: 004 Computer Science
Abstract:
Twitter sentiment analysis, the process of automatically extracting the sentiment conveyed by Twitter data, is a field that has seen a dramatic increase in research in recent years. The goal of this master's thesis is to develop a new machine learning model based on ensemble learning to classify Twitter messages with respect to their sentiment. Sentiment can be divided into three classes: positive, negative, and neutral.
To compare against our new model, several machine learning methods were used during the experiments: Artificial Neural Network, Multinomial Naive Bayes, Support Vector Machines, Random Forest, Logistic Regression, and others. In addition, we compared different techniques for preprocessing natural language in order to find those that have an impact on building accurate classifiers. To this end, we applied the Bag-of-Words model (vector of unigrams) and the Bag-of-N-grams model (vector of bigrams and vector of trigrams) to represent text data in a suitable numeric format. The Bag-of-unigrams and Bag-of-bigrams models showed the best results for all methods and positively influenced the overall accuracy.
The best performance was achieved by our new model, for both two-class (positive and negative) and three-class (positive, negative, and neutral) classification. Our new model achieved an accuracy of 90.06% on two-class classification and 78.21% on three-class classification.
Contents:
1 Introduction 1
1.1 Research Objectives 1
1.2 Thesis Organization 2
2 Background 3
2.1 Social Media 3
2.1.1 Twitter 4
2.2 Machine Learning 5
2.3 Machine Learning process 5
2.4 Machine Learning types 6
2.5 Learning Algorithms 7
2.5.1 Artificial Neural Networks 7
2.5.2 Naive Bayes 10
2.5.3 Decision Trees 11
2.5.4 Support Vector Machines 12
2.5.5 Ensemble methods 15
3 Literature Review 21
3.1 Sentiment Analysis 21
3.2 Sentiment Components 22
3.3 Levels of Study 23
3.4 Sentiment Analysis Difficulties 23
3.5 Different Methods 24
3.5.1 Lexicon based method 24
3.5.2 Machine learning method 25
3.6 Related Work 25
4 Datasets and implementation frameworks 29
4.1 Data Collection and Preprocessing 29
4.1.1 Description of Datasets 29
4.1.2 Preprocessing 33
4.1.3 Features Extraction 37
4.1.4 Final Data representation 41
4.2 Development environment 44
4.2.1 Python 44
4.2.2 Jupyter Notebook 44
4.2.3 Scikit-learn 45
4.2.4 Pandas 45
5 Experiments and results 46
5.1 Implementation 46
5.2 Proposed ensemble classifier 47
5.3 Results 47
5.3.1 Two Classes: Positive and Negative 47
5.3.2 Three Classes: Positive, Negative and Neutral 51
6 Discussion 55
6.1 Two Classes: Positive and Negative 55
6.2 Three Classes: Positive, Negative and Neutral 56
7 Conclusion 58
Call number: MAI/0205
Online: https://drive.google.com/file/d/1xz1H6CxpjAAAQRHw1Bdejyxbgg4aoJVH/view?usp=shari [...]
Copies (1)
Barcode: MAI/0205; Call number: MAI/0205; Medium: dissertation; Location: Science Library; Section: French; Availability: Available
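The abstract above represents tweets as Bag-of-Words and Bag-of-N-grams vectors before classification. A minimal stdlib-only sketch of that featurization idea (the thesis itself used scikit-learn; the vocabulary and texts below are invented placeholders):

```python
# Minimal bag-of-n-grams featurization over a fixed vocabulary.
# Illustrative only; the thesis used scikit-learn's vectorizers.
def ngrams(tokens, n):
    """All contiguous n-token sequences, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bag_of_ngrams(text, vocabulary, n=1):
    """Count each vocabulary term's n-gram occurrences in the text."""
    tokens = text.lower().split()
    counts = {}
    for g in ngrams(tokens, n):
        counts[g] = counts.get(g, 0) + 1
    return [counts.get(term, 0) for term in vocabulary]
```

With `n=1` this is the Bag-of-unigrams representation; with `n=2` it is the Bag-of-bigrams representation the abstract reports as most effective.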
Attention-Based Deep Convolutional Neural Network Versus Transfer Learning for Medical Image Classification and Disease Diagnosis / Maroua Azouz
Title: Attention-Based Deep Convolutional Neural Network Versus Transfer Learning for Medical Image Classification and Disease Diagnosis
Document type: printed text
Authors: Maroua Azouz, author; Nour Deghoul, author; Abdelouahab Moussaoui, thesis supervisor
Year of publication: 2022
Extent: 1 vol. (94 f.)
Format: 29 cm
Language: French (fre)
Categories: Theses & Dissertations: Computer Science
Keywords: Transfer Learning; Image Classification; ViT
Decimal index: 004 Computer Science
Abstract:
Cancer is a disease in which some of the body's cells grow uncontrollably and spread to other parts of the body. Recently, the two most common such diseases in our city, Setif, have been skin cancer and nodular thyroid cancer. "Skin cancer is the cancer you can see" is the motto of the Skin Cancer Foundation. When unusual growths appear on a patient's skin, the patient cannot tell whether they are cancerous or merely simple skin ulcers, so in most cases the disease is not detected until its advanced stages. Most patients with thyroid nodules fear that these nodules will turn into cancerous masses, and with the spread of this disease in recent years, this question has become a concern for all patients.
Recently, much attention has focused on deep learning and the classification of medical images by building stable models for computer-aided diagnosis [8], most often using convolutional neural networks. We propose a model based on the attention mechanism, which focuses on the most important features in the image. The attention mechanism contributes to increasing the effectiveness of the model and achieving better classification. The aim of this study is to improve the accuracy of a computer-aided diagnosis approach that medical professionals can easily use as an aid.
In this thesis, we propose the use of transformer and transfer-learning mechanisms to detect and classify skin cancer and the type of thyroid nodules (benign or malignant). We extracted features with three types of pre-trained models, EfficientNetB7, VGGNet16, and Xception, used as feature extractors, then fed the features into a Vision Transformer (ViT) and used neural networks for classification. The proposed approach achieved an accuracy of 83.74% on the skin dataset and 76.18% on the thyroid dataset, while a similar CNN model we propose achieves 92.95% accuracy on the skin dataset and 88.18% on the thyroid dataset. We also applied transfer learning to improve the pre-trained model, evaluating the same datasets on ResNet50 and our new ResNet50; these models achieve 67.66% and 98.89% respectively on the skin cancer dataset, and 67.03% and 90.97% respectively
on the thyroid dataset.
Call number: MAI/0587
Online: https://drive.google.com/file/d/1x9lI7fG4IlSYzf0xGs-E7AL9ynLyGu1h/view?usp=share [...]
Copies (1)
Barcode: MAI/0587; Call number: MAI/0587; Medium: dissertation; Location: Science Library; Section: English; Availability: Available
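The abstract above follows a common transfer-learning pattern: a frozen pre-trained backbone extracts features, and only a lightweight head is fit on top. A stdlib-only toy of that pattern (`frozen_extractor` is a hypothetical stand-in for EfficientNetB7/VGGNet16/Xception, and the nearest-centroid head stands in for the thesis's ViT-plus-neural-network classifier):

```python
# Toy "frozen extractor + trainable head" pipeline, for illustration only.
# frozen_extractor is a stand-in for a pre-trained backbone; nothing here
# reproduces the thesis's actual models or data.

def frozen_extractor(x):
    # Fixed, non-trainable mapping from a raw sample to a feature vector.
    return (sum(x), max(x) - min(x))

def fit_centroids(samples, labels):
    """Fit the head: one feature-space centroid per class."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        f = frozen_extractor(x)
        s = sums.setdefault(y, [0.0] * len(f))
        for i, v in enumerate(f):
            s[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: tuple(v / counts[y] for v in s) for y, s in sums.items()}

def predict(centroids, x):
    """Classify by nearest centroid in the frozen feature space."""
    f = frozen_extractor(x)
    return min(centroids,
               key=lambda y: sum((a - b) ** 2 for a, b in zip(f, centroids[y])))
```

Only `fit_centroids` sees the training labels; the extractor never changes, which is exactly what makes transfer learning cheap compared with training a backbone from scratch.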
Title: Biomedical Data Analysis by Deep Architectures
Document type: printed text
Authors: Tcheir, Abir, author; Abdelouahab Moussaoui, thesis supervisor
Publisher: Setif: UFA
Year of publication: 2021
Extent: 1 vol. (59 f.)
Format: 29 cm
Language: French (fre)
Categories: Theses & Dissertations: Computer Science
Keywords: COVID-19; chest X-ray; Deep Learning; Attention maps
Decimal index: 004 Computer Science
Abstract:
COVID-19 causes lung inflammation and lesions, and chest X-ray images are remarkably suitable for differentiating the new disease from patients with other lung diseases. In this paper, we propose a computer model to classify X-ray images of patients diagnosed with COVID-19. Two datasets are used in this experiment. The first is a set of 9545 X-ray images, including 4045 images with confirmed COVID-19 disease and 5500 non-COVID-19 images. The second is a set of 13677 X-ray images, including 3424 images with confirmed COVID-19 disease, 1345 images with confirmed viral pneumonia, and 8908 images of normal conditions. The results suggest that deep learning with X-ray imaging may extract significant biomarkers related to COVID-19 disease. This work considers the well-known pre-trained architectures EfficientNetB0, DenseNet121, VGG16, ResNet50, InceptionV3, and MobileNetV2 for the experimental evaluation.
The performance of the considered architectures is evaluated by computing the common performance measures. The experimental evaluation confirms that the EfficientNetB0 pre-trained transfer-learning-based model offered the best classification accuracy: 98.40% on image dataset 1 and 97.20% on image dataset 2. Attention maps were also generated for each prediction, a key explanatory step aimed at increasing confidence in the final decision. Côte titre : MAI/0528 En ligne : https://drive.google.com/file/d/1pF3-IE1TmVOr85fp2nF7Z0Wrn9T3QRSk/view?usp=shari [...] Format de la ressource électronique : Biomedical Data Analysis by Deep Architectures [texte imprimé] / Tcheir, Abir, Auteur ; Abdelouahab Moussaoui, Directeur de thèse . - [S.l.] : Setif:UFA, 2021 . - 1 vol (59 f .) ; 29 cm.
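The attention maps mentioned in this record indicate which image regions most influenced a prediction. The thesis record does not detail the implementation; the following framework-free sketch only illustrates the general class-activation idea behind such maps (channel weights from mean gradients, a weighted sum of feature maps, then a ReLU), with made-up toy numbers:

```python
def attention_map(feature_maps, gradients):
    """Weight each feature map by its mean gradient, sum, and clamp at zero.

    feature_maps, gradients: lists of K HxW grids (plain nested lists).
    """
    k = len(feature_maps)
    h, w = len(feature_maps[0]), len(feature_maps[0][0])
    # Channel weight: average of the gradient over all spatial positions.
    weights = [sum(sum(row) for row in gradients[c]) / (h * w) for c in range(k)]
    # Weighted sum of the feature maps, rectified (ReLU) as in Grad-CAM-style methods.
    return [[max(0.0, sum(weights[c] * feature_maps[c][i][j] for c in range(k)))
             for j in range(w)] for i in range(h)]

# Toy 2-channel, 2x2 example (invented numbers, illustration only).
fmaps = [[[1.0, 0.0], [0.0, 2.0]],
         [[0.0, 1.0], [1.0, 0.0]]]
grads = [[[0.4, 0.4], [0.4, 0.4]],      # mean weight 0.4
         [[-0.2, -0.2], [-0.2, -0.2]]]  # mean weight -0.2
cam = attention_map(fmaps, grads)
```

Regions where the map stays positive are the ones the (hypothetical) network relied on for its decision.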
Exemplaires (1)
Code-barres Cote Support Localisation Section Disponibilité
MAI/0528 MAI/0528 Mémoire Bibliothèque des sciences Français Disponible
Titre : Catégorisation Automatique Contextuelle de Documents Semi-structurés Multilingues Type de document : texte imprimé Auteurs : Gadri, said, Auteur ; Abdelouahab Moussaoui, Directeur de thèse Editeur : Setif:UFA Année de publication : 2016 Importance : 1 vol (207 f .) Format : 29 cm Catégories : Informatique Mots-clés : Catégorisation Automatique Contextuelle Documents Semi-structurés Multilingues Résumé : Résumé
Text categorization is a very important task in the text mining process. It consists of assigning a set of texts to a set of categories according to their topics, using the learning algorithms known in the field of artificial intelligence. Our study of this research area has allowed us to propose several solutions and make a number of contributions, notably: proposing a simple, fast, and effective algorithm for identifying the language of a text in a multilingual corpus; developing an improved algorithm for finding the roots of Arabic words based on a purely statistical approach, whose main objective is to reduce the size of the term vocabulary and thereby improve the quality of the resulting text categorization and increase search efficiency in information retrieval; developing a new multilingual stemmer that is more general and independent of any language; and applying a new panoply of pseudo-distances to categorize the texts of a large corpus (Reuters21578). All of these solutions were the subject of scientific papers published in indexed international conferences and journals. Note de contenu : Table of Content
Acknowledgments I
Dedications II
Abstract III
Author Biography IV
Introduction 1 - 6
1. Scope of the work ………………………………………………………….……………….1
2. Problematic …………………………………………………………………...…………….2
3. Our contribution ………………………………………………………………………….2-3
3.1. Language identification …………………………………………………….…………….2
3.2. Contextual text categorization ………………………………………………...………….3
3.3. Arabic Stemming …………………………………………………………...…………….3
3.4. Multilingual Stemming …………………………………………………………..……….3
4. Thesis organization ……………………………………………………………...………….4
4.1. A theoretical part ………………………………………………………………………….4
4.2. Contributions part …………………………………………………………..…………….4
5. Author Publications ……………………………………………………………...………….4
Part I: Theoretical Part (1 – 160)
Chapter 1: Data Mining: Basic Concepts and Tools (7 – 40)
1.1.Introduction ……………………………………………………………………………..…7
1.2.Knowledge data discovery from data …………………………………………………...…8
1.3.Some popular definitions of Data Mining …………………………………………………8
1.4.What is Data Mining? An introductory Example …………………………………...……10
1.5.What is not Data Mining …………………………………………………………………11
1.6.Data mining and knowledge Discovery ………………………………….………………12
1.7.Where Data Mining can be placed? Origins ………………………………..……………12
1.8.Data mining motivations …………………………………………………………………14
1.9.Data mining Tasks ………………………………………………………….……………17
1.9.1. Predictive Tasks ………………………………………………………...……………17
1.9.2. Descriptive tasks …………………………………………………………..…………17
1.10. A classic example of data mining use …………………..……………………………17
1.11. Data mining applications ……………………………………………….……………18
1.11.1. Business Problems ……………………………………………………………...……18
1.11.2. Other problems for data mining …………………………………...…………………19
1.12. Principal tasks of Data Mining ………………………………………………….……19
1.12.1. Classification (predictive task) ………………………………………………………19
Some classification applications ………………………………………………………..……20
1.12.2. Clustering (descriptive Task) …………………………………………...……………21
Some clustering applications ………………………………………………………...………22
1.12.3. Association Rule Mining (descriptive task) …………………….……………………23
Some association rule discovery applications …………………………….…………………24
1.12.4. Regression (predictive task) …………………………………………………………25
Some regression applications …………………………………………………………..……25
1.12.5. Anomaly Detection/Deviation analysis (descriptive task) ……………..……………26
Some anomaly detection applications ………………………………………………..………26
1.12.6. Sequential Pattern Mining (descriptive task) ……………………………...…………26
Some pattern discovery applications …………………………………………………………27
1.12.7. Time Series Prediction (Forecasting/predictive task) ………………….……………28
1.12.8. Decision Making …………………………………………………………………..…28
1.13. Data Mining Project Cycle …………………………………………………...………29
1.14. Types of data sets used in Data mining field …………………………...……………33
1.15. Major Vendors and Products …………………………………………………………37
1.16. Text Mining, Web Mining, XML Mining: New applications of Data Mining……….38
1.16.1. Text Mining ……………………………………………………………………..……38
1.16.2. Web Mining ………………………………………………………………..…………38
1.16.3. XML Mining ……………………………………………………………….…………39
1.17. Future perspectives in Data Mining …………………………………….……………39
1.18. Summary …………………………………………………………………………..…40
Chapter 2: Fundamentals of text mining (41 – 70)
2.1. Introduction ……………………………………………………………………….….….42
2.2. Some definitions of Text Mining ……………………………….……………………….42
2.3. Data Mining Vs Text Mining ………………………………………………………...….42
2.4. Structured and unstructured data ………………………………………………..……….43
2.5. Why Text Mining- Motivation ………………………………………………….……….44
2.6. Where Text Mining can be placed? ………………………………………….………….44
2.7. Why Text Mining is hard? Major difficulties ……………………………………..…….45
2.8. Text Mining Applications ……………………………………………………………….46
2.8.1. Document classification ……………………………………………………………….46
2.8.2. Information retrieval ……………………………………………………….………….47
2.8.3. Clustering and organizing document ………………………………………………….48
2.8.4. Information extraction ……………………………………………………………..….49
2.9. Architecture of text mining systems ……………………………………………….……49
2.9.1. General architecture ……………………………………………………………….…49
2.9.2. Functional Architecture ………………………………………………………………50
2.10. Text Mining process step by step …………………………………………….………52
2.10.1. Collecting documents ……………………………………………………………..…52
2.10.2. Text Preprocessing tasks ………………………………………………………..……54
2.10.2.1. Document Standardization …………………………………………….…………54
2.10.2.2. Tokenization …………………………………………………………………...…54
2.10.2.3. Simple Syntactic Analysis ……………………………………….………………54
2.10.2.4. Advanced Linguistic Analysis …………………………………...………………55
a. Part Of Speech (POS) tagging ……………………………………………….………55
b. Syntactical parsing ………………………………………………………………...…57
c. Shallow Parsing ………………………………………………………………...……58
d. Word Sense Disambiguation …………………………………………………………59
2.10.2.5. Lemmatization and stemming ……………………………………………………59
a. Lookup-based Stemming ………………………………………………………..……60
b. Rule-based stemming (affix removal stemming) ………………………………………60
c. Inflectional stemming (lemmatization) ………………………………………………60
d. Stemming to a root ………………………………………………………………...…61
2.10.3. Feature Generation and data representation …………………….……………………61
2.10.3.1. Global dictionary vs. local Dictionary ………………………………...…………61
2.10.3.2. Features reduction ……………………………………………………………..…62
2.10.3.3. Data representation ………………………………………………………………62
a. Binary model …………………………………………………………………………62
b. Three values model ………………………………………………………..…………62
c. Term frequency model (tf) ……………………………………………………………63
d. Tf-idf model (Term Frequency-Inverse Document Frequency) …………...…………63
2.10.3.4. Multiword Features ………………………………………………………………64
2.10.3.5. Labels for the Right Answers ……………………………………….……………64
2.10.3.6. Named Entity Recognition (NER) …………………………………….…………65
2.10.4. Feature selection ……………………………………………………………...……65
2.10.5. Data Mining (pattern discovery) …………………………………………...………65
2.10.5.1. Classification ………………………………………………………………….…66
2.10.5.2. Clustering …………………………………………………………………..……66
a. Jaccard Coefficient ……………………………………………………..……………67
b. Cosine Similarity ………………………………………………………..……………68
c. Cosine Similarity and TF-IDF ………………………………….……………………68
2.10.5.3. Sentiment Analysis ……………………………………………………….………69
2.11. Summary …………………………………………………………………...………70
Chapter 3: Automatic Text Categorization (71 – 96)
3.1. Preface ……………………………………………………………………………...……72
3.2. Definition of the problem ……………………………………………..…………………73
3.2.1. Single-Label versus Multilabel Categorization …………………..……………………73
3.2.2. Document-Pivoted versus Category-Pivoted Categorization …………………………73
3.3. Applications of text categorization ………………………………………………...……73
3.3.1. Indexing of Texts Using Controlled Vocabulary …………………………………..…74
3.3.2. Document Sorting and Text Filtering …………………………………………………74
3.3.3. Hierarchical Web Page Categorization …………………………………………..……74
3.4. Particular difficulties of text categorization …………………………………………..…74
3.4.1. Big size of vectors …………………………………………………………………..…75
3.4.2. Imbalance of classes …………………………………………………………..………75
3.4.3. Ambiguity of terms …………………………………………………………….………75
3.4.4. Problem of synonymy ……………………………………………………….…………75
3.4.5. Subjectivity of decision ………………………………………………………...………75
3.5. How to categorize a monolingual text: the general process of TC ………..……………76
3.6. The problem of document representation ………………………………………….……77
3.6.1. Choice of document features (terms) …………………………………………….……77
3.6.1.1. Representation with bag of words ………………………………………………...…78
3.6.1.2. Representation with Sentences ………………………………………………………78
3.6.1.3. Representation with lexical roots (Stems) ……………………………………...……78
3.6.1.4. Representation with lemmas ……………………………………………………...…78
3.6.1.5. Representation Based on N-grams …………………………………………..………78
3.6.1.6. Conceptual representation ……………………………………………………..……79
3.6.2. Terms coding ……………………………………………………………………..……79
3.6.2.1. Binary model …………………………………………………………...……………79
3.6.2.2. Three values model ……………………………………………….…………………79
3.6.2.3. Term frequency code (tf) ……………………………………………………….……80
3.6.2.4. Tf_idf coding (Term Frequency-Inverse Document Frequency) ………………..…..80
a) Document Frequency (dft) …………………………………………………………...……80
b) Inverse Document Frequency (idft ) ………………………………………….……………81
c) tf-idf coding (Term Frequency-Inverse Document Frequency) ………………...…………81
d) Variants of tf and tf-idf weighting …………………………………………………………82
3.6.2.5. TFC coding ……………………………………………………………………….…82
3.6.2.6. LNU Coding …………………………………………………………………………82
3.6.2.7. The entropy …………………………………………………………..……………………….83
3.7. Feature reduction ………………………………………………………………….…83
3.7.1. Local reduction ………………………………………………………….……….……84
3.7.2. Global reduction ………………………………………………………………………84
3.8. Dimensionality reduction by feature selection ………………………………………84
3.8.1. Document Frequency DF …………………………………………………………...…84
3.8.2. Mutual Information (MI) …………………………………………………………...…85
3.8.3. Information Gain (IG) …………………………………………………………………85
3.8.4. χ² Statistic (Chi-square / Chi-2) ……………………………………….………………86
3.8.5. Weighted Log Likelihood Ratio (WLLR) …………………………….………………87
3.9. Dimensionality Reduction by Feature Extraction …………………………….…………87
3.10. Knowledge engineering approach to TC ………………………………………………87
3.11. Machine learning approach to TC …………………………………………..…………88
3.12. Using unlabeled data to improve classification ……………………………..…………88
3.13. Multilingual text categorization ……………………………………………….………89
3.13.1. Importance of multilingual categorization ………………………………..…………89
3.13.2. Multilingual information retrieval …………………………………………...………91
3.13.2.1.Approaches based on automatic translation ………………………………...………91
3.13.2.2.Approaches based on multilingual thesaurus ………………………………….……92
3.13.2.3.Approaches based on the use of dictionaries. ………………………………………92
3.13.3. Proposed solutions for multilingual text categorization ………………………..……92
3.13.3.1.Scheme 1: The trivial scheme ………………………………………………………92
3.13.3.2.Scheme 2: Using a single language for learning. …………………………...………93
3.13.3.3.Scheme 3: Mix the training sets. ……………………………………………………94
3.13.4. The main phases of multilingual text categorization ………………………...………95
3.13.4.1.Language Identification ………………………………………………………….…95
3.13.4.2.Automatic Translation ………………………………………………………………95
3.13.4.3.Text categorization ……………………………………………………………….…96
3.14. Summary ……………………………………………………………………………96
Chapter 4: Machine Learning Algorithms for text categorization (97 – 140)
4.1. Preface ………………………………………………………………………...…………98
4.2. The text classification problem ……………………………….…………………………98
4.3. Learning algorithms used in text categorization …………………...……………………99
4.3.1. Naïve Bayes classifier …………………………………………………………………99
4.3.1.1. The multinomial model ……………………………………………...………………99
4.3.1.2.The Bernoulli model ………………………………………………………..………103
4.3.1.3.Time complexity of NB classifier ………………………………………..…………105
4.3.1.4.Linear classifiers ……………………………………………………………………105
4.3.2. Rocchio classifier ……………………………………………………………..….…105
4.3.3. k nearest neighbor classifier ……………………………………………...…………108
4.3.3.1. Similarity measures used with kNN algorithm ……………………………….……109
4.3.3.2. Probabilistic kNN ………………………………………………………………..…110
4.3.3.3. Performance of kNN classifier …………………………………………………..…110
4.3.4. Support Vector Machine classifier (SVM classifier) …………………………….…111
4.3.4.1. The linearly separable case ……………………………………………………..…111
4.3.4.2. Nonlinearly separable case and noisy data ………………………………….……115
a) Large margin classification for noisy data ………………………………..………….…115
b) Multiclass SVM …………………………………………………………………………117
c) Nonlinear SVM ……………………………………………………………………….…117
d) What kinds of functions are valid kernel functions? ……………………………………119
e) The new optimizing problem ……………………………………………………………119
f) Examples of kernel functions ……………………………………………………….……120
4.3.4.3. Software Implementations …………………………………………………………121
4.3.5. Decision Tree classifiers. …………………………………………………….………122
4.3.5.1. Algorithm presentation …………………………………………………….………122
4.3.5.2. When we use decision tree learning? ………………………………………………124
4.3.5.3. ID3 algorithm ………………………………………………………………………124
a) Which Attribute Is the Best Classifier? …………………………………….……………124
b) How to create the decision tree using ID3 algorithm? ……………………….…………128
4.3.5.4. C4.5 algorithm ………………………………………………………………..……130
a) How C4.5 works? ………………………………………………………………..………130
b) How are tests chosen? …………………………………………………...………………131
c) How is tree-growing terminated? ………………………………..………………………131
d) How are class labels assigned to the leaves? ……………………………………………131
4.3.6. Decision Rule Classifiers ……………………………………………...……………131
4.3.7. Regression Methods …………………………………………………………...……132
4.3.8. Artificial Neural Networks …………………………………………………..………132
4.3.9. Classifier Committees: Bagging and Boosting ……………………………..………133
A general boosting procedure ………………………………………………...…….134
4.4. Improving classifier performance ……………………………………………….……135
4.5. Evaluation of text classifiers …………………………………………………….……136
4.6. Performance Measures …………………………………………………………..……136
4.6.1. Recall, Precision and F-measure …………………………………………..…………136
4.6.2. Noise and Silence ……………………………………………………………...……137
4.6.3. Micro and Macro Average ……………………………………………….…………138
4.7. Benchmark Collections …………………………………………………….…………139
4.8. Comparison among Classifiers ………………………………………….……………139
4.9. Summary …………………………………………………………………………...…140
Chapter 5: Categorization of Semistructured documents (141 – 160)
5.1. Introduction ……………………………………………………………………….……142
5.2. The web and HTML documents …………………………………….…………………142
5.3. XML Semistructured documents ………………………………………………………142
5.3.1. From a flat document to a structured document …………………………..…………142
5.3.2. XML language ……………………………………………………………………..…143
5.3.3. XML Document ………………………………………………………………………143
5.3.3.1. The DTD (Document Type Definition) ………………………………..……………144
5.3.3.2.XML DOM (XML Document Object Model) ……………………………..…………145
5.3.3.3.XPATH ………………………………………………………………..………….…145
5.3.3.4. Types of XML documents …………………………………….………………….…145
a. XML documents oriented texts ……………………………………………………..……145
b. XML documents oriented data …………………………………………………..………146
5.3.4. Semantic of tags …………………………………………………………………..…146
5.3.4.1. Hard tags ……………………………………………………………………..……146
5.3.4.2. Soft tag ……………………………………………………………………..………147
5.3.4.3.Jump tags …………………………………………………………………...………147
5.4. XML Mining …………………………………………………………………...………147
5.5. Structured information retrieval ……………………………..…………………………148
5.5.1. Problems related to the representation ………………………………………………148
5.5.2. The need to adapt the old models ……………………………….……………………148
5.5.3. The INEX initiative and the proposed solutions ………………………..……………149
5.5.3.1. The CO Queries (Content Only) ……………………………………………...……149
5.5.3. 2. CAS queries (Content and Structure) ………………………………………...……149
5.5.3.3. Evaluation of structured RI system ……………………………………………...…149
a. The component coverage dimension ………………………………………………..……149
b. The topical relevance dimension …………………………………………………………149
5.5.4. Problem of semi-structured documents heterogeneity …………………………….…150
5.5.5. Querying and heterogeneity. …………………………………………………………151
5.5.6. Conversion of document formats ………………………………………………….…153
5.6. Categorization of XML Semistructured documents ……………………………...……153
5.6.1. Approaches based on the structure and the content …………………………….……153
5.6.1.1.[Yang and Zhang, 2007] approach …………………………………………………153
5.6.1.2. [de compos et al, 2007] approach …………………………………………………154
5.6.2. Approaches based on the structure only ………………………………………..……157
5.6.2.1. [Zaki and Aggarwal, 2003] approach ………………………………….…………157
5.6.2.2. [Garboni et al., 2005] ………………………………………………………...……158
5.7. Summary ……………………………………………………………………………….160
Part II: Contribution Part
Chapter 6: Identifying the language of a text in a collection of multilingual documents
6.1. Introduction …………………………………………………………………………….162
6.2. State of the art ………………………………………………………………………….162
6.2.1. Language Identification Approaches …………………………...……………………162
6.2.1.1. The linguistic approach …………………………………………...……………….162
6.2.1.2. The lexical approach ……………………………………………………………….162
6.2.1.3. The grammatical approach ………………………………………….…………….162
6.2.1.4. The statistical approach …………………………………………………..……….162
6.2.2. Principle of Text Segmentation into N-grams of Characters ………………...………163
6.2.3. Advantages of Text Segmentation into N-grams …………………………….………163
6.2.4. Methods Based on N-grams for Language Identification ……………………………163
6.2.4.1. Nearest Neighbors Methods …………………………………………………..……163
The Distance of Beesly …………………………………………………………..….163
The Distance of Cavenar and Trenkle ……………………………………..………164
The Distance of Kullbach-Leibler (KL) ……………………………………….……164
The Distance of khi2 (χ 2) ……………………………………………………...…….164
6.2.4.2. Conventional methods used in categorization ………………………………..……165
6.3. Our proposed method …………………………………………………………………..165
6.4. Experimentations …………………………………………………………………...….166
6.4.1. Training and Test Corpus …………………………………………..……………….166
6.4.2. Pre-processing Performed on Training and Testing corpus …………….……………166
6.4.3. Performed Processing ………………………………………………………….…….166
6.5. Evaluation of obtained results ………………………………………………………….167
6.6. Conclusion and perspectives ……………………………………………………..…….169
Chapter 7: Contextual Categorization Using New Panoply of Similarity Metrics
7.1. Introduction ………………………………………………………….…………………171
7.2. State of the Art …………………………………………………………………………171
7.2.1. Language Identification and Documents Categorization ……………………………171
7.2.1.1. Language Identification ……………………………………………………………171
7.2.1.2. Automatic Categorization of Texts …………………………………………………171
7.2.2. Approaches of Texts Representation ………………………………………..………172
7.2.2.1. Representation with Bag of Words …………………………………………………172
7.2.2.2. Representation Based on N-grams …………………………………………………172
7.2.3. Methods of Texts Categorization …………………………………………….………173
7.2.3.1. Conventional Method ………………………………………………………………173
7.2.3.2. Nearest Neighbors Methods ………………………………………………..………173
7.2.4. Metrics of Similarity Used in Language Identification ………………………………173
7.3. Our Proposed Approach ……………………………………………………………..…174
7.3.1. Application of Metrics of Language Identification in Text Categorization ……….…174
7.3.2. Presentation of the New Method ………………………………………………..……174
7.4. Experimentations ………………………………………………………………………175
7.4.1. Training and Test Corpus …………………………………………………………… 175
7.4.2. Pre-processing Performed on Training and Testing corpus ……………….…………175
7.4.3. Performed Processing ………………………………………………………..………176
7.5. Evaluation of obtained Results ……………………………………………………...…176
7.5.1. Segmentation phase (Tokenization) ………………………………………..……… 176
7.5.2. Learning phase …………………………………………………………………….…177
7.5.3. Interpretation of the obtained results …………………………………...……………179
7.6. Conclusion and perspectives ………………………………………………...…………179
Chapter 8: Arabic Text Categorization: An Improved Stemming Algorithm to Increase
the Quality of Categorization (180 – 188)
8.1. Introduction ……………………………………………………………………...……..181
8.2. Related Work …………………………………………………………………………..182
8.3. The Proposed Algorithm ……………………………………………...………………..183
8.4. Experimentations and Obtained Results ………………………………...……………..184
8.4.1. The Used Dataset …………………………………………………………...………..184
8.4.2. The Obtained Results …………………………………………………….…………..185
8.5. Comparison with Other Algorithms ……………………………………………………187
8.6. Discussion ……………………………………………………………………….……..188
8.7. Conclusion and Perspectives ………………………………………………………...…188
Conclusion …………………………………………………………………….…..…189 – 190
Appendices …………………………………………………………………….…….191 – 195
Bibliography …………………………………………………………………………196 - 207Côte titre : DI/0023 En ligne : https://drive.google.com/file/d/1Qcmm0ks40JlBccKNEeSXc3Oun91tUKu3/view?usp=shari [...] Format de la ressource électronique : Catégorisation Automatique Contextuelle de Documents Semi-structurés Multilingues [texte imprimé] / Gadri, said, Auteur ; Abdelouahab Moussaoui, Directeur de thèse . - [S.l.] : Setif:UFA, 2016 . - 1 vol (207 f .) ; 29 cm.
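One contribution summarized in this record is a simple, fast algorithm for identifying the language of a text from character n-grams (Chapter 6). The thesis proposes its own method; purely to illustrate the n-gram-profile family it builds on, here is a minimal sketch of the classic Cavnar-Trenkle "out-of-place" distance, with made-up toy training corpora:

```python
from collections import Counter

def ngram_profile(text, n=3, top=300):
    """Rank the most frequent character n-grams of a text."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    ranked = [g for g, _ in sorted(grams.items(), key=lambda kv: (-kv[1], kv[0]))]
    return {g: rank for rank, g in enumerate(ranked[:top])}

def out_of_place(doc_profile, lang_profile):
    """Sum of rank displacements; n-grams absent from the language profile
    receive a maximum penalty."""
    penalty = len(lang_profile) + 1
    return sum(abs(rank - lang_profile.get(g, penalty))
               for g, rank in doc_profile.items())

def identify(text, corpora):
    """Pick the training language whose profile is closest to the text's."""
    doc = ngram_profile(text)
    return min(corpora,
               key=lambda lang: out_of_place(doc, ngram_profile(corpora[lang])))

# Toy corpora (illustrative only, far too small for real use).
corpora = {
    "en": "the cat sat on the mat and the dog ate the bone in the house",
    "fr": "le chat est sur le tapis et le chien mange un os dans la maison",
}
lang = identify("the bird sat on the branch", corpora)
```

Real systems train the per-language profiles on large corpora and keep only the top few hundred n-grams; the point of the sketch is just the ranking-displacement comparison.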
a) Document Frequency (dft) …………………………………………………………...……80
b) Inverse Document Frequency (idft ) ………………………………………….……………81
c) tf-idf coding (Term Frequency-Inverse Document Frequency) ………………...…………81
d) Variants of tf and tf-idf weighting …………………………………………………………82
3.6.2.5. TFC coding ……………………………………………………………………….…82
3.6.2.6. LNU Coding …………………………………………………………………………82
3.6.2.7. The entropy …………………………………………………………..……………………….83
3.7. Feature reduction ………………………………………………………………….…83
3.7.1. Local reduction ………………………………………………………….……….……84
3.7.2. Global reduction ………………………………………………………………………84
3.8. Dimensionality reduction by feature selection ………………………………………84
3.8.1. Document Frequency DF …………………………………………………………...…84
3.8.2. Mutual Information (MI) …………………………………………………………...…85
3.8.3. Information Gain (IG) …………………………………………………………………85
3.8.4. χ² Statistic (Chi-square / Chi-2) ……………………………………….………………86
3.8.5. Weighted Log Likelihood Ratio (WLLR) …………………………….………………87
3.9. Dimensionality Reduction by Feature Extraction …………………………….…………87
3.10. Knowledge engineering approach to TC ………………………………………………87
3.11. Machine learning approach to TC …………………………………………..…………88
3.12. Using unlabeled data to improve classification ……………………………..…………88
3.13. Multilingual text categorization ……………………………………………….………89
3.13.1. Importance of multilingual categorization ………………………………..…………89
3.13.2. Multilingual information retrieval …………………………………………...………91
3.13.2.1.Approaches based on automatic translation ………………………………...………91
3.13.2.2.Approaches based on multilingual thesaurus ………………………………….……92
3.13.2.3.Approaches based on the use of dictionaries. ………………………………………92
3.13.3. Proposed solutions for multilingual text categorization ………………………..……92
3.13.3.1.Scheme 1: The trivial scheme ………………………………………………………92
3.13.3.2.Scheme 2: Using a single language for learning. …………………………...………93
3.13.3.3.Scheme 3: Mix the training sets. ……………………………………………………94
3.13.4. The main phases of multilingual text categorization ………………………...………95
3.13.4.1.Language Identification ………………………………………………………….…95
3.13.4.2.Automatic Translation ………………………………………………………………95
3.13.4.3.Text categorization ……………………………………………………………….…96
3.14. Summary ……………………………………………………………………………96
Chapter 4: Machine Learning Algorithms for text categorization (97 – 140)
4.1. Preface ………………………………………………………………………...…………98
4.2. The text classification problem ……………………………….…………………………98
4.3. Learning algorithms used in text categorization …………………...……………………99
4.3.1. Naïve Bayes classifier …………………………………………………………………99
4.3.1.1. The multinomial model ……………………………………………...………………99
4.3.1.2.The Bernoulli model ………………………………………………………..………103
4.3.1.3.Time complexity of NB classifier ………………………………………..…………105
4.3.1.4.Linear classifiers ……………………………………………………………………105
4.3.2. Rocchio classifier ……………………………………………………………..….…105
4.3.3. k nearest neighbor classifier ……………………………………………...…………108
4.3.3.1. Similarity measures used with kNN algorithm ……………………………….……109
4.3.3.2. Probabilistic kNN ………………………………………………………………..…110
4.3.3.3. Performance of kNN classifier …………………………………………………..…110
4.3.4. Support Vector Machine classifier (SVM classifier) …………………………….…111
4.3.4.1. The linearly separable case ……………………………………………………..…111
4.3.4.2. Nonlinearly separable case and noisy data ………………………………….……115
a) Large margin classification for noisy data ………………………………..………….…115
b) Multiclass SVM …………………………………………………………………………117
c) Nonlinear SVM ……………………………………………………………………….…117
d) What kinds of functions are valid kernel functions? ……………………………………119
e) The new optimizing problem ……………………………………………………………119
f) Examples of kernel functions ……………………………………………………….……120
4.3.4.3. Software Implementations …………………………………………………………121
4.3.5. Decision Tree classifiers. …………………………………………………….………122
4.3.5.1. Algorithm presentation …………………………………………………….………122
4.3.5.2. When do we use decision tree learning? ……………………………………………124
4.3.5.3. ID3 algorithm ………………………………………………………………………124
a) Which Attribute Is the Best Classifier? …………………………………….……………124
b) How to create the decision tree using ID3 algorithm? ……………………….…………128
4.3.5.4. C4.5 algorithm ………………………………………………………………..……130
a) How does C4.5 work? ……………………………………………………………..………130
b) How are tests chosen? …………………………………………………...………………131
c) How is tree-growing terminated? ………………………………..………………………131
d) How are class labels assigned to the leaves? ……………………………………………131
4.3.6. Decision Rule Classifiers ……………………………………………...……………131
4.3.7. Regression Methods …………………………………………………………...……132
4.3.8. Artificial Neural Networks …………………………………………………..………132
4.3.9. Classifier Committees: Bagging and Boosting ……………………………..………133
A general boosting procedure ………………………………………………...…….134
4.4. Improving classifier performance ……………………………………………….……135
4.5. Evaluation of text classifiers …………………………………………………….……136
4.6. Performance Measures …………………………………………………………..……136
4.6.1. Recall, Precision and F-measure …………………………………………..…………136
4.6.2. Noise and Silence ……………………………………………………………...……137
4.6.3. Micro and Macro Average ……………………………………………….…………138
4.7. Benchmark Collections …………………………………………………….…………139
4.8. Comparison among Classifiers ………………………………………….……………139
4.9. Summary …………………………………………………………………………...…140
Chapter 5: Categorization of Semistructured documents (141 – 160)
5.1. Introduction ……………………………………………………………………….……142
5.2. The web and HTML documents …………………………………….…………………142
5.3. XML Semistructured documents ………………………………………………………142
5.3.1. From a flat document to a structured document …………………………..…………142
5.3.2. XML language ……………………………………………………………………..…143
5.3.3. XML Document ………………………………………………………………………143
5.3.3.1. The DTD (Document Type Definition) ………………………………..……………144
5.3.3.2.XML DOM (XML Document Object Model) ……………………………..…………145
5.3.3.3.XPATH ………………………………………………………………..………….…145
5.3.3.4. Types of XML documents …………………………………….………………….…145
a. Text-oriented XML documents ………………………………………………………..……145
b. Data-oriented XML documents ……………………………………………………..………146
5.3.4. Semantic of tags …………………………………………………………………..…146
5.3.4.1. Hard tags ……………………………………………………………………..……146
5.3.4.2. Soft tags ……………………………………………………………………..………147
5.3.4.3.Jump tags …………………………………………………………………...………147
5.4. XML Mining …………………………………………………………………...………147
5.5. Structured information retrieval ……………………………..…………………………148
5.5.1. Problems related to the representation ………………………………………………148
5.5.2. The need to adapt the old models ……………………………….……………………148
5.5.3. The INEX initiative and the proposed solutions ………………………..……………149
5.5.3.1. The CO Queries (Content Only) ……………………………………………...……149
5.5.3.2. CAS queries (Content and Structure) ………………………………………...……149
5.5.3.3. Evaluation of structured IR systems ……………………………………………...…149
a. The component coverage dimension ………………………………………………..……149
b. The topical relevance dimension …………………………………………………………149
5.5.4. Problem of semi-structured documents heterogeneity …………………………….…150
5.5.5. Querying and heterogeneity. …………………………………………………………151
5.5.6. Conversion of document formats ………………………………………………….…153
5.6. Categorization of XML Semistructured documents ……………………………...……153
5.6.1. Approaches based on the structure and the content …………………………….……153
5.6.1.1.[Yang and Zhang, 2007] approach …………………………………………………153
5.6.1.2. [de Campos et al., 2007] approach …………………………………………………154
5.6.2. Approaches based on the structure only ………………………………………..……157
5.6.2.1. [Zaki and Aggarwal, 2003] approach ………………………………….…………157
5.6.2.2. [Garboni et al., 2005] ………………………………………………………...……158
5.7. Summary ……………………………………………………………………………….160
Part II: Contribution Part
Chapter 6: Identifying the language of a text in a collection of multilingual documents
6.1. Introduction …………………………………………………………………………….162
6.2. State of the art ………………………………………………………………………….162
6.2.1. Language Identification Approaches …………………………...……………………162
6.2.1.1. The linguistic approach …………………………………………...……………….162
6.2.1.2. The lexical approach ……………………………………………………………….162
6.2.1.3. The grammatical approach ………………………………………….…………….162
6.2.1.4. The statistical approach …………………………………………………..……….162
6.2.2. Principle of Text Segmentation into N-grams of Characters ………………...………163
6.2.3. Advantages of Text Segmentation into N-grams …………………………….………163
6.2.4. Methods Based on N-grams for Language Identification ……………………………163
6.2.4.1. Nearest Neighbors Methods …………………………………………………..……163
The Distance of Beesley …………………………………………………………..….163
The Distance of Cavnar and Trenkle ……………………………………..………164
The Distance of Kullback-Leibler (KL) ……………………………………….……164
The Distance of Chi-square (χ²) ……………………………………………...…….164
6.2.4.2. Conventional methods used in categorization ………………………………..……165
6.3. Our proposed method …………………………………………………………………..165
6.4. Experimentations …………………………………………………………………...….166
6.4.1. Training and Test Corpus …………………………………………..……………….166
6.4.2. Pre-processing Performed on Training and Testing corpus …………….……………166
6.4.3. Performed Processing ………………………………………………………….…….166
6.5. Evaluation of obtained results ………………………………………………………….167
6.6. Conclusion and perspectives ……………………………………………………..…….169
Chapter 7: Contextual Categorization Using New Panoply of Similarity Metrics
7.1. Introduction ………………………………………………………….…………………171
7.2. State of the Art …………………………………………………………………………171
7.2.1. Language Identification and Documents Categorization ……………………………171
7.2.1.1. Language Identification ……………………………………………………………171
7.2.1.2. Automatic Categorization of Texts …………………………………………………171
7.2.2. Approaches of Texts Representation ………………………………………..………172
7.2.2.1. Representation with Bag of Words …………………………………………………172
7.2.2.2. Representation Based on N-grams …………………………………………………172
7.2.3. Methods of Texts Categorization …………………………………………….………173
7.2.3.1. Conventional Method ………………………………………………………………173
7.2.3.2. Nearest Neighbors Methods ………………………………………………..………173
7.2.4. Metrics of Similarity Used in Language Identification ………………………………173
7.3. Our Proposed Approach ……………………………………………………………..…174
7.3.1. Application of Metrics of Language Identification in Text Categorization ……….…174
7.3.2. Presentation of the New Method ………………………………………………..……174
7.4. Experimentations ………………………………………………………………………175
7.4.1. Training and Test Corpus …………………………………………………………… 175
7.4.2. Pre-processing Performed on Training and Testing corpus ……………….…………175
7.4.3. Performed Processing ………………………………………………………..………176
7.5. Evaluation of obtained Results ……………………………………………………...…176
7.5.1. Segmentation phase (Tokenization) ………………………………………..……… 176
7.5.2. Learning phase …………………………………………………………………….…177
7.5.3. Interpretation of the obtained results …………………………………...……………179
7.6. Conclusion and perspectives ………………………………………………...…………179
Chapter 8: Arabic Text Categorization: An Improved Stemming Algorithm to Increase
the Quality of Categorization (180 – 188)
8.1. Introduction ……………………………………………………………………...……..181
8.2. Related Work …………………………………………………………………………..182
8.3. The Proposed Algorithm ……………………………………………...………………..183
8.4. Experimentations and Obtained Results ………………………………...……………..184
8.4.1. The Used Dataset …………………………………………………………...………..184
8.4.2. The Obtained Results …………………………………………………….…………..185
8.5. Comparison with Other Algorithms ……………………………………………………187
8.6. Discussion ……………………………………………………………………….……..188
8.7. Conclusion and Perspectives ………………………………………………………...…188
Conclusion …………………………………………………………………….…..…189 – 190
Appendices …………………………………………………………………….…….191 – 195
Bibliography …………………………………………………………………………196 – 207
Call number: DI/0023
Online: https://drive.google.com/file/d/1Qcmm0ks40JlBccKNEeSXc3Oun91tUKu3/view?usp=shari [...]
Electronic resource format: Copies (1)
Barcode: DI/0023 · Call number: DI/0023 · Type: Thesis · Location: Bibliothèque des sciences · Section: French · Availability: Available
Deep Belief Networks Applied to Alzheimer’s Disease Detection and Classification using Neuroimaging Data / Yacine Deradra
Deep Feature Learning (Extraction and Generation) Using a Bidirectional LSTM-CNN and Deep Generative Models Applied to Physiological Signals (EEG/ECG) Classification / Hichem Betiche
Deep learning models for arrhythmia classification and coronary artery diseases detection / Khaoula Tobbal
Enhancing Medical Image Segmentation with FastAI and Vision Transformers: A Hybrid Approach / Rahma Hebbir
Ensemble Deep Learning-based Semantic Segmentation and Classification of Leaves Images for Agricultural Apple and Wheat Diseases’ Detection / Naryméne Kebiche