University Sétif 1 FERHAT ABBAS Faculty of Sciences
Author details
Author: Gadri, Said
Available documents written by this author



Title: Catégorisation Automatique Contextuelle de Documents Semi-structurés Multilingues (Contextual Automatic Categorization of Multilingual Semi-structured Documents)
Document type: printed text
Authors: Gadri, Said, Author; Abdelouahab Moussaoui, Thesis supervisor
Publisher: Setif: UFA
Publication year: 2016
Extent: 1 vol. (207 f.)
Format: 29 cm
Categories: Computer science
Keywords: Contextual Automatic Categorization, Multilingual Semi-structured Documents
Abstract:
Text categorization is a very important task in the text mining process. It consists of assigning a set of texts to a set of categories according to their themes, using learning algorithms known from the field of artificial intelligence. Our study of this research axis led us to propose several solutions and contributions, notably: proposing a simple, fast, and efficient algorithm for identifying the language of a text in a multilingual corpus; developing an improved algorithm for finding the roots of Arabic words based on a fully statistical approach, whose main objective is to reduce the size of the term vocabulary and thereby improve the quality of the resulting categorization and increase search effectiveness in information retrieval; developing a new multilingual stemmer that is more general and independent of any particular language; and applying a new panoply of pseudo-distances to categorize the texts of a large corpus (Reuters-21578). All of these solutions were the subject of scientific papers published in indexed international conferences and journals.
Contents note: Table of Contents
Acknowledgments I
Dedications II
Abstract III
Author Biography IV
Introduction 1 - 6
1. Scope of the work ………………………………………………………….……………….1
2. Problematic …………………………………………………………………...…………….2
3. Our contribution ………………………………………………………………………….2-3
3.1. Language identification …………………………………………………….…………….2
3.2. Contextual text categorization ………………………………………………...………….3
3.3. Arabic Stemming …………………………………………………………...…………….3
3.4. Multilingual Stemming …………………………………………………………..……….3
4. Thesis organization ……………………………………………………………...………….4
4.1. A theoretical part ………………………………………………………………………….4
4.2. Contributions part …………………………………………………………..…………….4
5. Author Publications ……………………………………………………………...………….4
Part I: Theoretical Part (1 – 160)
Chapter 1: Data Mining: Basic Concepts and Tools (7 – 40)
1.1.Introduction ……………………………………………………………………………..…7
1.2.Knowledge data discovery from data …………………………………………………...…8
1.3.Some popular definitions of Data Mining …………………………………………………8
1.4.What is Data Mining? An Introductory Example …………………………………...……10
1.5.What is not Data Mining …………………………………………………………………11
1.6.Data mining and knowledge Discovery ………………………………….………………12
1.7.Where Data Mining can be placed? Origins ………………………………..……………12
1.8.Data mining motivations …………………………………………………………………14
1.9.Data mining Tasks ………………………………………………………….……………17
1.9.1. Predictive Tasks ………………………………………………………...……………17
1.9.2. Descriptive tasks …………………………………………………………..…………17
1.10. A classic example of data mining use …………………..……………………………17
1.11. Data mining applications ……………………………………………….……………18
1.11.1. Business Problems ……………………………………………………………...……18
1.11.2. Other problems for data mining …………………………………...…………………19
1.12. Principal tasks of Data Mining ………………………………………………….……19
1.12.1. Classification (predictive task) ………………………………………………………19
Some classification applications ………………………………………………………..……20
1.12.2. Clustering (descriptive Task) …………………………………………...……………21
Some clustering applications ………………………………………………………...………22
1.12.3. Association Rule Mining (descriptive task) …………………….……………………23
Some association rule discovery applications …………………………….…………………24
1.12.4. Regression (predictive task) …………………………………………………………25
Some regression applications …………………………………………………………..……25
1.12.5. Anomaly Detection/Deviation analysis (descriptive task) ……………..……………26
Some anomaly detection applications ………………………………………………..………26
1.12.6. Sequential Pattern Mining (descriptive task) ……………………………...…………26
Some pattern discovery applications …………………………………………………………27
1.12.7. Time Series Prediction (Forecasting/predictive task) ………………….……………28
1.12.8. Decision Making …………………………………………………………………..…28
1.13. Data Mining Project Cycle …………………………………………………...………29
1.14. Types of data sets used in Data mining field …………………………...……………33
1.15. Major Vendors and Products …………………………………………………………37
1.16. Text Mining, Web Mining, XML Mining: New applications of Data Mining……….38
1.16.1. Text Mining ……………………………………………………………………..……38
1.16.2. Web Mining ………………………………………………………………..…………38
1.16.3. XML Mining ……………………………………………………………….…………39
1.17. Future perspectives in Data Mining …………………………………….……………39
1.18. Summary …………………………………………………………………………..…40
Chapter 2: Fundamentals of text mining (41 – 70)
2.1. Introduction ……………………………………………………………………….….….42
2.2. Some definitions of Text Mining ……………………………….……………………….42
2.3. Data Mining Vs Text Mining ………………………………………………………...….42
2.4. Structured and unstructured data ………………………………………………..……….43
2.5. Why Text Mining- Motivation ………………………………………………….……….44
2.6. Where Text Mining can be placed? ………………………………………….………….44
2.7. Why Text Mining is hard? Major difficulties ……………………………………..…….45
2.8. Text Mining Applications ……………………………………………………………….46
2.8.1. Document classification ……………………………………………………………….46
2.8.2. Information retrieval ……………………………………………………….………….47
2.8.3. Clustering and organizing document ………………………………………………….48
2.8.4. Information extraction ……………………………………………………………..….49
2.9. Architecture of text mining systems ……………………………………………….……49
2.9.1. General architecture ……………………………………………………………….…49
2.9.2. Functional Architecture ………………………………………………………………50
2.10. Text Mining process step by step …………………………………………….………52
2.10.1. Collecting documents ……………………………………………………………..…52
2.10.2. Text Preprocessing tasks ………………………………………………………..……54
2.10.2.1. Document Standardization …………………………………………….…………54
2.10.2.2. Tokenization …………………………………………………………………...…54
2.10.2.3. Simple Syntactic Analysis ……………………………………….………………54
2.10.2.4. Advanced Linguistic Analysis …………………………………...………………55
a. Part Of Speech (POS) tagging ……………………………………………….………55
b. Syntactical parsing ………………………………………………………………...…57
c. Shallow Parsing ………………………………………………………………...……58
d. Word Sense Disambiguation …………………………………………………………59
2.10.2.5. Lemmatization and stemming ……………………………………………………59
a. Lookup-based Stemming ………………………………………………………..……60
b. Rule-based stemming (affix removal stemming) ………………………………………60
c. Inflectional stemming (lemmatization) ………………………………………………60
d. Stemming to a root ………………………………………………………………...…61
2.10.3. Feature Generation and data representation …………………….……………………61
2.10.3.1. Global dictionary vs. local Dictionary ………………………………...…………61
2.10.3.2. Features reduction ……………………………………………………………..…62
2.10.3.3. Data representation ………………………………………………………………62
a. Binary model …………………………………………………………………………62
b. Three values model ………………………………………………………..…………62
c. Term frequency model (tf) ……………………………………………………………63
d. Tf-idf model (Term Frequency-Inverse Document Frequency) …………...…………63
2.10.3.4. Multiword Features ………………………………………………………………64
2.10.3.5. Labels for the Right Answers ……………………………………….……………64
2.10.3.6. Named Entity Recognition (NER) …………………………………….…………65
2.10.4. Feature selection ……………………………………………………………...……65
2.10.5. Data Mining (pattern discovery) …………………………………………...………65
2.10.5.1. Classification ………………………………………………………………….…66
2.10.5.2. Clustering …………………………………………………………………..……66
a. Jaccard Coefficient ……………………………………………………..……………67
b. Cosine Similarity ………………………………………………………..……………68
c. Cosine Similarity and TF-IDF ………………………………….……………………68
2.10.5.3. Sentiment Analysis ……………………………………………………….………69
2.11. Summary …………………………………………………………………...………70
Chapter 3: Automatic Text Categorization (71 – 96)
3.1. Preface ……………………………………………………………………………...……72
3.2. Definition of the problem ……………………………………………..…………………73
3.2.1. Single-Label versus Multilabel Categorization …………………..……………………73
3.2.2. Document-Pivoted versus Category-Pivoted Categorization …………………………73
3.3. Applications of text categorization ………………………………………………...……73
3.3.1. Indexing of Texts Using Controlled Vocabulary …………………………………..…74
3.3.2. Document Sorting and Text Filtering …………………………………………………74
3.3.3. Hierarchical Web Page Categorization …………………………………………..……74
3.4. Particular difficulties of text categorization …………………………………………..…74
3.4.1. Big size of vectors …………………………………………………………………..…75
3.4.2. Imbalance of classes …………………………………………………………..………75
3.4.3. Ambiguity of terms …………………………………………………………….………75
3.4.4. Problem of synonymy ……………………………………………………….…………75
3.4.5. Subjectivity of decision ………………………………………………………...………75
3.5. How to categorize a monolingual text: the general process of TC ………..……………76
3.6. The problem of document representation ………………………………………….……77
3.6.1. Choice of document features (terms) …………………………………………….……77
3.6.1.1. Representation with bag of words ………………………………………………...…78
3.6.1.2. Representation with Sentences ………………………………………………………78
3.6.1.3. Representation with lexical roots (Stems) ……………………………………...……78
3.6.1.4. Representation with lemmas ……………………………………………………...…78
3.6.1.5. Representation Based on N-grams …………………………………………..………78
3.6.1.6. Conceptual representation ……………………………………………………..……79
3.6.2. Terms coding ……………………………………………………………………..……79
3.6.2.1. Binary model …………………………………………………………...……………79
3.6.2.2. Three values model ……………………………………………….…………………79
3.6.2.3. Term frequency code (tf) ……………………………………………………….……80
3.6.2.4. Tf_idf coding (Term Frequency-Inverse Document Frequency) ………………..…..80
a) Document Frequency (dft) …………………………………………………………...……80
b) Inverse Document Frequency (idft) ………………………………………….……………81
c) tf-idf coding (Term Frequency-Inverse Document Frequency) ………………...…………81
d) Variants of tf and tf-idf weighting …………………………………………………………82
3.6.2.5. TFC coding ……………………………………………………………………….…82
3.6.2.6. LNU Coding …………………………………………………………………………82
3.6.2.7. The entropy …………………………………………………………..……………………….83
3.7. Feature reduction ………………………………………………………………….…83
3.7.1. Local reduction ………………………………………………………….……….……84
3.7.2. Global reduction ………………………………………………………………………84
3.8. Dimensionality reduction by feature selection ………………………………………84
3.8.1. Document Frequency DF …………………………………………………………...…84
3.8.2. Mutual Information (MI) …………………………………………………………...…85
3.8.3. Information Gain (IG) …………………………………………………………………85
3.8.4. χ² Statistic (Chi-square / Chi-2) ……………………………………….………………86
3.8.5. Weighted Log Likelihood Ratio (WLLR) …………………………….………………87
3.9. Dimensionality Reduction by Feature Extraction …………………………….…………87
3.10. Knowledge engineering approach to TC ………………………………………………87
3.11. Machine learning approach to TC …………………………………………..…………88
3.12. Using unlabeled data to improve classification ……………………………..…………88
3.13. Multilingual text categorization ……………………………………………….………89
3.13.1. Importance of multilingual categorization ………………………………..…………89
3.13.2. Multilingual information retrieval …………………………………………...………91
3.13.2.1.Approaches based on automatic translation ………………………………...………91
3.13.2.2.Approaches based on multilingual thesaurus ………………………………….……92
3.13.2.3.Approaches based on the use of dictionaries. ………………………………………92
3.13.3. Proposed solutions for multilingual text categorization ………………………..……92
3.13.3.1.Scheme 1: The trivial scheme ………………………………………………………92
3.13.3.2.Scheme 2: Using a single language for learning. …………………………...………93
3.13.3.3.Scheme 3: Mix the training sets. ……………………………………………………94
3.13.4. The main phases of multilingual text categorization ………………………...………95
3.13.4.1.Language Identification ………………………………………………………….…95
3.13.4.2.Automatic Translation ………………………………………………………………95
3.13.4.3.Text categorization ……………………………………………………………….…96
3.14. Summary ……………………………………………………………………………96
Chapter 4: Machine Learning Algorithms for text categorization (97 – 140)
4.1. Preface ………………………………………………………………………...…………98
4.2. The text classification problem ……………………………….…………………………98
4.3. Learning algorithms used in text categorization …………………...……………………99
4.3.1. Naïve Bayes classifier …………………………………………………………………99
4.3.1.1. The multinomial model ……………………………………………...………………99
4.3.1.2.The Bernoulli model ………………………………………………………..………103
4.3.1.3.Time complexity of NB classifier ………………………………………..…………105
4.3.1.4.Linear classifiers ……………………………………………………………………105
4.3.2. Rocchio classifier ……………………………………………………………..….…105
4.3.3. k nearest neighbor classifier ……………………………………………...…………108
4.3.3.1. Similarity measures used with kNN algorithm ……………………………….……109
4.3.3.2. Probabilistic kNN ………………………………………………………………..…110
4.3.3.3. Performance of kNN classifier …………………………………………………..…110
4.3.4. Support Vector Machine classifier (SVM classifier) …………………………….…111
4.3.4.1. The linearly separable case ……………………………………………………..…111
4.3.4.2. Nonlinearly separable case and noisy data ………………………………….……115
a) Large margin classification for noisy data ………………………………..………….…115
b) Multiclass SVM …………………………………………………………………………117
c) Nonlinear SVM ……………………………………………………………………….…117
d) What kinds of functions are valid kernel functions? ……………………………………119
e) The new optimizing problem ……………………………………………………………119
f) Examples of kernel functions ……………………………………………………….……120
4.3.4.3. Software Implementations …………………………………………………………121
4.3.5. Decision Tree classifiers. …………………………………………………….………122
4.3.5.1. Algorithm presentation …………………………………………………….………122
4.3.5.2. When we use decision tree learning? ………………………………………………124
4.3.5.3. ID3 algorithm ………………………………………………………………………124
a) Which Attribute Is the Best Classifier? …………………………………….……………124
b) How to create the decision tree using ID3 algorithm? ……………………….…………128
4.3.5.4. C4.5 algorithm ………………………………………………………………..……130
a) How C4.5 works? ………………………………………………………………..………130
b) How are tests chosen? …………………………………………………...………………131
c) How is tree-growing terminated? ………………………………..………………………131
d) How are class labels assigned to the leaves? ……………………………………………131
4.3.6. Decision Rule Classifiers ……………………………………………...……………131
4.3.7. Regression Methods …………………………………………………………...……132
4.3.8. Artificial Neural Networks …………………………………………………..………132
4.3.9. Classifier Committees: Bagging and Boosting ……………………………..………133
A general boosting procedure ………………………………………………...…….134
4.4. Improving classifier performance ……………………………………………….……135
4.5. Evaluation of text classifiers …………………………………………………….……136
4.6. Performance Measures …………………………………………………………..……136
4.6.1. Recall, Precision and F-measure …………………………………………..…………136
4.6.2. Noise and Silence ……………………………………………………………...……137
4.6.3. Micro and Macro Average ……………………………………………….…………138
4.7. Benchmark Collections …………………………………………………….…………139
4.8. Comparison among Classifiers ………………………………………….……………139
4.9. Summary …………………………………………………………………………...…140
Chapter 5: Categorization of Semistructured documents (141 – 160)
5.1. Introduction ……………………………………………………………………….……142
5.2. The web and HTML documents …………………………………….…………………142
5.3. XML Semistructured documents ………………………………………………………142
5.3.1. From a flat document to a structured document …………………………..…………142
5.3.2. XML language ……………………………………………………………………..…143
5.3.3. XML Document ………………………………………………………………………143
5.3.3.1. The DTD (Document Type Definition) ………………………………..……………144
5.3.3.2.XML DOM (XML Document Object Model) ……………………………..…………145
5.3.3.3.XPATH ………………………………………………………………..………….…145
5.3.3.4. Types of XML documents …………………………………….………………….…145
a. XML documents oriented texts ……………………………………………………..……145
b. XML documents oriented data …………………………………………………..………146
5.3.4. Semantic of tags …………………………………………………………………..…146
5.3.4.1. Hard tags ……………………………………………………………………..……146
5.3.4.2. Soft tag ……………………………………………………………………..………147
5.3.4.3.Jump tags …………………………………………………………………...………147
5.4. XML Mining …………………………………………………………………...………147
5.5. Structured information retrieval ……………………………..…………………………148
5.5.1. Problems related to the representation ………………………………………………148
5.5.2. The need to adapt the old models ……………………………….……………………148
5.5.3. The INEX initiative and the proposed solutions ………………………..……………149
5.5.3.1. The CO Queries (Content Only) ……………………………………………...……149
5.5.3. 2. CAS queries (Content and Structure) ………………………………………...……149
5.5.3.3. Evaluation of structured RI system ……………………………………………...…149
a. The component coverage dimension ………………………………………………..……149
b. The topical relevance dimension …………………………………………………………149
5.5.4. Problem of semi-structured documents heterogeneity …………………………….…150
5.5.5. Querying and heterogeneity. …………………………………………………………151
5.5.6. Conversion of document formats ………………………………………………….…153
5.6. Categorization of XML Semistructured documents ……………………………...……153
5.6.1. Approaches based on the structure and the content …………………………….……153
5.6.1.1.[Yang and Zhang, 2007] approach …………………………………………………153
5.6.1.2. [de Campos et al., 2007] approach …………………………………………………154
5.6.2. Approaches based on the structure only ………………………………………..……157
5.6.2.1. [Zaki and Aggarwal, 2003] approach ………………………………….…………157
5.6.2.2. [Garboni et al., 2005] ………………………………………………………...……158
5.7. Summary ……………………………………………………………………………….160
Part II: Contribution Part
Chapter 6: Identifying the language of a text in a collection of multilingual documents
6.1. Introduction …………………………………………………………………………….162
6.2. State of the art ………………………………………………………………………….162
6.2.1. Language Identification Approaches …………………………...……………………162
6.2.1.1. The linguistic approach …………………………………………...……………….162
6.2.1.2. The lexical approach ……………………………………………………………….162
6.2.1.3. The grammatical approach ………………………………………….…………….162
6.2.1.4. The statistical approach …………………………………………………..……….162
6.2.2. Principle of Text Segmentation into N-grams of Characters ………………...………163
6.2.3. Advantages of Text Segmentation into N-grams …………………………….………163
6.2.4. Methods Based on N-grams for Language Identification ……………………………163
6.2.4.1. Nearest Neighbors Methods …………………………………………………..……163
The Distance of Beesley …………………………………………………………..….163
The Distance of Cavnar and Trenkle ……………………………………..………164
The Distance of Kullback-Leibler (KL) ……………………………………….……164
The Distance of χ² (Chi-2) ……………………………………………………...…….164
6.2.4.2. Conventional methods used in categorization ………………………………..……165
6.3. Our proposed method …………………………………………………………………..165
6.4. Experimentations …………………………………………………………………...….166
6.4.1. Training and Test Corpus …………………………………………..……………….166
6.4.2. Pre-processing Performed on Training and Testing corpus …………….……………166
6.4.3. Performed Processing ………………………………………………………….…….166
6.5. Evaluation of obtained results ………………………………………………………….167
6.6. Conclusion and perspectives ……………………………………………………..…….169
Chapter 7: Contextual Categorization Using New Panoply of Similarity Metrics
7.1. Introduction ………………………………………………………….…………………171
7.2. State of the Art …………………………………………………………………………171
7.2.1. Language Identification and Documents Categorization ……………………………171
7.2.1.1. Language Identification ……………………………………………………………171
7.2.1.2. Automatic Categorization of Texts …………………………………………………171
7.2.2. Approaches of Texts Representation ………………………………………..………172
7.2.2.1. Representation with Bag of Words …………………………………………………172
7.2.2.2. Representation Based on N-grams …………………………………………………172
7.2.3. Methods of Texts Categorization …………………………………………….………173
7.2.3.1. Conventional Method ………………………………………………………………173
7.2.3.2. Nearest Neighbors Methods ………………………………………………..………173
7.2.4. Metrics of Similarity Used in Language Identification ………………………………173
7.3. Our Proposed Approach ……………………………………………………………..…174
7.3.1. Application of Metrics of Language Identification in Text Categorization ……….…174
7.3.2. Presentation of the New Method ………………………………………………..……174
7.4. Experimentations ………………………………………………………………………175
7.4.1. Training and Test Corpus …………………………………………………………… 175
7.4.2. Pre-processing Performed on Training and Testing corpus ……………….…………175
7.4.3. Performed Processing ………………………………………………………..………176
7.5. Evaluation of obtained Results ……………………………………………………...…176
7.5.1. Segmentation phase (Tokenization) ………………………………………..……… 176
7.5.2. Learning phase …………………………………………………………………….…177
7.5.3. Interpretation of the obtained results …………………………………...……………179
7.6. Conclusion and perspectives ………………………………………………...…………179
Chapter 8: Arabic Text Categorization: An Improved Stemming Algorithm to Increase
the Quality of Categorization (180 – 188)
8.1. Introduction ……………………………………………………………………...……..181
8.2. Related Work …………………………………………………………………………..182
8.3. The Proposed Algorithm ……………………………………………...………………..183
8.4. Experimentations and Obtained Results ………………………………...……………..184
8.4.1. The Used Dataset …………………………………………………………...………..184
8.4.2. The Obtained Results …………………………………………………….…………..185
8.5. Comparison with Other Algorithms ……………………………………………………187
8.6. Discussion ……………………………………………………………………….……..188
8.7. Conclusion and Perspectives ………………………………………………………...…188
Conclusion …………………………………………………………………….…..…189 – 190
Appendices …………………………………………………………………….…….191 – 195
Bibliography …………………………………………………………………………196 - 207
Call number: DI/0023
Online: https://drive.google.com/file/d/1Qcmm0ks40JlBccKNEeSXc3Oun91tUKu3/view?usp=shari [...]
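Chapter 6 of the table of contents surveys N-gram profile methods for language identification, including the Cavnar and Trenkle distance. As a hedged illustration only (the record does not give the author's own proposed algorithm), the out-of-place measure behind that family of methods can be sketched as follows; the function names and parameter choices here are illustrative, not taken from the thesis:

```python
from collections import Counter

def ngram_profile(text, n_max=3, top_k=300):
    """Build a ranked character n-gram profile (1..n_max grams) of a text."""
    counts = Counter()
    for token in text.lower().split():
        padded = f"_{token}_"  # pad word boundaries, as in profile-based methods
        for n in range(1, n_max + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Keep the top_k most frequent n-grams, mapped to their frequency rank.
    ranked = [g for g, _ in counts.most_common(top_k)]
    return {g: rank for rank, g in enumerate(ranked)}

def out_of_place(doc_profile, lang_profile):
    """Cavnar-Trenkle style out-of-place distance between two ranked profiles."""
    penalty = len(lang_profile)  # maximum penalty for grams absent from the profile
    return sum(abs(rank - lang_profile.get(g, penalty))
               for g, rank in doc_profile.items())

def identify_language(text, profiles):
    """Return the language whose training profile is nearest to the text's profile."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))
```

In use, one profile is trained per language from a monolingual corpus, and an unknown document is assigned the language with the smallest out-of-place distance; real systems train on far larger samples than this toy usage suggests.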
Catégories : Informatique Mots-clés : Catégorisation Automatique Contextuelle Documents Semi-structurés Multilingues Résumé : Résumé
La catégorisation de textes est une tache très importante dans le processus de text mining. Cette tache
consiste à affecter un ensemble de textes à un autre ensemble de catégories selon leurs thèmes et en
exploitant les algorithmes d’apprentissage connus dans le domaine d’intelligence artificielle. Notre
étude sur cet axe de recherche nous a permis de proposer quelques solutions et de porter certaines
contributions, notamment: Proposer un algorithme simple, rapide et efficace pour identifier la langue
d’un texte dans un corpus multilingue. Développer un algorithme amélioré pour la recherche des
racines des mots arabes en se basant sur une approche complètement statistique. L’objectif principal
de cet algorithme est de réduire la taille du vocabulaire de termes et par conséquent améliorer la
qualité de la catégorisation obtenue dans le domaine de la catégorisation de textes et augmenter
l’efficacité de la recherche dans le domaine de la recherche d’information. Développer un nouveau
stemmer multilingue qui est plus général et indépendant de toute langue. Application d’une nouvelle
panoplie de pseudo-distances pour catégoriser les textes d’un corpus de grande taille (Reuters21578).
Toutes ces solutions étaient l’objet de papiers scientifiques publiés dans des conférences et des
journaux internationaux indexés.Note de contenu : Table of Content
Aknowledgments I
Dedications II
Abstract III
Author Biography IV
Introduction 1 - 6
1. Scope of the work ………………………………………………………….……………….1
2. Problematic …………………………………………………………………...…………….2
3. Our contribution ………………………………………………………………………….2-3
3.1. Language identification …………………………………………………….…………….2
3.2. Contextual text categorization ………………………………………………...………….3
3.3. Arabic Stemming …………………………………………………………...…………….3
3.4. Multilingual Stemming …………………………………………………………..……….3
4. Thesis organization ……………………………………………………………...………….4
4.1. A theoretical part ………………………………………………………………………….4
4.2. Contributions part …………………………………………………………..…………….4
5. Author Publications ……………………………………………………………...………….4
Part I: Theoretical Part (1 – 160)
Chapter 1: Data Mining: Basic Concepts and Tools (7 – 40)
1.1.Introduction ……………………………………………………………………………..…7
1.2.Knowledge data discovery from data …………………………………………………...…8
1.3.Some popular definitions of Data Mining …………………………………………………8
1.4.What is Data Mining? An introductive Example …………………………………...……10
1.5.What is not Data Mining …………………………………………………………………11
1.6.Data mining and knowledge Discovery ………………………………….………………12
1.7.Where Data Mining can be placed? Origins ………………………………..……………12
1.8.Data mining motivations …………………………………………………………………14
1.9.Data mining Tasks ………………………………………………………….……………17
1.9.1. Predictive Tasks ………………………………………………………...……………17
1.9.2. Descriptive tasks …………………………………………………………..…………17
1.10. A classic example of data mining use …………………..……………………………17
1.11. Data mining applications ……………………………………………….……………18
1.11.1. Business Problems ……………………………………………………………...……18
1.11.2. Other problems for data mining …………………………………...…………………19
1.12. Principle tasks of Data Mining ………………………………………………….……19
1.12.1. Classification (predictive task) ………………………………………………………19
Some classification applications ………………………………………………………..……20
1.12.2. Clustering (descriptive Task) …………………………………………...……………21
Some clustering applications ………………………………………………………...………22
1.12.3. Association Rule Mining (descriptive task) …………………….……………………23
Some association rule discovery applications …………………………….…………………24
1.12.4. Regression (predictive task) …………………………………………………………25
Some regression applications …………………………………………………………..……25
1.12.5. Anomaly Detection/Deviation analysis (descriptive task) ……………..……………26
Some anomaly detection applications ………………………………………………..………26
1.12.6. Sequential Pattern Mining (descriptive task) ……………………………...…………26
Some pattern discovery applications …………………………………………………………27
1.12.7. Time Series Prediction (Forecasting/predictive task) ………………….……………28
1.12.8. Decision Making …………………………………………………………………..…28
1.13. Data Mining Project Cycle …………………………………………………...………29
1.14. Types of data sets used in Data mining field …………………………...……………33
1.15. Major Vendors and Products …………………………………………………………37
1.16. Text Mining, Web Mining, XML Mining: New applications of Data Mining……….38
1.16.1. Text Mining ……………………………………………………………………..……38
1.16.2. Web Mining ………………………………………………………………..…………38
1.16.3. XML Mining ……………………………………………………………….…………39
1.17. Future perspectives in Data Mining …………………………………….……………39
1.18. Summary …………………………………………………………………………..…40
Chapter 2: Fundamentals of text mining (41 – 70)
2.1. Introduction ……………………………………………………………………….….….42
2.2. Some definitions of Text Mining ……………………………….……………………….42
2.3. Data Mining Vs Text Mining ………………………………………………………...….42
2.4. Structured and unstructured data ………………………………………………..……….43
2.5. Why Text Mining- Motivation ………………………………………………….……….44
2.6. Where Text Mining can be placed? ………………………………………….………….44
2.7. Why Text Mining is hard? Major difficulties ……………………………………..…….45
2.8. Text Mining Applications ……………………………………………………………….46
2.8.1. Document classification ……………………………………………………………….46
2.8.2. Information retrieval ……………………………………………………….………….47
2.8.3. Clustering and organizing document ………………………………………………….48
2.8.4. Information extraction ……………………………………………………………..….49
2.9. Architecture of text mining systems ……………………………………………….……49
2.9.1. General architecture ……………………………………………………………….…49
2.9.2. Functional Architecture ………………………………………………………………50
2.10. Text Mining process step by step …………………………………………….………52
2.10.1. Collecting documents ……………………………………………………………..…52
2.10.2. Text Preprocessing tasks ………………………………………………………..……54
2.10.2.1. Document Standardization …………………………………………….…………54
2.10.2.2. Tokenization …………………………………………………………………...…54
2.10.2.3. Simple Syntactic Analysis ……………………………………….………………54
2.10.2.4. Advanced Linguistic Analysis …………………………………...………………55
a. Part Of Speech (POS) tagging ……………………………………………….………55
b. Syntactical parsing ………………………………………………………………...…57
c. Shallow Parsing ………………………………………………………………...……58
d. Word Sense Disambiguation …………………………………………………………59
2.10.2.5. Lemmatization and stemming ……………………………………………………59
a. Lookup-based Stemming ………………………………………………………..……60
b. Rule-base stemming (affix removal stemming) ………………………………………60
c. Inflectional stemming (lemmatization) ………………………………………………60
d. Stemming to a root ………………………………………………………………...…61
2.10.3. Feature Generation and data representation …………………….……………………61
2.10.3.1. Global dictionary vs. local Dictionary ………………………………...…………61
2.10.3.2. Features reduction ……………………………………………………………..…62
2.10.3.3. Data representation ………………………………………………………………62
a. Binary model …………………………………………………………………………62
b. Three values model ………………………………………………………..…………62
c. Term frequency model (tf) ……………………………………………………………63
d. Tf-idf model (Term Frequency-Inverse Document Frequency) …………...…………63
2.10.3.4. Multiword Features ………………………………………………………………64
2.10.3.5. Labels for the Right Answers ……………………………………….……………64
2.10.3.6. Named Entity Recognition (NER) …………………………………….…………65
2.10.4. Feature selection ……………………………………………………………...……65
2.10.5. Data Mining (pattern discovery) …………………………………………...………65
2.10.5.1. Classification ………………………………………………………………….…66
2.10.5.2. Clustering …………………………………………………………………..……66
a. Jaccard Coefficient ……………………………………………………..……………67
b. Cosine Similarity ………………………………………………………..……………68
c. Cosine Similarity and TF-IDF ………………………………….……………………68
2.10.5.3. Sentiment Analysis ……………………………………………………….………69
2.11. Summary …………………………………………………………………...………70
Chapter 3: Automatic Text Categorization (71 – 96)
3.1. Preface ……………………………………………………………………………...……72
3.2. Definition of the problem ……………………………………………..…………………73
3.2.1. Single-Label versus Multilabel Categorization …………………..……………………73
3.2.2. Document-Pivoted versus Category-Pivoted Categorization …………………………73
3.3. Applications of text categorization ………………………………………………...……73
3.3.1. Indexing of Texts Using Controlled Vocabulary …………………………………..…74
3.3.2. Document Sorting and Text Filtering …………………………………………………74
3.3.3. Hierarchical Web Page Categorization …………………………………………..……74
3.4. Particular difficulties of text categorization …………………………………………..…74
3.4.1. Big size of vectors …………………………………………………………………..…75
3.4.2. Imbalance of classes …………………………………………………………..………75
3.4.3. Ambiguity of terms …………………………………………………………….………75
3.4.4. Problem of synonymy ……………………………………………………….…………75
3.4.5. Subjectivity of decision ………………………………………………………...………75
3.5. How to categorize a monolingual text: the general process of TC ………..……………76
3.6. The problem of document representation ………………………………………….……77
3.6.1. Choice of document features (terms) …………………………………………….……77
3.6.1.1. Representation with bag of words ………………………………………………...…78
3.6.1.2. Representation with Sentences ………………………………………………………78
3.6.1.3. Representation with lexical roots (Stems) ……………………………………...……78
3.6.1.4. Representation with lemmas ……………………………………………………...…78
3.6.1.5. Representation Based on N-grams …………………………………………..………78
3.6.1.6. Conceptual representation ……………………………………………………..……79
3.6.2. Terms coding ……………………………………………………………………..……79
3.6.2.1. Binary model …………………………………………………………...……………79
3.6.2.2. Three values model ……………………………………………….…………………79
3.6.2.3. Term frequency code (tf) ……………………………………………………….……80
3.6.2.4. Tf_idf coding (Term Frequency-Inverse Document Frequency) ………………..…..80
a) Document Frequency (dft) …………………………………………………………...……80
b) Inverse Document Frequency (idft ) ………………………………………….……………81
c) tf-idf coding (Term Frequency-Inverse Document Frequency) ………………...…………81
d) Variants of tf and tf-idf weighting …………………………………………………………82
3.6.2.5. TFC coding ……………………………………………………………………….…82
3.6.2.6. LNU Coding …………………………………………………………………………82
3.6.2.7. The entropy …………………………………………………………..……………………….83
3.7. Feature reduction ………………………………………………………………….…83
3.7.1. Local reduction ………………………………………………………….……….……84
3.7.2. Global reduction ………………………………………………………………………84
3.8. Dimensionality reduction by feature selection ………………………………………84
3.8.1. Document Frequency DF …………………………………………………………...…84
3.8.2. Mutual Information (MI) …………………………………………………………...…85
3.8.3. Information Gain (IG) …………………………………………………………………85
3.8.4. χ² Statistic (Chi-square / Chi-2) ……………………………………….………………86
3.8.5. Weighted Log Likelihood Ratio (WLLR) …………………………….………………87
3.9. Dimensionality Reduction by Feature Extraction …………………………….…………87
3.10. Knowledge engineering approach to TC ………………………………………………87
3.11. Machine learning approach to TC …………………………………………..…………88
3.12. Using unlabeled data to improve classification ……………………………..…………88
3.13. Multilingual text categorization ……………………………………………….………89
3.13.1. Importance of multilingual categorization ………………………………..…………89
3.13.2. Multilingual information retrieval …………………………………………...………91
3.13.2.1. Approaches based on automatic translation ………………………………...………91
3.13.2.2. Approaches based on multilingual thesauri ………………………………….……92
3.13.2.3. Approaches based on the use of dictionaries ………………………………………92
3.13.3. Proposed solutions for multilingual text categorization ………………………..……92
3.13.3.1. Scheme 1: The trivial scheme ………………………………………………………92
3.13.3.2. Scheme 2: Using a single language for learning …………………………...………93
3.13.3.3. Scheme 3: Mixing the training sets ……………………………………………………94
3.13.4. The main phases of multilingual text categorization ………………………...………95
3.13.4.1. Language Identification ………………………………………………………….…95
3.13.4.2. Automatic Translation ………………………………………………………………95
3.13.4.3. Text categorization ……………………………………………………………….…96
3.14. Summary ……………………………………………………………………………96
Chapter 4: Machine Learning Algorithms for text categorization (97 – 140)
4.1. Preface ………………………………………………………………………...…………98
4.2. The text classification problem ……………………………….…………………………98
4.3. Learning algorithms used in text categorization …………………...……………………99
4.3.1. Naïve Bayes classifier …………………………………………………………………99
4.3.1.1. The multinomial model ……………………………………………...………………99
4.3.1.2. The Bernoulli model ………………………………………………………..………103
4.3.1.3. Time complexity of NB classifier ………………………………………..…………105
4.3.1.4. Linear classifiers ……………………………………………………………………105
4.3.2. Rocchio classifier ……………………………………………………………..….…105
4.3.3. k-nearest neighbor (kNN) classifier ……………………………………...…………108
4.3.3.1. Similarity measures used with kNN algorithm ……………………………….……109
4.3.3.2. Probabilistic kNN ………………………………………………………………..…110
4.3.3.3. Performance of kNN classifier …………………………………………………..…110
4.3.4. Support Vector Machine classifier (SVM classifier) …………………………….…111
4.3.4.1. The linearly separable case ……………………………………………………..…111
4.3.4.2. Nonlinearly separable case and noisy data ………………………………….……115
a) Large margin classification for noisy data ………………………………..………….…115
b) Multiclass SVM …………………………………………………………………………117
c) Nonlinear SVM ……………………………………………………………………….…117
d) What kinds of functions are valid kernel functions? ……………………………………119
e) The new optimizing problem ……………………………………………………………119
f) Examples of kernel functions ……………………………………………………….……120
4.3.4.3. Software Implementations …………………………………………………………121
4.3.5. Decision Tree classifiers …………………………………………………….………122
4.3.5.1. Algorithm presentation …………………………………………………….………122
4.3.5.2. When do we use decision tree learning? ………………………………………………124
4.3.5.3. ID3 algorithm ………………………………………………………………………124
a) Which Attribute Is the Best Classifier? …………………………………….……………124
b) How to create the decision tree using the ID3 algorithm? ……………………….…………128
4.3.5.4. C4.5 algorithm ………………………………………………………………..……130
a) How does C4.5 work? ………………………………………………………..………130
b) How are tests chosen? …………………………………………………...………………131
c) How is tree-growing terminated? ………………………………..………………………131
d) How are class labels assigned to the leaves? ……………………………………………131
4.3.6. Decision Rule Classifiers ……………………………………………...……………131
4.3.7. Regression Methods …………………………………………………………...……132
4.3.8. Artificial Neural Networks …………………………………………………..………132
4.3.9. Classifier Committees: Bagging and Boosting ……………………………..………133
A general boosting procedure ………………………………………………...…….134
4.4. Improving classifier performance ……………………………………………….……135
4.5. Evaluation of text classifiers …………………………………………………….……136
4.6. Performance Measures …………………………………………………………..……136
4.6.1. Recall, Precision and F-measure …………………………………………..…………136
4.6.2. Noise and Silence ……………………………………………………………...……137
4.6.3. Micro and Macro Average ……………………………………………….…………138
4.7. Benchmark Collections …………………………………………………….…………139
4.8. Comparison among Classifiers ………………………………………….……………139
4.9. Summary …………………………………………………………………………...…140
Chapter 5: Categorization of Semi-structured Documents (141 – 160)
5.1. Introduction ……………………………………………………………………….……142
5.2. The web and HTML documents …………………………………….…………………142
5.3. XML Semi-structured documents ………………………………………………………142
5.3.1. From a flat document to a structured document …………………………..…………142
5.3.2. XML language ……………………………………………………………………..…143
5.3.3. XML Document ………………………………………………………………………143
5.3.3.1. The DTD (Document Type Definition) ………………………………..……………144
5.3.3.2. XML DOM (XML Document Object Model) ……………………………..…………145
5.3.3.3. XPath ………………………………………………………………..………….…145
5.3.3.4. Types of XML documents …………………………………….………………….…145
a. Text-oriented XML documents ……………………………………………………..……145
b. Data-oriented XML documents …………………………………………………..………146
5.3.4. Semantics of tags …………………………………………………………………..…146
5.3.4.1. Hard tags ……………………………………………………………………..……146
5.3.4.2. Soft tags ……………………………………………………………………..………147
5.3.4.3. Jump tags …………………………………………………………………...………147
5.4. XML Mining …………………………………………………………………...………147
5.5. Structured information retrieval ……………………………..…………………………148
5.5.1. Problems related to the representation ………………………………………………148
5.5.2. The need to adapt the old models ……………………………….……………………148
5.5.3. The INEX initiative and the proposed solutions ………………………..……………149
5.5.3.1. The CO Queries (Content Only) ……………………………………………...……149
5.5.3.2. CAS queries (Content and Structure) ………………………………………...……149
5.5.3.3. Evaluation of structured IR systems ……………………………………………...…149
a. The component coverage dimension ………………………………………………..……149
b. The topical relevance dimension …………………………………………………………149
5.5.4. Problem of semi-structured document heterogeneity …………………………….…150
5.5.5. Querying and heterogeneity …………………………………………………………151
5.5.6. Conversion of document formats ………………………………………………….…153
5.6. Categorization of XML Semi-structured documents ……………………………...……153
5.6.1. Approaches based on the structure and the content …………………………….……153
5.6.1.1.[Yang and Zhang, 2007] approach …………………………………………………153
5.6.1.2. [de Campos et al., 2007] approach …………………………………………………154
5.6.2. Approaches based on the structure only ………………………………………..……157
5.6.2.1. [Zaki and Aggarwal, 2003] approach ………………………………….…………157
5.6.2.2. [Garboni et al., 2005] approach ………………………………………………...……158
5.7. Summary ……………………………………………………………………………….160
Part II: Contributions
Chapter 6: Identifying the language of a text in a collection of multilingual documents
6.1. Introduction …………………………………………………………………………….162
6.2. State of the art ………………………………………………………………………….162
6.2.1. Language Identification Approaches …………………………...……………………162
6.2.1.1. The linguistic approach …………………………………………...……………….162
6.2.1.2. The lexical approach ……………………………………………………………….162
6.2.1.3. The grammatical approach ………………………………………….…………….162
6.2.1.4. The statistical approach …………………………………………………..……….162
6.2.2. Principle of Text Segmentation into N-grams of Characters ………………...………163
6.2.3. Advantages of Text Segmentation into N-grams …………………………….………163
6.2.4. Methods Based on N-grams for Language Identification ……………………………163
6.2.4.1. Nearest Neighbors Methods …………………………………………………..……163
The Distance of Beesley …………………………………………………………..….163
The Distance of Cavnar and Trenkle ……………………………………..………164
The Kullback-Leibler Distance (KL) ……………………………………….……164
The Chi-square Distance (χ²) ……………………………………………………...…….164
6.2.4.2. Conventional methods used in categorization ………………………………..……165
6.3. Our proposed method …………………………………………………………………..165
6.4. Experiments …………………………………………………………………...….166
6.4.1. Training and Test Corpus …………………………………………..……………….166
6.4.2. Pre-processing Performed on the Training and Test Corpora …………….……………166
6.4.3. Processing Performed ………………………………………………………….…….166
6.5. Evaluation of obtained results ………………………………………………………….167
6.6. Conclusion and perspectives ……………………………………………………..…….169
Chapter 7: Contextual Categorization Using New Panoply of Similarity Metrics
7.1. Introduction ………………………………………………………….…………………171
7.2. State of the Art …………………………………………………………………………171
7.2.1. Language Identification and Documents Categorization ……………………………171
7.2.1.1. Language Identification ……………………………………………………………171
7.2.1.2. Automatic Categorization of Texts …………………………………………………171
7.2.2. Approaches of Texts Representation ………………………………………..………172
7.2.2.1. Representation with Bag of Words …………………………………………………172
7.2.2.2. Representation Based on N-grams …………………………………………………172
7.2.3. Methods of Texts Categorization …………………………………………….………173
7.2.3.1. Conventional Method ………………………………………………………………173
7.2.3.2. Nearest Neighbors Methods ………………………………………………..………173
7.2.4. Metrics of Similarity Used in Language Identification ………………………………173
7.3. Our Proposed Approach ……………………………………………………………..…174
7.3.1. Application of Metrics of Language Identification in Text Categorization ……….…174
7.3.2. Presentation of the New Method ………………………………………………..……174
7.4. Experiments ………………………………………………………………………175
7.4.1. Training and Test Corpus …………………………………………………………… 175
7.4.2. Pre-processing Performed on the Training and Test Corpora ……………….…………175
7.4.3. Processing Performed ………………………………………………………..………176
7.5. Evaluation of obtained Results ……………………………………………………...…176
7.5.1. Segmentation phase (Tokenization) ………………………………………..……… 176
7.5.2. Learning phase …………………………………………………………………….…177
7.5.3. Interpretation of the obtained results …………………………………...……………179
7.6. Conclusion and perspectives ………………………………………………...…………179
Chapter 8: Arabic Text Categorization: An Improved Stemming Algorithm to Increase
the Quality of Categorization (180 – 188)
8.1. Introduction ……………………………………………………………………...……..181
8.2. Related Work …………………………………………………………………………..182
8.3. The Proposed Algorithm ……………………………………………...………………..183
8.4. Experiments and Obtained Results ………………………………...……………..184
8.4.1. The Used Dataset …………………………………………………………...………..184
8.4.2. The Obtained Results …………………………………………………….…………..185
8.5. Comparison with Other Algorithms ……………………………………………………187
8.6. Discussion ……………………………………………………………………….……..188
8.7. Conclusion and Perspectives ………………………………………………………...…188
Conclusion …………………………………………………………………….…..…189 – 190
Appendices …………………………………………………………………….…….191 – 195
Bibliography …………………………………………………………………………196 - 207