Online catalogue

Ferhat Abbas University Sétif 1, Faculty of Sciences

Lexical complexity of Arabic texts using machine learning techniques / Chenni, Ghozlene

Title:	Complexité lexicale des textes arabes par les techniques d'apprentissage automatique (Lexical complexity of Arabic texts using machine learning techniques)
Document type:	printed text
Authors:	Chenni, Ghozlene, Author; Sadik Bessou, Thesis supervisor
Publisher:	Sétif: UFA
Publication year:	2019
Languages:	French (fre)
Categories:	Theses & Dissertations: Computer Science
Keywords:	Natural language processing; Arabic text classification; Arabic lexical complexity; Feature extraction; Machine learning
Decimal index:	004 Computer science
Abstract:	Arabic is one of the oldest and most complex languages in the world, and it is still in use today. Because of this complexity, it poses challenges for many natural language processing applications. In this thesis, we present the details of collecting and building a large dataset (corpus) of Arabic texts. The techniques used to preprocess the collected data are explained. We present our four classes: ancient, Islamic, recent, and children's. Several machine learning algorithms were used to classify the texts: Multinomial Naive Bayes, Bernoulli Naive Bayes, Logistic Regression, Support Vector Machine, and Random Forest. An N-gram model was also proposed in which documents are classified on the basis of everygrams, unigrams, bigrams, and unigrams and bigrams together. The best accuracy we obtained using CountVectorizer was 86.47% with the Multinomial Naive Bayes classifier, and 87.2% using TfidfVectorizer with the Support Vector Machine classifier using everygrams.
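The four n-gram feature sets named in the abstract can be illustrated with a minimal sketch. This is not the thesis's code: the helper names `ngrams` and `everygrams`, the whitespace tokenization, the example phrase, and the maximum n-gram length of 3 are all assumptions made for illustration (the abstract does not state the maximum length used). NLTK, which the thesis lists among its frameworks, provides a function of the same name in `nltk.util`; the pure-Python version below only shows the idea.

```python
from typing import Iterator, List, Sequence, Tuple


def ngrams(tokens: Sequence[str], n: int) -> Iterator[Tuple[str, ...]]:
    """Yield every contiguous run of n tokens (nothing if n > len(tokens))."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])


def everygrams(tokens: Sequence[str], max_n: int = 3) -> List[Tuple[str, ...]]:
    """Collect all n-grams for n = 1..max_n (max_n=3 is an assumed default)."""
    grams: List[Tuple[str, ...]] = []
    for n in range(1, max_n + 1):
        grams.extend(ngrams(tokens, n))
    return grams


# Illustrative phrase, split on whitespace (an assumed, simplistic tokenizer).
tokens = "اللغة العربية لغة قديمة".split()

unigrams = list(ngrams(tokens, 1))       # single words
bigrams = list(ngrams(tokens, 2))        # adjacent word pairs
uni_and_bi = unigrams + bigrams          # unigrams and bigrams together
every = everygrams(tokens, max_n=3)      # all n-grams up to length 3
```

Each feature set would then be turned into count or TF-IDF vectors before classification; everygrams simply give the classifier short and long word sequences at the same time.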
Content note:	Table of contents
Abstract
Acknowledgements
1 Introduction
  1.1 Introduction
  1.2 Thesis organization
2 Natural Language Processing
  2.1 Introduction
  2.2 Natural Language Processing (NLP)
  2.3 NLP components
  2.4 Levels of NLP
    2.4.1 Phonology
    2.4.2 Morphology
    2.4.3 Lexical
    2.4.4 Syntax
    2.4.5 Semantics
    2.4.6 Pragmatics
    2.4.7 Discourse
  2.5 Natural Language Processing Applications
    2.5.1 Information Retrieval
    2.5.2 Speech Recognition (SR)
    2.5.3 Information Extraction (IE)
    2.5.4 Spam Filtering
    2.5.5 Question-Answering
    2.5.6 Summarization
    2.5.7 Machine Translation
    2.5.8 Dialogue Systems
    2.5.9 Text Categorization
    2.5.10 Medicine
  2.6 Conclusion
3 Machine Learning
  3.1 Introduction
  3.2 Machine Learning
  3.3 Definitions
    3.3.1 Alan Turing
    3.3.2 Arthur Samuel
    3.3.3 Tom Mitchell
    3.3.4 Sebastian Raschka
  3.4 Machine Learning Process
  3.5 Machine Learning Categories
    3.5.1 Supervised Learning
      3.5.1.1 Classification
      3.5.1.2 Regression
    3.5.2 Unsupervised Learning
      3.5.2.1 Clustering
      3.5.2.2 Dimensionality Reduction
    3.5.3 Reinforcement Learning
  3.6 Machine Learning Algorithms
    3.6.1 Naive Bayes Classifiers
      3.6.1.1 Multinomial Naive Bayes
      3.6.1.2 Bernoulli Naive Bayes
    3.6.2 Decision Tree
      3.6.2.1 Decision Tree Representation
    3.6.3 Random Forests
      3.6.3.1 The General Idea
    3.6.4 Logistic Regression
    3.6.5 Support Vector Machine
      3.6.5.1 Basic concept
      3.6.5.2 Linear Support Vector Machines
      3.6.5.3 The Non-Separable Case
    3.6.6 Artificial Neural Networks
      3.6.6.1 Basics Of Artificial Neural Networks
      3.6.6.2 Neural Networks Types
      3.6.6.3 Activation Functions
  3.7 Conclusion
4 Arabic Lexical Complexity
  4.1 Introduction
  4.2 The Arabic Language
  4.3 Characteristics of Arabic Language
  4.4 Arabic Text classification
  4.5 Arabic complexity
  4.6 Conclusion
5 Datasets And Implementation Frameworks
  5.1 Introduction
  5.2 Proposed System implementation
    5.2.1 Data Collection
      5.2.1.1 Dataset statistics
    5.2.2 Data Preprocessing
    5.2.3 Training and test sets
    5.2.4 Features Extraction
      5.2.4.1 CountVectorizer
      5.2.4.2 TfidfVectorizer
      5.2.4.3 N-grams
    5.2.5 Classification
  5.3 Implementation Frameworks
    5.3.1 Python
    5.3.2 Jupyter Notebook
    5.3.3 TensorFlow
    5.3.4 Keras
    5.3.5 NLTK
    5.3.6 Pandas
    5.3.7 Scikit-learn
    5.3.8 Matplotlib
    5.3.9 Seaborn
  5.4 Conclusion
6 Results And Discussion
  6.1 Introduction
  6.2 Evaluation metrics of performance
    6.2.1 Accuracy
    6.2.2 Precision
    6.2.3 Recall
    6.2.4 F-score
  6.3 Results and evaluation
    6.3.1 Results using CountVectorizer
      6.3.1.1 Using everygrams
      6.3.1.2 Using Unigrams
      6.3.1.3 Using Bigrams
      6.3.1.4 Using Unigrams and Bigrams
    6.3.2 Summary
    6.3.3 Results using TfidfVectorizer
      6.3.3.1 Using everygrams
      6.3.3.2 Using Unigrams
      6.3.3.3 Using Bigrams
      6.3.3.4 Using Unigrams and Bigrams
    6.3.4 Summary
    6.3.5 Testing the classifier
  6.4 Conclusion
7 Conclusion
  7.1 Conclusion
  7.2 Future Works
Bibliography
Call number:	MAI/0298
Online:	https://drive.google.com/file/d/1ceh0in6uDQMW_m9_7t3zxipgLKgjvYcx/view?usp=shari [...]
Electronic resource format:	pdf

Copies (1)

Barcode	Call number	Medium	Location	Section	Availability
MAI/0298	MAI/0298	Thesis	Sciences Library	French	Available

Address

University Sétif 1, Faculty of Sciences, El Bez, Sétif
19000 Sétif
Algeria

Opening hours:

Sunday:	8:00-16:30
Monday:	8:00-16:30
Tuesday:	8:00-16:30
Wednesday:	8:00-16:30
Thursday:	8:00-16:30