| Title: | Towards a robust natural language understanding: Bridging the low-resource gap in Algerian dialect through MSA knowledge transfer |
| Authors: | Mafaza Chabane, Author; Fouzi Harrag, Thesis supervisor |
| Document type: | Electronic document |
| Publisher: | Sétif: Université Ferhat Abbas, Faculté des Sciences, Département d'Informatique, 2025 |
| ISBN/ISSN/EAN : | E-TH/2524 |
| Format: | 1 vol. (170 leaves) / color ill. |
| General note: | Includes bibliography |
| Languages: | French |
| Categories: | |
| Abstract: |
Natural Language Processing (NLP) technologies have seen remarkable progress in recent years, unlocking new possibilities across domains such as education, healthcare, and social media. However, this progress remains largely confined to high-resource languages, leaving low-resource varieties, particularly Arabic dialects such as Algerian Arabic, underrepresented. These dialects face compounded challenges: a lack of standardized orthography, code-switching with French and Modern Standard Arabic (MSA), rich morphological structures, and a persistent scarcity of annotated data. This thesis addresses these limitations by using cross-lingual transfer learning and multitask learning to bridge the resource divide between MSA and Algerian Arabic. Central to this research is the hypothesis that the linguistic proximity between MSA and Algerian Arabic can be systematically exploited to improve NLP model performance on dialectal tasks. To validate this, a series of experiments was conducted, beginning with evaluations of classical machine learning models and pre-trained transformer architectures, which revealed their limitations when applied to unstructured dialectal data. These observations motivated the development of two novel computational frameworks tailored to low-resource scenarios. The first contribution, WASL-DI, is a hybrid dialect identification system that combines contextual embeddings from the CAMeLBERT MSA model with semantic representations derived from FastText. This dual-path architecture captures both deep contextual and subword-level features, making it robust to the noise and lexical variation common in informal dialectal content. It achieved a peak accuracy of 99.24% on the dataset used, outperforming benchmark models such as DziriBERT and MDA-BERT.
The second major contribution is SILAA-SA, a multitask learning framework for sentiment analysis. It incorporates a Mixture of Experts (MoE) mechanism to dynamically share knowledge between MSA and dialectal inputs. The model uses shared layers for general language understanding and task-specific experts for capturing dialectal nuances, ensuring efficient knowledge transfer without semantic interference. Extensive experiments across multiple dialectal sentiment datasets show that SILAA-SA outperforms traditional single-task models and adapts well to cross-domain tasks such as fake news detection. It achieved 86.81% accuracy on the FASSILA dataset and outperformed existing models on several dialectal benchmarks, including MAC, MYC, TSAC, and ArSarcasm-v2. Ablation studies further confirmed that architectural components such as embedding fusion and the MoE design contribute to the performance gains. Beyond model accuracy, this work addresses broader concerns in dialectal NLP by providing reproducible pipelines, annotated datasets, and adaptable architectures to support future research in Arabic and other low-resource languages. The research also promotes inclusivity by extending NLP capabilities to communities often excluded from technological advances. By demonstrating how high-resource language assets such as MSA can be repurposed for low-resource dialects, the thesis sets out scalable and practical strategies that address real-world data limitations. In conclusion, this work makes substantial contributions to cross-lingual and multitask learning for dialectal Arabic and lays a robust foundation for further exploration in areas such as equitable language technology. The outcomes have direct implications for building more inclusive AI systems capable of understanding the diverse linguistic landscape of the Arabic-speaking world. |
| Call number: |
E-TH/2524 |
| Online: | http://dspace.univ-setif.dz:8888/jspui/retrieve/12965/2524.pdf |
Copies (1)
| Call number | Medium | Location | Availability |
|---|---|---|---|
| E-TH/2524 | Thesis | Central library | Available |