|
| Title: |
RAG-Based Chatbot for Departmental Documents |
| Document type: |
electronic document |
| Authors: |
Lydia Djaouzi ; Ouafa Chahinez Berarma, Author ; Kharchi, Samia, Thesis supervisor |
| Publisher: |
Setif:UFA |
| Year of publication: |
2025 |
| Extent: |
1 vol. (96 leaves) |
| Format: |
29 cm |
| Languages: |
English (eng) |
| Categories: |
Theses & Dissertations: Computer Science
|
| Keywords: |
RAG-Based
Chatbot |
| Decimal index: |
004 Computer science |
| Abstract: |
This thesis presents a novel approach to analyzing the influence of global events on international
trade transactions by integrating and exploiting data from the GDELT project.
The research develops a comprehensive methodology centered around the construction
and utilization of a knowledge graph.
Initially, a pre-trained language model (LLaMA 3.2) is fine-tuned on a proprietary
dataset of country-level transactions to establish a baseline understanding of trade patterns.
Concurrently, an ontology is conceptualized for GDELT features using a semantic
clustering approach based on the all-MiniLM-L6-v2 sentence-embedding model,
with relationships defined in the Web Ontology Language (OWL).
An automated process is then implemented to download and populate this ontology
with data from the official GDELT source, forming an initial knowledge graph.
This graph is subsequently enriched through an iterative process that employs the
facebook/bart-large-mnli model, which analyzes not only primary GDELT articles but
also the articles they reference, in order to identify new features or fill in missing information
within the database.
Finally, leveraging Graph Neural Network techniques, the project enables the extraction
of event-specific subgraphs from the knowledge graph based on user prompts,
allowing detailed queries that assess the causal influence and consequences of specific
events on trade transactions.
This work provides a robust framework for understanding the complex dynamics between
global events and economic transactions, offering enhanced analytical and potentially
predictive capabilities. |
| Contents note: |
Table of contents
List of Abbreviations 10
1 Background 14
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.2 GDELT dataset and its use as a geopolitical signal . . . . . . . . . . . . 14
1.3 UN Comtrade dataset and its relevance in international trade . . . . . . 15
1.4 Large Language Models and Core Techniques . . . . . . . . . . . . . . . 17
1.4.1 Defining Large Language Models (LLMs) . . . . . . . . . . . . . . 17
1.4.2 Techniques for Adapting LLMs . . . . . . . . . . . . . . . . . . . 17
1.4.2.1 Fine-Tuning with QLoRA . . . . . . . . . . . . . . . . . 17
1.4.2.2 Prompt Engineering . . . . . . . . . . . . . . . . . . . . 17
1.4.3 Foundational Models Employed . . . . . . . . . . . . . . . . . . . 18
1.4.3.1 Generative Model: LLaMA 3 . . . . . . . . . . . . . . . 18
1.4.3.2 Specialized Models for NLP Tasks . . . . . . . . . . . . 18
1.5 Retrieval-Augmented Generation (RAG) . . . . . . . . . . . . . . . . . . 19
1.5.1 The RAG Paradigm . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.5.2 Key RAG Architectures . . . . . . . . . . . . . . . . . . . . . . . 19
1.6 Graph-Based Knowledge Representation . . . . . . . . . . . . . . . . . . 20
1.6.1 An Introduction to Knowledge Graphs (KGs) . . . . . . . . . . . 21
1.6.2 Graph Neural Networks (GNNs) for Graph Embeddings . . . . . 21
1.7 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.7.1 Generation Accuracy Metrics . . . . . . . . . . . . . . . . . . . . 22
1.7.2 Retrieval Quality Metrics . . . . . . . . . . . . . . . . . . . . . . . 22
1.8 System Configuration Parameters . . . . . . . . . . . . . . . . . . . . . . 24
1.9 Baselines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
1.9.1 Prediction Assessment Metrics . . . . . . . . . . . . . . . . . . . . 25
1.9.2 Cache Performance Metrics . . . . . . . . . . . . . . . . . . . . . 25
1.10 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2 State of the art 28
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.2 Machine Learning and Deep Learning Methods for Economic Time Series Forecasting . . . . . . . . . . 28
2.3 Transformer Models for Trade and Economic Prediction . . . . . . . . . . 29
2.4 Large Language Models in Economic Forecasting . . . . . . . . . . . . . 30
2.5 Retrieval-Augmented Generation and Knowledge-Enriched Forecasting Systems . . . . . . . . . . 31
2.6 GDELT Dataset for News-Aware Trade Forecasting . . . . . . . . . . . . 32
2.7 Gaps and Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3 System Architecture 34
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.1.1 High-Level System Architecture . . . . . . . . . . . . . . . . . . . 34
3.2 LLaMA fine-tuning and setup . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2.1 Base Model Selection . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2.2 Dataset Preprocessing for Fine-Tuning . . . . . . . . . . . . . . . 36
3.2.3 Prompt-Completion Dataset Generation . . . . . . . . . . . . . . 37
3.2.4 Fine-Tuning Methodology: QLoRA . . . . . . . . . . . . . . . . . 37
3.3 Prompt cache mechanism . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.4 GDELT-to-Graph transformation: how we build the KG . . . . . . . . . 39
3.4.1 The GDELT database . . . . . . . . . . . . . . . . . . . . . . . . 39
3.4.1.1 Custom API for Automated GDELT Data Retrieval and Integration . . . . . . . . . . 40
3.4.2 Ontology building . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.4.2.1 Semantic Clustering of GDELT Features . . . . . . . . . 41
3.4.2.2 Ontology core concept . . . . . . . . . . . . . . . . . . . 44
3.4.2.3 Ontology Construction from Semantic Clusters . . . . . 45
3.4.2.4 Ontology Population and Knowledge Graph Creation . . 49
3.4.3 Ontology Augmentation . . . . . . . . . . . . . . . . . . . . . . . 50
3.4.3.1 Web Scraping and Content Preprocessing . . . . . . . . 51
3.4.3.2 Enriching Null Values . . . . . . . . . . . . . . . . . . . 51
3.4.3.3 Expanding Features . . . . . . . . . . . . . . . . . . . . 52
3.5 Prompt-Driven Subgraph Extraction and Composition . . . . . . . . . . 53
3.6 Prompt Engineering for LLM Tool Selection . . . . . . . . . . . . . . . . 56
3.6.1 Anatomy of the System Prompt . . . . . . . . . . . . . . . . . . . 56
3.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4 Use Case: Trade Forecasting 59
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.1.1 Core Capability: CAMEO code handling . . . . . . . . . . . . . . . 59
4.2 Analytical Functions and Event Ranking . . . . . . . . . . . . . . . . . . 60
4.2.1 Core Analytical Capabilities . . . . . . . . . . . . . . . . . . . . . 60
4.2.2 Event Significance Examples (Goldstein Scale) . . . . . . . . . . . 61
4.3 Core Capability: Graceful Handling of Missing Parameters . . . . . . . . 61
4.4 Core Capability: Advanced Data Aggregation and Analysis . . . . . . . . 63
4.5 Use Case: Time-Series Forecasting . . . . . . . . . . . . . . . . . . . . . . 65
4.6 Use Case: Causal Event Analysis . . . . . . . . . . . . . . . . . . . . . . 67
4.7 Ablation Study: Evaluating the Raw Language Model . . . . . . . . . . . 69
4.7.1 Raw Model Performance on Trade Queries . . . . . . . . . . . . . 70
4.7.2 Analysis of Raw Model Limitations . . . . . . . . . . . . . . . . . 71
4.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5 Results and Discussion 73
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.2 Quantitative results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.2.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.2.1.1 Case Study Selection . . . . . . . . . . . . . . . . . . . . 73
5.2.1.2 Evaluation Scenarios . . . . . . . . . . . . . . . . . . . . 73
5.2.1.3 Model Configuration and Training . . . . . . . . . . . . 74
5.2.2 Case Study 1: Chemical Products (HS 3824) . . . . . . . . . . . . 74
5.2.2.1 Scenario 1: Global Trade Value . . . . . . . . . . . . . . 74
5.2.2.2 Scenario 2: Spanish Exports . . . . . . . . . . . . . . . . 74
5.2.2.3 Scenario 3: Spain-Portugal Bilateral Trade . . . . . . . . 75
5.2.3 Case Study 2: Petroleum Oils . . . . . . . . . . . . . . . . . . . . 76
5.2.3.1 Scenario 1: Global Trade Value . . . . . . . . . . . . . . 76
5.2.3.2 Scenario 2: Netherlands Exports . . . . . . . . . . . . . 76
5.2.3.3 Scenario 3: Lithuania-Slovakia Bilateral Trade . . . . . . 77
5.2.4 Interpretation of Results . . . . . . . . . . . . . . . . . . . . . . . 78
5.3 Qualitative Analysis: The Impact of GDELT Context . . . . . . . . . . . 79
5.3.1 Raw LLM vs. GDELT-Enhanced Analysis . . . . . . . . . . . . . 79
5.3.2 The System Without GDELT Context . . . . . . . . . . . . . . . 80
5.4 Case Study: Geopolitical Tension → Trade Disruption . . . . . . . . . . 81
5.5 Efficiency of using cache mechanism . . . . . . . . . . . . . . . . . . . . . 84
5.5.1 Impact of Cosine Similarity Threshold on Cache Performance . . 84
5.5.1.1 Cache Hit Ratio (CHR) Variation Across Thresholds . . 84
5.5.1.2 Relevance Score Trends Across Thresholds . . . . . . . . 85
5.5.1.3 Strategic Insights and Recommendations . . . . . . . . . 86
5.5.2 Impact of Retrieval Parameters on Cache and Relevance Scores . 86
5.5.2.1 Effect of TOP K, CHUNK TOP K, and RELATED CHUNK NUMBER . . . . . . . . . . 87
5.5.2.2 Effect of Re-ranking Threshold . . . . . . . . . . . . . . 89
5.5.2.3 Illustrative Scenario: Tuning for Precision . . . . . . . . 90
5.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.7 Summary of Work and Contributions . . . . . . . . . . . . . . . . . . . . 91
5.8 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.9 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
Bibliography 93 |
| Call number: |
MAI/1065 |
RAG-Based Chatbot for Departmental Documents [electronic document] / Lydia Djaouzi ; Ouafa Chahinez Berarma, Author ; Kharchi, Samia, Thesis supervisor. - [S.l.] : Setif:UFA, 2025. - 1 vol. (96 leaves) ; 29 cm. Languages: English (eng)
|