Enhancing retrieval performance of embedding models via fine-tuning on synthetic data in RAG chatbot for Vietnamese military science domain
Abstract
Retrieval-Augmented Generation (RAG) combines information retrieval with large language models: a chatbot first retrieves relevant documents from a data repository and then generates its response from them, which improves answer accuracy. Although RAG chatbots have proven effective in many applications, they remain limited in specialized Vietnamese-language domains, particularly military science. To address this challenge, this paper proposes a framework for fine-tuning embedding models on synthetic datasets generated by ChatGPT, with the aim of improving retrieval performance in a Q&A application focused on the history of the Institute of Information Technology (IoIT). We evaluate 11 popular embedding models and report a significant average improvement of 18.15% in the MAP@K metric. The resulting IoIT history Q&A chatbot, built with the fine-tuned embedding models and the Vietnamese language model Vistral-7B, outperforms chatbots that use OpenAI's embedding models and ChatGPT. These findings highlight the potential of RAG chatbots for advancing information retrieval applications in specialized fields such as military science.
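For readers unfamiliar with the MAP@K metric reported above, the following is a minimal sketch of one common way to compute it. The abstract does not state the exact variant used, so the normalization by min(number of relevant documents, K), the function names, and the toy document IDs are all assumptions made for illustration. Retrieved lists are assumed to contain no duplicate IDs.

```python
from typing import Hashable, Sequence, Set


def average_precision_at_k(
    retrieved: Sequence[Hashable], relevant: Set[Hashable], k: int
) -> float:
    """Average precision over the top-k retrieved items for one query.

    Assumption: normalized by min(len(relevant), k), one common convention.
    """
    if not relevant or k <= 0:
        return 0.0
    hits = 0
    score = 0.0
    for rank, doc_id in enumerate(retrieved[:k], start=1):
        if doc_id in relevant:
            hits += 1
            score += hits / rank  # precision at this rank
    return score / min(len(relevant), k)


def map_at_k(
    all_retrieved: Sequence[Sequence[Hashable]],
    all_relevant: Sequence[Set[Hashable]],
    k: int,
) -> float:
    """Mean of the per-query average precision at k."""
    scores = [
        average_precision_at_k(retrieved, relevant, k)
        for retrieved, relevant in zip(all_retrieved, all_relevant)
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Toy example with hypothetical document IDs (not data from the paper).
retrieved_lists = [["d1", "d2", "d3"], ["d4", "d5"]]
relevant_sets = [{"d1", "d3"}, {"d5"}]
score = map_at_k(retrieved_lists, relevant_sets, k=3)
```

In this toy example the first query scores (1/1 + 2/3) / 2 = 5/6 and the second scores (1/2) / 1 = 1/2, so the mean is 2/3. Comparing this value before and after fine-tuning an embedding model, on the same queries, gives the kind of improvement figure quoted in the abstract.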