Enhancing retrieval performance of embedding models via fine-tuning on synthetic data in RAG chatbot for Vietnamese military science domain

Nguyen Xuan Bac; Luu Van Sang; Nguyen Duc Vuong; Luong Quoc Le; Dang Duc Thinh

Nguyen Xuan Bac, Institute of Information Technology and Electronics, Academy of Military Science and Technology
Luu Van Sang, Institute of Information Technology and Electronics, Academy of Military Science and Technology
Nguyen Duc Vuong, Institute of Information Technology and Electronics, Academy of Military Science and Technology
Luong Quoc Le, Institute of Information Technology and Electronics, Academy of Military Science and Technology
Dang Duc Thinh, Institute of Information Technology and Electronics, Academy of Military Science and Technology

Keywords: Retrieval-augmented generation; Fine-tuning; Synthetic data; Large language model; Chatbot.

Abstract

Retrieval-Augmented Generation (RAG) combines information retrieval with large language models, enabling chatbots to answer accurately by retrieving relevant documents from a data repository before generating a response. Although RAG chatbots have proven effective in many applications, they remain limited in specialized Vietnamese data domains, particularly military science. To address this limitation, this paper proposes a framework for fine-tuning embedding models on synthetic datasets generated by ChatGPT, with the aim of improving retrieval performance in a Q&A application on the history of the Institute of Information Technology (IoIT). We evaluate 11 popular embedding models and observe an average improvement of 18.15% in the MAP@K metric. The resulting IoIT history Q&A chatbot, built with the fine-tuned embedding models and the Vietnamese language model Vistral-7B, outperforms chatbots that use OpenAI's embedding models and ChatGPT. These findings highlight the potential of RAG chatbots for information retrieval applications in specialized fields such as military science.
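
For readers unfamiliar with the MAP@K metric reported in the abstract, the following is a minimal sketch of one common way to compute it. The abstract does not specify the exact variant or the value of K used, so the normalization by min(K, number of relevant documents) is an assumption, and the function names and toy document IDs are hypothetical.

```python
from typing import Hashable, Sequence, Set


def average_precision_at_k(
    ranked_ids: Sequence[Hashable], relevant_ids: Set[Hashable], k: int
) -> float:
    """AP@K for one query: precision at each relevant hit in the top K,
    summed and divided by min(K, number of relevant documents).
    The normalization is an assumption; other variants exist."""
    if not relevant_ids or k <= 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    seen = set()  # ignore duplicate IDs so a document is counted once
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids and doc_id not in seen:
            seen.add(doc_id)
            hits += 1
            precision_sum += hits / rank
    return precision_sum / min(k, len(relevant_ids))


def mean_average_precision_at_k(
    rankings: Sequence[Sequence[Hashable]],
    relevants: Sequence[Set[Hashable]],
    k: int,
) -> float:
    """MAP@K: the mean of AP@K over all queries."""
    if len(rankings) != len(relevants):
        raise ValueError("rankings and relevants must have the same length")
    if not rankings:
        return 0.0
    total = sum(
        average_precision_at_k(ranked, relevant, k)
        for ranked, relevant in zip(rankings, relevants)
    )
    return total / len(rankings)


# Toy example with two queries (hypothetical document IDs).
rankings = [["a", "b", "c"], ["x", "y", "z"]]
relevants = [{"a", "c"}, {"z"}]
score = mean_average_precision_at_k(rankings, relevants, k=3)
```

In this toy example the first query scores (1/1 + 2/3) / 2 and the second scores (1/3) / 1, so the mean is 7/12. Comparing this score for an embedding model before and after fine-tuning, on the same queries, is the kind of measurement the reported improvement refers to.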
