AUTOMATIC IMAGE CAPTIONING SYSTEM FOR VISUALLY IMPAIRED PEOPLE
Abstract
Visual impairment poses significant challenges in recognising and interacting with the surrounding environment. To address this issue, this study proposes a cross-platform automatic image captioning system. The model follows an encoder–decoder architecture, in which DenseNet extracts visual features and an LSTM network combined with an attention mechanism generates natural language descriptions. The proposed method is trained and evaluated on two benchmark datasets, MS COCO and Flickr30K, using widely adopted metrics such as BLEU and METEOR. Experimental results show that the system achieves higher accuracy than several recently published approaches. Furthermore, a practical application has been developed for both desktop and mobile platforms, producing audio descriptions of images and thereby improving access to visual information for visually impaired users.
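To make the described encoder–decoder pipeline concrete, the following is a minimal sketch, assuming PyTorch and torchvision; the DenseNet-201 backbone, layer sizes, and the additive (Bahdanau-style) attention are illustrative assumptions, not the exact configuration reported in the paper.

```python
# Sketch of a DenseNet encoder feeding an attention-equipped LSTM decoder.
# Hyperparameters (embed_dim, hidden_dim) are placeholders for illustration.
import torch
import torch.nn as nn
import torchvision.models as models


class DenseNetEncoder(nn.Module):
    """Extracts a grid of region features from an image with DenseNet."""
    def __init__(self):
        super().__init__()
        densenet = models.densenet201(weights=None)   # pretrained weights optional
        self.features = densenet.features             # convolutional feature extractor only

    def forward(self, images):                        # images: (B, 3, H, W)
        fmap = self.features(images)                  # (B, C, h, w), C = 1920 for DenseNet-201
        B, C, h, w = fmap.shape
        return fmap.view(B, C, h * w).permute(0, 2, 1)  # (B, h*w, C) spatial "regions"


class AttentionLSTMDecoder(nn.Module):
    """LSTM decoder that attends over encoder regions at every time step."""
    def __init__(self, vocab_size, feat_dim=1920, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.att_feat = nn.Linear(feat_dim, hidden_dim)
        self.att_hid = nn.Linear(hidden_dim, hidden_dim)
        self.att_score = nn.Linear(hidden_dim, 1)
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, feats, captions):               # feats: (B, R, feat_dim); captions: (B, T)
        B, T = captions.shape
        h = feats.new_zeros(B, self.lstm.hidden_size)
        c = feats.new_zeros(B, self.lstm.hidden_size)
        logits = []
        for t in range(T):
            # Additive attention: score each region against the current hidden state.
            scores = self.att_score(
                torch.tanh(self.att_feat(feats) + self.att_hid(h).unsqueeze(1))
            )                                          # (B, R, 1)
            alpha = torch.softmax(scores, dim=1)       # attention weights over regions
            context = (alpha * feats).sum(dim=1)       # (B, feat_dim) weighted visual context
            x = torch.cat([self.embed(captions[:, t]), context], dim=1)
            h, c = self.lstm(x, (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)              # (B, T, vocab_size) word scores
```

In this sketch the decoder is trained with teacher forcing on ground-truth captions; at inference time the caption would instead be generated token by token (e.g. greedily or with beam search) and then passed to a text-to-speech component to produce the audio description mentioned in the abstract.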