Enhancing Image Captioning via Knowledge Graph Integration and R-CNN Network
Abstract
In the era of digitalization, automatic image captioning plays a crucial role in image processing, yet traditional methods are limited in their understanding of semantic context. This study proposed a model that integrates R-CNN with knowledge graphs to enhance caption accuracy and completeness. The methodology involved constructing knowledge graphs from diverse sources such as ImageNet and COCO and integrating them into the R-CNN pipeline via CNN feature extraction, Selective Search region proposals, softmax classification, and bounding-box regression. Data collection used web crawlers and APIs; preprocessing removed noise and normalized images to 224 × 224. Training used an 80:20 train/test split, the SGD optimizer, a learning rate of 0.001, and 50 epochs. Experimental results achieved 96% accuracy and an average IoU of 0.75 on 2,000 test images, outperforming the traditional R-CNN (85% accuracy; 0.6 IoU) owing to the graph semantics, which reduced errors in complex images. The novel contribution of the present study is a semantic integration mechanism that improves flexibility for applications in image management, healthcare, and mobile apps. Future directions include optimizing computational complexity, adding multilingual support, and combining the model with NLP methods such as BERT. The research demonstrated potential for expansion and recommended real-world deployment and dynamic graph updates. The computational complexity of the model increases slightly (~20%) compared to the traditional R-CNN, yet it still ensures real-time processing capability. Compared to Fast R-CNN and YOLO, the proposed model delivers significantly higher accuracy despite a higher computational cost, demonstrating a reasonable trade-off between performance and speed.
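The IoU metric reported above measures bounding-box overlap between predicted and ground-truth regions. A minimal sketch of how it is typically computed is shown below; the function name `iou` and the `[x1, y1, x2, y2]` box format are illustrative assumptions, not details taken from the paper.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes in [x1, y1, x2, y2] form."""
    # Coordinates of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    # Clamp to zero when the boxes do not overlap
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Identical boxes give IoU = 1.0; the pair below overlaps in a 5x5 region
print(iou([0, 0, 10, 10], [0, 0, 10, 10]))  # 1.0
print(iou([0, 0, 10, 10], [5, 5, 15, 15]))  # 25 / 175 ≈ 0.143
```

An average IoU of 0.75, as reported for the proposed model, would mean predicted boxes share on average three quarters of their combined area with the ground truth.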