Research and test setup of the open-source platforms Apache Hadoop and Tesseract OCR for a big data system in a higher education environment

  • Duyen My Trinh
  • Trung Tran
  • Thang Tran Van
  • Thu Nguyen Xuan
  • Phong Hai Bui
Keywords: Apache Hadoop, Tesseract OCR, Big Data, Document Digitization, Storage

Abstract

This paper presents the research and experimental setup of an open-source Big Data and OCR system built on Apache Hadoop and Tesseract OCR. The primary objective is to digitize and securely store information technology documents, ensuring both efficient storage and accurate retrieval. The study evaluates the system's effectiveness and applicability in practical environments. Results indicate that the proposed system improves document storage, access, and retrieval while streamlining workflows and reducing costs. The paper also discusses challenges encountered during implementation and proposes targeted improvements to system performance, scalability, and adaptability. Future work focuses on refining data processing capabilities, improving OCR accuracy, and extending the system to handle a broader range of document types and sizes, with the aim of making it a robust solution for large-scale document management.

Published on
2025-12-24