Phân loại nội dung tài liệu Web tiếng Việt (Classification of Vietnamese Web Document Content)

Trần Ngọc Phúc; Phạm Trần Vũ; Phạm Công Xuyên; Nguyễn Vũ Duy Quang

Abstract

This paper presents research results on using the Latent Dirichlet Allocation (LDA) algorithm, which analyzes the hidden topics present in documents, to extract important features of web documents for classification. The features are noun phrases extracted from the document text and are represented using the vector space model, in which each document is a vector. The weight of each vector element is calculated from the occurrence frequency of the corresponding feature. Classification is then based on the similarity between any two documents, calculated as the cosine of the angle between their representing vectors. Using LDA to extract hidden features of web documents for this similarity calculation gave accurate results: a prototype application was built, and experiments on classifying news from Vietnamese websites showed an accuracy of about 90%.
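The vector representation and cosine similarity described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes whitespace-separated words stand in for the extracted noun phrases, uses raw term counts as weights, omits the LDA feature-extraction step entirely, and assigns the label of the single most similar labeled document. The function names and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_vector(text):
    """Map a document to a sparse term-frequency vector (term -> count).

    Assumption: whitespace tokens stand in for the noun phrases
    that the paper extracts as features.
    """
    return Counter(text.lower().split())


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_u * norm_v)


def classify(text, labeled_docs):
    """Return the label of the labeled document most similar to `text`."""
    query = tf_vector(text)
    best_label, _ = max(
        ((label, cosine_similarity(query, tf_vector(doc)))
         for label, doc in labeled_docs),
        key=lambda pair: pair[1],
    )
    return best_label


# Toy data, invented for illustration only.
labeled_docs = [
    ("sports", "football team wins match football league"),
    ("economy", "bank raises interest rate bank market"),
]
predicted = classify("league match football", labeled_docs)
```

In this sketch the query shares terms only with the first labeled document, so `classify` selects that document's label. In the paper's approach, the vectors would instead be built from features obtained through LDA.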

Classification of Vietnamese Web document

BỘ KHOA HỌC VÀ CÔNG NGHỆ - MINISTRY OF SCIENCE AND TECHNOLOGY OF VIETNAM

CỤC THÔNG TIN KHOA HỌC VÀ CÔNG NGHỆ QUỐC GIA - NATIONAL AGENCY FOR SCIENCE AND TECHNOLOGY INFORMATION