Classification of Vietnamese Web Documents

  • Trần Ngọc Phúc
  • Phạm Trần Vũ
  • Phạm Công Xuyên
  • Nguyễn Vũ Duy Quang


This paper presents research results on using the Latent Dirichlet Allocation (LDA) algorithm, which analyzes the hidden topics present in documents, to extract important features of web documents for classification. The features are represented as noun phrases extracted from the document text using the vector model. In this model, each document is represented as a vector, and the weight of each element of the vector is calculated from its occurrence frequency. Classification is then based on the similarity of any two documents, which is calculated as the cosine of the angle between their representing vectors. In this paper, the Latent Dirichlet Allocation algorithm is used to extract hidden features of web documents for the similarity calculation, and it gives very accurate results. A prototype application has been built, and the experimental results show that the classification of news on Vietnamese websites achieved an accuracy of about 90%.
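
The vector model and cosine measure described above can be illustrated with a minimal sketch. It is not the authors' implementation: the noun phrases are made-up English placeholders, the helper names `term_frequency_vector` and `cosine_similarity` are hypothetical, and the LDA feature-extraction step is omitted, so raw occurrence counts stand in for the element weights.

```python
import math
from collections import Counter


def term_frequency_vector(phrases):
    """Map each phrase to its occurrence count in one document."""
    return Counter(phrases)


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse frequency vectors."""
    # Only phrases present in both documents contribute to the dot product.
    dot = sum(w * vec_b[p] for p, w in vec_a.items() if p in vec_b)
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: an empty document is similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical noun phrases, standing in for the output of a feature extractor.
doc_sports_1 = ["football match", "national team", "football match", "coach"]
doc_sports_2 = ["national team", "coach", "football match", "stadium"]
doc_finance = ["stock market", "interest rate", "central bank"]

vec_1 = term_frequency_vector(doc_sports_1)
vec_2 = term_frequency_vector(doc_sports_2)
vec_3 = term_frequency_vector(doc_finance)

sim_same_topic = cosine_similarity(vec_1, vec_2)
sim_cross_topic = cosine_similarity(vec_1, vec_3)

print(f"same topic:  {sim_same_topic:.3f}")
print(f"cross topic: {sim_cross_topic:.3f}")
```

Under these assumptions, the two sports documents share most of their phrases and score close to 1, while the sports and finance documents share none and score 0. A classifier built on this measure could, for example, assign a new document to the category of its most similar labeled documents.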
