PHÂN LỚP VĂN BẢN TIẾNG VIỆT TỰ ĐỘNG THEO CHỦ ĐỀ

Mạnh Thiên Lý; Vũ Văn Vinh; Nguyễn Văn Lễ; Lâm Thị Họa Mi; Nguyễn Thị Thanh Thủy; Dương Thị Mộng Thùy

Abstract

The Internet grows every day and carries a huge amount of information. The need for data mining and knowledge discovery is growing with it, and text classification plays an important role in both. Many machine learning techniques have been applied to classification and have achieved good results. Many algorithms are now used for text classification, such as Naïve Bayes, K-NN, SVM, and Maximum Entropy. In this paper, the Naïve Bayes, SVM, and K-NN algorithms were used in experiments on Vietnamese text classification with 5 datasets belonging to 4 different topics: Tourism, Entertainment, Education, and Law. The datasets were extracted from the vnexpress.net website. Some unique identifiers were applied during processing to increase classification accuracy. The results show that the SVM algorithm has the highest accuracy (over 90%) and the shortest execution time.

Keywords: Text classification, Naïve Bayes, K-NN, SVM, algorithm.

AUTOMATIC VIETNAMESE TEXT CLASSIFICATION BY TOPIC

BỘ KHOA HỌC VÀ CÔNG NGHỆ - MINISTRY OF SCIENCE AND TECHNOLOGY OF VIETNAM

CỤC THÔNG TIN, THỐNG KÊ - NATIONAL AGENCY FOR SCIENCE AND TECHNOLOGY INFORMATION AND STATISTICS