AUTOMATIC VIETNAMESE TEXT CLASSIFICATION BY TOPIC
Abstract
The Internet grows every day and carries an enormous amount of information. The need for data mining and knowledge discovery is growing with it, and text classification plays an important role in both. Many machine learning techniques have been applied to the classification process with good results. Algorithms commonly used for text classification today include Naïve Bayes, K-NN, SVM, and Maximum Entropy. In this paper, the Naïve Bayes, SVM, and K-NN algorithms were evaluated on Vietnamese text classification using 5 datasets covering 4 topics: Tourism, Entertainment, Education, and Law. The datasets were extracted from the vnexpress.net website. Some unique identifiers were applied during processing to increase classification accuracy. The results show that the SVM algorithm achieves the highest accuracy (over 90%) and the shortest execution time.
Keywords: Text classification, Naïve Bayes, K-NN, SVM, algorithm.
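
To make the classification task concrete, the following is a minimal sketch of a multinomial Naïve Bayes topic classifier with add-one smoothing, one of the three algorithms named in the abstract. It is not the implementation used in the experiments. The toy documents, the English tokens, the whitespace tokenization, and the function names train_naive_bayes and classify are all assumptions made for illustration; real Vietnamese text would first need word segmentation.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs):
    """Collect per-topic document and word counts from (tokens, label) pairs."""
    label_counts = Counter()            # number of documents per topic
    word_counts = defaultdict(Counter)  # word frequencies per topic
    vocab = set()                       # all words seen in training
    for tokens, label in docs:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab


def classify(model, tokens):
    """Return the topic with the highest log-posterior for the given tokens."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        # log prior P(topic)
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            if tok not in vocab:
                continue  # words never seen in training carry no evidence
            # log likelihood P(word | topic) with add-one (Laplace) smoothing
            count = word_counts[label][tok]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy corpus: one tiny document for each of the four topics.
TOY_DOCS = [
    ("beach hotel tour travel".split(), "Tourism"),
    ("movie singer concert film".split(), "Entertainment"),
    ("school student exam teacher".split(), "Education"),
    ("court law judge decree".split(), "Law"),
]

model = train_naive_bayes(TOY_DOCS)
print(classify(model, "student exam".split()))
```

The same training pairs could be fed to an SVM or K-NN classifier to compare accuracy and execution time, which is the kind of comparison the paper reports.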