A New Model for Automatic Text Classification
DOI:
https://doi.org/10.30564/ese.v3i1.3170
Abstract
In this paper, a new method for the automatic classification of texts is presented. The system has two phases: text processing and text categorization. In the first phase, several indexing criteria (bigram, trigram and quad-gram) are used to extract features. In the second phase, the W-SMO machine learning algorithm is used to train the system. To evaluate and compare the results in terms of the two criteria of accuracy and readability, Macro-F1 and Micro-F1 were calculated for the different indexing methods. Experiments on 7676 standard Reuters text documents showed that the best performance was obtained by W-SMO with the bigram criterion, with an accuracy of 95.17 (micro) and 79.86 (macro). The results also indicated that the proposed method outperforms the W-J48, Naïve Bayes, K-NN and Decision Tree algorithms.
Keywords:
Text classification; Machine learning; W-SMO; N-gram
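The abstract names n-gram indexing and the Micro-F1 and Macro-F1 measures without showing how they are computed. The following is a minimal Python sketch of both, not the authors' implementation. It assumes word-level n-grams over whitespace tokens (the abstract does not say whether the n-grams are word-level or character-level) and single-label classification. The function names `word_ngrams` and `micro_macro_f1` and the sample labels are hypothetical.

```python
from collections import Counter


def word_ngrams(text, n):
    """Return the word-level n-grams of `text`.

    Assumption: tokens are lowercased, whitespace-separated words.
    """
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def micro_macro_f1(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0, 0.0

    # Per-class true positives, false positives and false negatives.
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1
            fn[truth] += 1

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Macro-F1: average the per-class F1 scores (each class counts equally).
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro-F1: pool the counts over all classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return micro, macro


if __name__ == "__main__":
    print(word_ngrams("oil prices rise sharply", 2))
    # Hypothetical labels, for illustration only.
    y_true = ["earn", "earn", "earn", "acq"]
    y_pred = ["earn", "earn", "acq", "acq"]
    print(micro_macro_f1(y_true, y_pred))
```

Micro-F1 pools all decisions, so frequent categories dominate it, while Macro-F1 gives every category equal weight. On a collection with unevenly sized categories, this may explain a gap between the two scores such as the one reported in the abstract.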