A New Model for Automatic Text Classification

Authors

  • Hekmatullah Mumivand, Software Engineering Department, Lorestan University, Aleshtar Higher Education Center, KhorramAbad, Lorestan, IR Iran
  • Rasool Seidi Piri, Software Engineering Department, Lorestan University, Aleshtar Higher Education Center, KhorramAbad, Lorestan, IR Iran
  • Fatemeh Kheiraei, Engineering Department, Lorestan University, KhorramAbad, Lorestan, IR Iran

DOI:

https://doi.org/10.30564/ese.v3i1.3170

Abstract

In this paper, a new method for automatic text classification is presented. The system has two phases: text processing and text categorization. In the first phase, several indexing criteria, namely bigram, trigram, and quad-gram, are used to extract features. In the second phase, the W-SMO machine learning algorithm is used to train the system. To evaluate and compare the results on the two criteria of accuracy and readability, Macro-F1 and Micro-F1 were calculated for the different indexing methods. Experiments on 7676 standard Reuters text documents showed that the best performance was obtained by W-SMO with the bigram criterion, with scores of 95.17 (micro) and 79.86 (macro). The results also indicated that the proposed method performs best compared with the W-J48, Naïve Bayes, K-NN, and Decision Tree algorithms.
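As a rough illustration of two ideas named in the abstract, the sketch below extracts n-grams from a token sequence and computes Micro-F1 and Macro-F1 for a set of predictions. It is not the authors' implementation: the function names, the choice of word-level tokens, the single-label setting, and the toy labels are all assumptions made for the example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the contiguous n-grams (as tuples) of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred):
    """Micro-F1 and Macro-F1 for single-label predictions (an assumption)."""
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0, 0.0
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    # Micro: pool the counts over all classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Hypothetical toy data: one frequent class and one rare class.
y_true = ["earn"] * 8 + ["grain"] * 2
y_pred = ["earn"] * 8 + ["earn", "grain"]
bigrams = ngrams("oil prices rose sharply".split(), 2)
micro, macro = micro_macro_f1(y_true, y_pred)
```

Because Micro-F1 pools counts over all documents while Macro-F1 averages per-class scores, a classifier that does well on frequent classes and worse on rare ones scores higher on the micro measure, which is consistent with the gap between the two figures reported above.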

Keywords:

Text classification; Machine learning; W-SMO; N-gram

How to Cite

Mumivand, H., Seidi Piri, R., & Kheiraei, F. (2021). A New Model for Automatic Text Classification. Electrical Science & Engineering, 3(1), 10–15. https://doi.org/10.30564/ese.v3i1.3170

Article Type

Articles