Innovative Machine Learning Approaches for Drinking Water Quality Classification: Addressing Data Imbalances with Custom SMOTE Sampling Strategy

Authors

  • Borislava Toleva

    Faculty of Economics and Business Administration, Sofia University St. Kl. Ohridski, Sofia 1113, Bulgaria

  • Ivan Ivanov

    Faculty of Economics and Business Administration, Sofia University St. Kl. Ohridski, Sofia 1113, Bulgaria

  • Kalina Kitova

    Faculty of Economics and Business Administration, Sofia University St. Kl. Ohridski, Sofia 1113, Bulgaria

DOI:

https://doi.org/10.30564/jees.v7i3.8195
Received: 23 December 2024 | Revised: 17 January 2025 | Accepted: 21 January 2025 | Published Online: 10 March 2025

Abstract

Water quality is a complex and important measure of the health and sustainability of ecosystems, and it directly influences biodiversity, human health, and the world economy. Predicting water quality therefore plays a crucial role in informed decision-making and sound environmental management. This study addresses these challenges by proposing an effective machine learning methodology applied to the public “Water Quality” dataset. The methodology prepares the dataset for classification and achieves high values on evaluation metrics such as accuracy, sensitivity, and specificity. It is based on two novel approaches: (a) the SMOTE method for handling imbalanced data and (b) carefully applied classical machine learning models. The paper uses Random Forests, Decision Trees, XGBoost, and Support Vector Machines because they can handle large datasets, can be trained on skewed data, and provide high accuracy in water quality classification. A key contribution of this work is the use of custom sampling strategies within the SMOTE approach, which significantly improved the performance metrics and the handling of class imbalance. With the XGBoost model, the methodology achieves the highest reported metrics: accuracy (98.92% vs. 96.06%), sensitivity (98.3% vs. 71.26%), and F1 score (98.37% vs. 79.74%). These gains underscore the effectiveness of the custom SMOTE sampling strategies in addressing class imbalance. The findings contribute to environmental management by enabling ecology specialists to develop more accurate strategies for monitoring, assessing, and managing drinking water quality, supporting better ecosystem and public health outcomes.
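The idea of a custom sampling strategy within SMOTE can be illustrated with a minimal sketch. The abstract does not state which target class sizes, neighbour count, or library the authors used, so everything below is an assumption for illustration: the function `smote_oversample`, the toy two-feature data, and the target count of 12 are hypothetical. The sketch uses only the Python standard library; in practice a library such as imbalanced-learn, whose `SMOTE` class accepts a `sampling_strategy` dictionary of per-class target counts, would likely be used instead.

```python
import random
from collections import Counter


def smote_oversample(X, y, sampling_strategy, k=5, seed=0):
    """Append synthetic minority rows to X and y (illustrative sketch only).

    sampling_strategy maps a class label to the desired total number of
    rows for that class after resampling (the "custom" part).
    """
    rng = random.Random(seed)
    counts = Counter(y)
    X_out, y_out = [list(row) for row in X], list(y)
    for label, target in sampling_strategy.items():
        rows = [row for row, lab in zip(X, y) if lab == label]
        n_new = target - counts[label]
        if n_new <= 0 or len(rows) < 2:
            continue  # nothing to add, or no neighbour to interpolate with
        for _ in range(n_new):
            base = rng.choice(rows)
            # k nearest same-class neighbours of base (squared Euclidean)
            neighbours = sorted(
                (r for r in rows if r is not base),
                key=lambda r: sum((a - b) ** 2 for a, b in zip(base, r)),
            )[:k]
            other = rng.choice(neighbours)
            gap = rng.random()  # position along the segment base -> other
            X_out.append([a + gap * (b - a) for a, b in zip(base, other)])
            y_out.append(label)
    return X_out, y_out


# Hypothetical toy data: 20 majority rows (class 0), 4 minority rows (class 1).
data_rng = random.Random(42)
X_major = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(20)]
X_minor = [[data_rng.gauss(3, 0.5), data_rng.gauss(3, 0.5)] for _ in range(4)]
X = X_major + X_minor
y = [0] * 20 + [1] * 4

# Custom strategy: grow class 1 to 12 rows rather than to full parity (20).
X_res, y_res = smote_oversample(X, y, sampling_strategy={1: 12}, k=3)
print(Counter(y_res))
```

Choosing a target below full parity, as in this toy call, is one way a sampling strategy can be customized; whether the authors over-sampled to parity or to some other ratio is not stated in the abstract.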

Keywords:

Data Modeling; Class Imbalance; SMOTE; Machine Learning Classification; Model Estimation; Water Quality Dataset

How to Cite

Toleva, B., Ivanov, I., & Kitova, K. (2025). Innovative Machine Learning Approaches for Drinking Water Quality Classification: Addressing Data Imbalances with Custom SMOTE Sampling Strategy. Journal of Environmental & Earth Sciences, 7(3), 262–273. https://doi.org/10.30564/jees.v7i3.8195