Enhancing Environmental Sustainability through Machine Learning: Predicting Drug Solubility (LogS) for Ecotoxicity Assessment and Green Pharmaceutical Design

Authors

  • Imane Aitouhanni

    SSLAB, ENSIAS, Mohammed V University, Rabat 10000, Morocco

  • Amine Berqia

    SSLAB, ENSIAS, Mohammed V University, Rabat 10000, Morocco

  • Redouane Kaiss

    Research Laboratory in Economics, Management, and Business Administration, Faculty of Economics and Management, Hassan 1st University, Settat 26000, Morocco

  • Habiba Bouijij

    SSLAB, ENSIAS, Mohammed V University, Rabat 10000, Morocco

  • Yassine Mouniane

    Natural Resources and Sustainable Development Laboratory, Faculty of Sciences, Ibn Tofail University, Kenitra 14000, Morocco

DOI:

https://doi.org/10.30564/jees.v7i4.8866
Received: 25 February 2025 | Revised: 11 March 2025 | Accepted: 18 March 2025 | Published Online: 20 March 2025

Abstract

Pharmaceutical pollution is an increasing threat to aquatic environments because inactive compounds do not break down and drug products accumulate in living organisms. A drug's aqueous solubility, expressed as LogS, is an important parameter for assessing its environmental fate, bioavailability, and toxicity. LogS is typically measured in the laboratory, which is costly and time-consuming and does not allow large-scale analyses. This research develops and evaluates machine learning models that estimate LogS and may improve environmental risk assessments of toxic pharmaceutical pollutants. We used a dataset of 8832 molecular compounds from the ChEMBL database. We preprocessed and cleaned the data, including removing missing values, then normalized the chemical properties and applied feature selection techniques. We predicted LogS with several machine learning and deep learning models, including linear regression, random forests (RF), support vector machines (SVM), gradient boosting (GBM), and artificial neural networks (ANNs). We assessed model performance using root mean square error (RMSE), mean absolute error (MAE), and the coefficient of determination (R²). The findings show that the Least Angle Regression (LAR) model performed best, with an R² value close to 1.0000, indicating high predictive accuracy. The Orthogonal Matching Pursuit (OMP) model achieved good accuracy (R² = 0.8727) while remaining computationally cheap, whereas other models (e.g., neural networks, random forests) performed well but were too computationally expensive. Finally, an error analysis conducted to assess the robustness of the results indicated that residuals were evenly distributed around zero, supporting the results from the LAR model.
The current research illustrates the potential of AI in anticipating drug solubility, providing support for green pharmaceutical design and environmental risk assessment. Future work should extend predictions to include degradation and toxicity to enhance predictive power and applicability.
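The three evaluation metrics named in the abstract (RMSE, MAE, and R²) can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' evaluation code: the function names and the four LogS values below are hypothetical and are not taken from the study's ChEMBL dataset.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error: average magnitude of the prediction errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error: like MAE, but penalizes large errors more."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 minus residual over total variance."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical measured and predicted LogS values (illustration only,
# not data from the study).
y_true = [-2.0, -3.5, -1.0, -4.5]
y_pred = [-2.5, -3.0, -1.5, -4.0]

print(f"MAE  = {mae(y_true, y_pred):.4f}")
print(f"RMSE = {rmse(y_true, y_pred):.4f}")
print(f"R2   = {r2(y_true, y_pred):.4f}")
```

RMSE and MAE are in the same units as LogS, so lower is better, while R² approaches 1 as the predictions explain more of the variance in the measured values.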

Keywords:

Solubility; Prediction; Machine Learning; Ecotoxicity; LogS

How to Cite

Aitouhanni, I., Berqia, A., Kaiss, R., Bouijij, H., & Mouniane, Y. (2025). Enhancing Environmental Sustainability through Machine Learning: Predicting Drug Solubility (LogS) for Ecotoxicity Assessment and Green Pharmaceutical Design. Journal of Environmental & Earth Sciences, 7(4), 82–95. https://doi.org/10.30564/jees.v7i4.8866