Google Translate or ChatGPT-4? A Multi-Metric Evaluation of Chinese-to-English Technical Translation

Authors

  • Zhongming Zhang

    Faculty of Modern Languages and Communication, Universiti Putra Malaysia, Serdang 43400, Malaysia

  • Syed Nurulakla Syed Abdullah

    Faculty of Modern Languages and Communication, Universiti Putra Malaysia, Serdang 43400, Malaysia

  • Muhammad Alif Redzuan Abdullah

    Faculty of Modern Languages and Communication, Universiti Putra Malaysia, Serdang 43400, Malaysia

  • Lina Zhou

    Faculty of Modern Languages and Communication, Universiti Putra Malaysia, Serdang 43400, Malaysia

DOI:

https://doi.org/10.30564/fls.v7i9.11014
Received: 11 July 2025 | Revised: 21 July 2025 | Accepted: 28 July 2025 | Published Online: 15 September 2025

Abstract

The advent of large language models (LLMs), such as ChatGPT, has opened new avenues for machine translation (MT), particularly in specialised domains such as technical documentation. However, their performance relative to neural MT systems such as Google Neural Machine Translation (GNMT) lacks empirical validation for the Chinese-English language pair. This study compares the Chinese-English translation quality of GNMT and ChatGPT-4 on technical manuals, evaluates the variability of six widely used automatic metrics, and examines their correlation with human assessment. A parallel bilingual corpus of eighty aligned segments from technical manuals was constructed. Translations generated by GNMT and ChatGPT-4 were evaluated using standard automatic lexical metrics (BLEU, METEOR, and chrF), semantic metrics (BLEURT, BERTScore, and COMET-QE), and human assessments. Statistical analyses employed paired t-tests, Wilcoxon signed-rank tests, Friedman tests with Wilcoxon post hoc comparisons, and Spearman correlations. Human evaluators preferred ChatGPT-4 over GNMT for technical manual translation, whereas all automatic metrics favoured GNMT. The automatic metrics themselves showed notable inconsistencies, with partial alignment observed in COMET-QE-related comparisons. Correlation patterns also differed across systems: for GNMT, only semantic metrics exhibited limited correlations with human assessments, whereas for ChatGPT-4, lexical metrics showed low-to-moderate correlations and semantic metrics demonstrated no meaningful association. These findings highlight ChatGPT-4’s advantage in human-judged translation quality, while underscoring the misalignment between automatic metrics and human assessments in LLM-based machine translation, thereby reinforcing the need for more context-sensitive and adaptive evaluation approaches.
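
A minimal, illustrative sketch of the evaluation pipeline described above (Python, assuming the sacrebleu and scipy packages): segment-level BLEU and chrF scores are computed for both systems, compared with a Wilcoxon signed-rank test, and correlated with human ratings via Spearman's rho. The three segments and human scores below are invented placeholders standing in for the study's eighty-segment corpus, and the semantic metrics (BLEURT, BERTScore, COMET-QE), which require dedicated model checkpoints, are omitted.

```python
# Toy version of the automatic-evaluation step: segment-level lexical scores,
# a paired significance test, and a metric-human correlation.
# All segments and human ratings below are invented placeholders.
from sacrebleu.metrics import BLEU, CHRF
from scipy.stats import spearmanr, wilcoxon

refs = [
    "Disconnect the power supply before servicing the unit.",
    "Tighten the bolts to the specified torque.",
    "Check the coolant level once a week.",
]
gnmt_out = [
    "Disconnect the power before repairing the unit.",
    "Tighten the bolts to the torque specified.",
    "Check coolant level every week.",
]
gpt4_out = [
    "Disconnect the power supply before servicing the device.",
    "Tighten the bolts to the specified torque value.",
    "Inspect the coolant level once a week.",
]
human_gpt4 = [92.0, 88.0, 85.0]  # hypothetical 0-100 human ratings of gpt4_out

bleu = BLEU(effective_order=True)  # effective_order avoids zero sentence-level BLEU
chrf = CHRF()

def segment_scores(metric, hyps, references):
    """One score per aligned segment, as needed for paired tests and correlations."""
    return [metric.sentence_score(h, [r]).score for h, r in zip(hyps, references)]

bleu_gnmt = segment_scores(bleu, gnmt_out, refs)
bleu_gpt4 = segment_scores(bleu, gpt4_out, refs)
chrf_gnmt = segment_scores(chrf, gnmt_out, refs)

# Paired non-parametric comparison of the two systems on the same segments.
w_stat, w_p = wilcoxon(bleu_gnmt, bleu_gpt4)

# Rank correlation between an automatic metric and human judgments.
rho, rho_p = spearmanr(bleu_gpt4, human_gpt4)

print(f"BLEU  GNMT: {[round(s, 1) for s in bleu_gnmt]}  ChatGPT-4: {[round(s, 1) for s in bleu_gpt4]}")
print(f"chrF  GNMT: {[round(s, 1) for s in chrf_gnmt]}")
print(f"Wilcoxon W={w_stat:.2f} (p={w_p:.3f}); Spearman rho={rho:.3f} (p={rho_p:.3f})")
```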

Keywords:

Automatic Evaluation Metrics; ChatGPT-4; Google Neural Machine Translation (GNMT); Technical Manual Translation

How to Cite

Zhang, Z., Syed Abdullah, S. N., Abdullah, M. A. R., & Zhou, L. (2025). Google Translate or ChatGPT-4? A Multi-Metric Evaluation of Chinese-to-English Technical Translation. Forum for Linguistic Studies, 7(9), 770–788. https://doi.org/10.30564/fls.v7i9.11014

Article Type

Article