Advancing Thai Sentence Embedding: Benchmark Development

Authors

  • Panuthep Tasawong

    School of Information Science and Technology, Vidyasirimedhi Institute of Science and Technology (VISTEC), Rayong 21210, Thailand

  • Peerat Limkonchotiwat

    School of Information Science and Technology, Vidyasirimedhi Institute of Science and Technology (VISTEC), Rayong 21210, Thailand

  • Wuttikorn Ponwitayarat

    School of Information Science and Technology, Vidyasirimedhi Institute of Science and Technology (VISTEC), Rayong 21210, Thailand

  • Surapon Nonesung

    School of Information Science and Technology, Vidyasirimedhi Institute of Science and Technology (VISTEC), Rayong 21210, Thailand

  • Sitiporn Sae Lim

    School of Information Science and Technology, Vidyasirimedhi Institute of Science and Technology (VISTEC), Rayong 21210, Thailand

  • Chayapat Uthayopas

    School of Information Science and Technology, Vidyasirimedhi Institute of Science and Technology (VISTEC), Rayong 21210, Thailand

  • Can Udomcharoenchaikit

    School of Information Science and Technology, Vidyasirimedhi Institute of Science and Technology (VISTEC), Rayong 21210, Thailand

  • Sarana Nutanong

    School of Information Science and Technology, Vidyasirimedhi Institute of Science and Technology (VISTEC), Rayong 21210, Thailand

DOI:

https://doi.org/10.30564/fls.v7i12.12023
Received: 10 September 2025 | Revised: 27 September 2025 | Accepted: 29 September 2025 | Published Online: 19 November 2025

Abstract

Sentence embedding, the task of capturing textual information in contextualized vectors, has attracted considerable attention in recent years owing to its effectiveness across a wide range of downstream NLP applications, such as classification, retrieval, and semantic search. Despite substantial progress, particularly for English, sentence embeddings for resource-constrained languages such as Thai remain underexplored. Existing Thai benchmarks are limited in scope: they primarily evaluate models on text classification, leaving other important tasks insufficiently examined. To address this gap, we introduce the Thai Sentence Embedding Benchmark, a comprehensive evaluation suite covering diverse tasks, including semantic textual similarity (STS), text classification, pairwise classification, and retrieval. We systematically collect and reformat high-quality Thai texts into embedding-based tasks, ensuring robust and standardized evaluation. Furthermore, we propose a new dataset, Thai STS, specifically designed to fill a crucial gap in evaluating semantic similarity in Thai. Beyond benchmarking, we present new Thai sentence embeddings trained under four sentence embedding frameworks designed for low-resource settings, with three model sizes spanning monolingual and multilingual encoder-based architectures. This variety enables meaningful insights into the trade-offs among scale, architecture, and resource constraints. Through extensive experiments, we evaluate a broad spectrum of embedding models, including newly developed large language models (LLMs), smaller language models (SLMs), and off-the-shelf API-based systems. Our findings highlight both strengths and persistent challenges across tasks, providing guidance for future work. All datasets, models, and code are released under the Apache-2.0 License to support open, reproducible, and community-driven progress in the Thai NLP community.
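As a concrete illustration of the STS evaluation the abstract describes, embedding models are conventionally scored by the Spearman correlation between the cosine similarities of sentence pairs and human-annotated gold scores. The sketch below shows that protocol under stated assumptions: `toy_embed` and its hard-coded vectors are purely hypothetical stand-ins for a trained Thai sentence encoder, not the benchmark's actual code.

```python
import numpy as np
from scipy.stats import spearmanr


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two sentence vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def evaluate_sts(embed, sentence_pairs, gold_scores):
    """Standard STS metric: Spearman correlation between predicted
    cosine similarities and human-annotated similarity scores."""
    predictions = [cosine_similarity(embed(s1), embed(s2))
                   for s1, s2 in sentence_pairs]
    return spearmanr(predictions, gold_scores).correlation


# Toy stand-in for a trained Thai sentence encoder (illustration only):
# a real evaluation would call the model's encode function instead.
_TOY_VECTORS = {
    "แมวนอนบนเก้าอี้": np.array([1.0, 0.0]),          # "the cat sleeps on the chair"
    "แมวหลับบนเก้าอี้": np.array([0.9, 0.1]),          # "the cat is asleep on the chair"
    "ราคาน้ำมันปรับตัวสูงขึ้น": np.array([0.0, 1.0]),  # "oil prices rose"
}


def toy_embed(sentence: str) -> np.ndarray:
    return _TOY_VECTORS[sentence]
```

A model ranks well under this metric when near-paraphrases (the first two sentences above) receive higher cosine similarity than unrelated pairs, matching the ordering of the gold annotations.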

Keywords:

Sentence Embedding Evaluation; Text Classification; Retrieval; Semantic Textual Similarity

References

[1] Devlin, J., Chang, M.-W., Lee, K., et al., 2019. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186.

[2] Liu, Y., Lin, W., Shi, Y., et al., 2021. RoBERTa: A Robustly Optimized BERT Pre-training Approach with Post-Training. In Proceedings of the 20th China National Conference on Computational Linguistics, Hohhot, China, 13–15 August 2021; pp. 1218–1227. Available from: http://www.cips-cl.org/static/anthology/CCL-2021/CCL-21-108.pdf

[3] Gao, T., Yao, X., Chen, D., 2021. SimCSE: Simple Contrastive Learning of Sentence Embeddings. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 6894–6910.

[4] Wang, L., Yang, N., Huang, X., et al., 2024. Improving Text Embeddings with Large Language Models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, Bangkok, Thailand, 11–16 August 2024; pp. 11897–11916.

[5] Muennighoff, N., Su, H., Wang, L., et al., 2024. Generative Representational Instruction Tuning. arXiv preprint. arXiv:2402.09906v3. DOI: https://doi.org/10.48550/arXiv.2402.09906

[6] Li, Z., Zhang, X., Zhang, Y., et al., 2023. Towards General Text Embeddings with Multi-Stage Contrastive Learning. arXiv preprint. arXiv:2308.03281v1. DOI: https://doi.org/10.48550/arXiv.2308.03281

[7] Muennighoff, N., Tazi, N., Magne, L., et al., 2023. MTEB: Massive Text Embedding Benchmark. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, Dubrovnik, Croatia, 2–6 May 2023; pp. 2014–2037.

[8] Conneau, A., Kiela, D., 2018. SentEval: An Evaluation Toolkit for Universal Sentence Representations. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, 7–12 May 2018; pp. 1699–1704.

[9] Thakur, N., Reimers, N., Rücklé, A., et al., 2021. BEIR: A Heterogenous Benchmark for Zero-Shot Evaluation of Information Retrieval Models. arXiv preprint. arXiv:2104.08663. DOI: https://doi.org/10.48550/arXiv.2104.08663

[10] Charin, P., Phasathorn, S., 2020. PyThaiNLP Classification Benchmarks. Available from: https://github.com/PyThaiNLP/classification-benchmarks (cited 20 January 2025).

[11] Cer, D., Diab, M., Agirre, E., et al., 2017. SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Cross-Lingual Focused Evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), Vancouver, BC, Canada, 3–4 August 2017; pp. 1–14.

[12] Wongso, W., Ananto, J., David, S.S., et al., 2024. Indonesian Sentence Embeddings. Available from: https://github.com/LazarusNLP/indonesian-sentence-embeddings (cited 20 January 2025).

[13] Wang, Y.-S., Wu, A., Neubig, G., 2022. English Contrastive Learning Can Learn Universal Cross-Lingual Sentence Embeddings. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 9122–9133.

[14] Chen, J., Xiao, S., Zhang, P., et al., 2024. M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, 11–16 August 2024; pp. 2318–2335.

[15] Hämäläinen, M., Patpong, P., Alnajjar, K., et al., 2021. Detecting Depression in Thai Blog Posts: A Dataset and a Baseline. In Proceedings of the Seventh Workshop on Noisy User-Generated Text (W-NUT 2021), online, 11 November 2021; pp. 20–25.

[16] Phatthiyaphaibun, W., Chaovavanich, K., Polpanumas, C., et al., 2023. PyThaiNLP: Thai Natural Language Processing in Python. In Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023), Singapore, 6 December 2023; pp. 25–36.

[17] Lowphansirikul, L., Polpanumas, C., Rutherford, A.T., et al., 2022. A Large English–Thai Parallel Corpus from the Web and Machine-Generated Text. Language Resources and Evaluation. 56(2), 477–499. DOI: https://doi.org/10.1007/s10579-021-09536-6

[18] Payoungkhamdee, P., Porkaew, P., Sinthunyathum, A., et al., 2021. LimeSoda: Dataset for Fake News Detection in Healthcare Domain. In Proceedings of the 2021 16th International Joint Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP), Ayutthaya, Thailand, 22–23 December 2021; pp. 1–6. DOI: https://doi.org/10.1109/iSAI-NLP54397.2021.9678187

[19] FitzGerald, J., Hench, C., Peris, C., et al., 2023. MASSIVE: A 1M-Example Multilingual Natural Language Understanding Dataset with 51 Typologically-Diverse Languages. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, Toronto, Canada, 9–14 July 2023; pp. 4277–4302.

[20] Mollanorozy, S., Tanti, M., Nissim, M., 2023. Cross-Lingual Transfer Learning with Persian. In Proceedings of the 5th Workshop on Research in Computational Linguistic Typology and Multilingual NLP, Dubrovnik, Croatia, 2–6 May 2023; pp. 89–95.

[21] Adelani, D.I., Liu, H., Shen, X., et al., 2024. SIB-200: A Simple, Inclusive, and Big Evaluation Dataset for Topic Classification in 200+ Languages and Dialects. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), St. Julian’s, Malta, 17–22 March 2024; pp. 226–245.

[22] Lovenia, H., Mahendra, R., Maulana Akbar, S.M., et al., 2024. SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, FL, USA, 12–16 November 2024; pp. 5155–5203.

[23] Pasupa, K., Netisopakul, P., Lertsuksakda, R., 2016. Sentiment Analysis of Thai Children Stories. Artificial Life and Robotics. 21(3), 357–364. DOI: https://doi.org/10.1007/s10015-016-0283-8

[24] Suriyawongkul, A., Chuangsuwanich, E., Chormai, P., et al., 2019. Wisesight Sentiment. Available from: https://github.com/PyThaiNLP/wisesight-sentiment (cited 20 January 2025).

[25] Liu, F., Jiao, Y., Massiah, J., et al., 2022. Trans-Encoder: Unsupervised Sentence-Pair Modelling Through Self- and Mutual-Distillations. arXiv preprint. arXiv:2109.13059. DOI: https://doi.org/10.48550/arXiv.2109.13059

[26] Conneau, A., Rinott, R., Lample, G., et al., 2018. XNLI: Evaluating Cross-Lingual Sentence Representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 2475–2485.

[27] Yang, Y., Cer, D., Ahmad, A., et al., 2020. Multilingual Universal Sentence Encoder for Semantic Retrieval. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Stroudsburg, PA, USA, 5–10 July 2020; pp. 87–94.

[28] Karpukhin, V., Oğuz, B., Min, S., et al., 2020. Dense Passage Retrieval for Open-Domain Question Answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), online, 16–20 November 2020; pp. 6769–6781.

[29] Asai, A., Yu, X., Kasai, J., et al., 2021. One Question Answering Model for Many Languages with Cross-Lingual Dense Passage Retrieval. In Proceedings of the 35th Conference on Neural Information Processing Systems (NeurIPS 2021), online, 6–14 December 2021.

[30] Limkonchotiwat, P., Ponwitayarat, W., Udomcharoenchaikit, C., et al., 2022. CL-ReLKT: Cross-Lingual Language Knowledge Transfer for Multilingual Retrieval Question Answering. In Proceedings of the Findings of the Association for Computational Linguistics: NAACL 2022, Seattle, WA, USA, 10–15 July 2022; pp. 2141–2155.

[31] Viriyayudhakorn, K., Charin, P., 2021. iapp wiki_qa_squad. Available from: https://github.com/iapp-technology/iapp-wiki-qa-dataset (cited 20 January 2025).

[32] Zhang, X., Thakur, N., Ogundepo, O., et al., 2023. MIRACL: A Multilingual Retrieval Dataset Covering 18 Diverse Languages. Transactions of the Association for Computational Linguistics. 11, 1114–1131. DOI: https://doi.org/10.1162/tacl_a_00595

[33] Trakultaweekoon, K., Thaiprayoon, S., Palingoon, P., et al., 2019. The First Wikipedia Questions and Factoid Answers Corpus in the Thai Language. In Proceedings of the 2019 14th International Joint Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP), Chiang Mai, Thailand, 30 October–1 November 2019; pp. 1–4. DOI: https://doi.org/10.1109/iSAI-NLP48611.2019.9045143

[34] Clark, J.H., Choi, E., Collins, M., et al., 2020. TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages. Transactions of the Association for Computational Linguistics. 8, 454–470. DOI: https://doi.org/10.1162/tacl_a_00317

[35] Akarajaradwong, P., Pothavorn, P., Chaksangchaichot, C., et al., 2025. NitiBench: Benchmarking LLM Frameworks on Thai Legal Question Answering Capabilities. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP), Suzhou, China, 10–14 December 2025; pp. 34292–34315.

[36] Conneau, A., Khandelwal, K., Goyal, N., et al., 2020. Unsupervised Cross-Lingual Representation Learning at Scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, online, 5–10 July 2020; pp. 8440–8451.

[37] Wenzek, G., Lachaux, M.-A., Conneau, A., et al., 2020. CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data. In Proceedings of the Twelfth Language Resources and Evaluation Conference, Marseille, France, 13–15 May 2020; pp. 4003–4012.

[38] Lowphansirikul, L., Polpanumas, C., Jantrakulchai, N., et al., 2021. WangchanBERTa: Pretraining Transformer-Based Thai Language Models. arXiv preprint. arXiv:2101.09635. DOI: https://doi.org/10.48550/arXiv.2101.09635

[39] Sriwirote, P., Thapiang, J., Timtong, V., et al., 2023. PhayaThaiBERT: Enhancing a Pretrained Thai Language Model with Unassimilated Loanwords. arXiv preprint. arXiv:2311.12475. DOI: https://doi.org/10.48550/arXiv.2311.12475

[40] Reimers, N., Gurevych, I., 2019. Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, Hong Kong, China, 3–7 November 2019; pp. 3982–3992.

[41] Reimers, N., Gurevych, I., 2020. Making Monolingual Sentence Embeddings Multilingual Using Knowledge Distillation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), online, 16–20 November 2020; pp. 4512–4525.

[42] Grattafiori, A., Dubey, A., Jauhri, A., et al., 2024. The Llama 3 Herd of Models. arXiv preprint. arXiv:2407.21783. DOI: https://doi.org/10.48550/arXiv.2407.21783

[43] Rafailov, R., Sharma, A., Mitchell, E., et al., 2023. Direct Preference Optimization: Your Language Model Is Secretly a Reward Model. arXiv preprint. arXiv:2305.18290. DOI: https://doi.org/10.48550/arXiv.2305.18290

[44] Pipatanakul, K., Jirabovonvisut, P., Manakul, P., et al., 2023. Typhoon: Thai Large Language Models. arXiv preprint. arXiv:2312.13951. DOI: https://doi.org/10.48550/arXiv.2312.13951

[45] Limkonchotiwat, P., Ponwitayarat, W., Lowphansirikul, L., et al., 2023. An Efficient Self-Supervised Cross-View Training for Sentence Embedding. Transactions of the Association for Computational Linguistics. 11, 1572–1587. DOI: https://doi.org/10.1162/tacl_a_00620

[46] Limkonchotiwat, P., Ponwitayarat, W., Lowphansirikul, L., et al., 2022. ConGen: Unsupervised Control and Generalization Distillation for Sentence Representation. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2022, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 6467–6480.

[47] Neelakantan, A., Xu, T., Puri, R., et al., 2022. Text and Code Embeddings by Contrastive Pre-Training. arXiv preprint. arXiv:2201.10005. DOI: https://doi.org/10.48550/arXiv.2201.10005

[48] Wang, K., Reimers, N., Gurevych, I., 2021. TSDAE: Using Transformer-Based Sequential Denoising Auto-Encoder for Unsupervised Sentence Embedding Learning. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2021, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 671–688.

How to Cite

Tasawong, P., Limkonchotiwat, P., Ponwitayarat, W., Nonesung, S., Sae Lim, S., Uthayopas, C., Udomcharoenchaikit, C., & Nutanong, S. (2025). Advancing Thai Sentence Embedding: Benchmark Development. Forum for Linguistic Studies, 7(12), 1380–1397. https://doi.org/10.30564/fls.v7i12.12023