A Corpus Approach in Language Discovery: A Word Frequency Analysis Based on the Corpus Outcomes in Kazakh

Authors

  • Assel Ormanova

    Department of General Education Disciplines, Astana IT University, Astana 010000, Kazakhstan

  • Sofya Omarova

    National Scientific and Practical Center «Til-Qazyna», Astana 010000, Kazakhstan

  • Dana Ospanova

    National Scientific and Practical Center «Til-Qazyna», Astana 010000, Kazakhstan

  • Nurlykhan Aitova

    Department of Kazakh Linguistics, Eurasian National University, Astana 010000, Kazakhstan

  • Gulnaz Tokenkyzy

    National Scientific and Practical Center «Til-Qazyna», Astana 010000, Kazakhstan

  • Madina Alshynbekova

    Branch campus of Beijing Language and Culture University (BLCU), Astana International University, Astana 010000, Kazakhstan

DOI: https://doi.org/10.30564/fls.v7i2.8317
Received: 5 November 2024 | Revised: 20 January 2025 | Accepted: 23 January 2025 | Published Online: 20 February 2025

Abstract

This study examines the most frequently used parts of speech and grammatical forms in the texts of the sub-corpora of the National Corpus of the Kazakh Language (qazcorpora.kz). Word-form frequencies were drawn from the 13 million word usages in the 2023 corpus database and analyzed both manually and with the built-in functions of the corpus software. The study offers insights into the frequency distribution, grammatical variability, and comparative patterns of Kazakh journalistic texts. The results indicate that: (1) within their respective word classes, the conjunction ‘žäne’ [and], the demonstrative pronoun ‘bul’ [this], the auxiliary verb ‘dep’ [no translation], the noun ‘Kazakh’ [Kazakh], the modal verb ‘žoq’ [not], the adjective ‘aq’ [white], the adverb ‘köp’ [many/much], and the numeral ‘eki’ [two] showed the highest frequency indicators, underscoring their functional and stylistic roles in text construction; (2) function words were the most frequently used part of speech; and (3) among function words, the conjunction ‘žäne’ [and], the postposition ‘üšın’ [for], and the particle ‘ɣana’ [only] had the highest frequency indicators. The findings show that Kazakh frequency patterns align with cross-linguistic regularities such as Zipf’s law, while also revealing features attributable to the language’s agglutinative nature.
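
As an illustration of the kind of analysis summarized above, the following minimal Python sketch counts word-form frequencies and estimates the slope of the log-log rank-frequency line, which Zipf's law predicts to be close to -1. It is not the authors' procedure or the corpus software's method: the function names `word_frequencies` and `zipf_slope`, the regular-expression tokenizer, and the toy sample (assembled from word forms quoted in the abstract) are all assumptions made for illustration.

```python
from collections import Counter
import math
import re


def word_frequencies(text):
    """Count lowercased word forms (hypothetical helper, not the corpus tool).

    In Python 3, \\w matches Unicode letters, so forms such as 'žäne' stay whole.
    """
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tokens)


def zipf_slope(counts):
    """Least-squares slope of log(frequency) against log(rank).

    A slope near -1 is what Zipf's law predicts for a rank-frequency list.
    """
    freqs = sorted(counts.values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct word forms")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Toy sample only: real input would be text exported from the corpus.
sample = "žäne bul žäne eki žäne bul"
counts = word_frequencies(sample)
print(counts.most_common())
print(zipf_slope(counts))
```

On a sample this small the slope is only a sanity check; a meaningful estimate needs a frequency list drawn from the full corpus.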

Keywords:

National Corpus; Frequency Indicator; Part of Speech; Grammatical Form; Corpus Linguistics

How to Cite

Ormanova, A., Omarova, S., Ospanova, D., Aitova, N., Tokenkyzy, G., & Alshynbekova, M. (2025). A Corpus Approach in Language Discovery: A Word Frequency Analysis Based on the Corpus Outcomes in Kazakh. Forum for Linguistic Studies, 7(2), 869–881. https://doi.org/10.30564/fls.v7i2.8317