A Corpus Approach in Language Discovery: A Word Frequency Analysis Based on the Corpus Outcomes in Kazakh


  • Sofya Omarova

    National Scientific and Practical Center «Til-Qazyna», Astana 010000, Kazakhstan

  • Dana Ospanova

    National Scientific and Practical Center «Til-Qazyna», Astana 010000, Kazakhstan

  • Nurlykhan Aitova

    Department of Kazakh Linguistics, Eurasian National University, Astana 010000, Kazakhstan

  • Gulnaz Tokenkyzy

    National Scientific and Practical Center «Til-Qazyna», Astana 010000, Kazakhstan

  • Assel Ormanova

    Department of General Education Disciplines, Astana IT University, Astana 010000, Kazakhstan

  • Madina Alshynbekova

    Branch campus of Beijing Language and Culture University (BLCU), Astana International University, Astana 010000, Kazakhstan


Received: 5 November 2024 | Revised: 20 January 2025 | Accepted: 23 January 2025 | Published Online: 20 February 2025


This study examines the most frequently used parts of speech and grammatical forms in the texts of the sub-corpora of the National Corpus of the Kazakh Language (qazcorpora.kz). Word-form frequencies were drawn from the 13 million word usages in the 2023 corpus database and analyzed both manually and with the functionality of the corpus software. The study provides key insights into the frequency distribution, grammatical variability, and comparative patterns of Kazakh journalistic texts. The results indicate that: (1) within their respective word classes, the conjunction ‘žäne’ [and], the demonstrative pronoun ‘bul’ [this], the auxiliary verb ‘dep’ [no translation], the noun ‘Kazakh’ [Kazakh], the modal verb ‘žoq’ [not], the adjective ‘aq’ [white], the adverb ‘köp’ [many/much], and the numeral ‘eki’ [two] showed the highest frequency indicators, which underscores their functional and stylistic roles in text construction; (2) function words were the most frequently used part of speech; and (3) among function words, the conjunction ‘žäne’ [and], the postposition ‘üšın’ [for], and the particle ‘ɣana’ [only] had the highest frequency indicators. This corpus-based research highlights the alignment of Kazakh frequency patterns with cross-linguistic trends such as Zipf’s law, while also showcasing unique features attributed to the language’s agglutinative nature.


National Corpus; Frequency Indicator; Part of Speech; Grammatical Form; Corpus Linguistics
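
To illustrate the kind of word-form frequency count and Zipf's law comparison the abstract refers to, a minimal Python sketch follows. It is not the procedure or software used in the study: the function names `rank_frequency` and `zipf_slope` are hypothetical, and the token list is a toy sample built so that frequency is exactly inversely proportional to rank, not data from the corpus.

```python
from collections import Counter
import math


def rank_frequency(tokens):
    """Return (token, count) pairs sorted from most to least frequent."""
    return Counter(tokens).most_common()


def zipf_slope(counts):
    """Least-squares slope of log(frequency) against log(rank).

    Under an idealized Zipf distribution the slope is close to -1.
    Expects at least two counts, ordered from rank 1 downwards.
    """
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Toy token list, NOT corpus data: the word at rank r occurs 120 // r times.
toy_vocab = ["žäne", "bul", "dep", "üšın", "ɣana", "eki"]
toy_tokens = [
    word
    for rank, word in enumerate(toy_vocab, start=1)
    for _ in range(120 // rank)
]

ranked = rank_frequency(toy_tokens)
slope = zipf_slope([count for _, count in ranked])

for rank, (word, count) in enumerate(ranked, start=1):
    print(rank, word, count)
print("log-log slope:", round(slope, 3))
```

On real corpus counts the fitted slope would only approximate -1, and for an agglutinative language the result may depend on whether inflected word forms or lemmas are counted.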


