A Gold Standard Dataset for Javanese Tokenization, POS Tagging, Morphological Feature Tagging, and Dependency Parsing

Authors

  • Ika Alfina

    Faculty of Computer Science, Universitas Indonesia, Depok 16424, Indonesia

  • Arlisa Yuliawati

    Faculty of Computer Science, Universitas Indonesia, Depok 16424, Indonesia

  • Dipta Tanaya

    Faculty of Computer Science, Universitas Indonesia, Depok 16424, Indonesia

  • Arawinda Dinakaramani

    Faculty of Computer Science, Universitas Indonesia, Depok 16424, Indonesia

  • Daniel Zeman

    Faculty of Mathematics and Physics, Charles University, Praha CZ-11800, Czechia

DOI:

https://doi.org/10.30564/fls.v6i5.6957
Received: 27 July 2024 | Revised: 16 August 2024 | Accepted: 19 August 2024 | Published Online: 5 November 2024

Abstract

Javanese, a regional language of Indonesia with more than 68 million speakers, is a low-resource language in the Natural Language Processing (NLP) field because it lacks sufficient language resources, both datasets and NLP tools. In this work, we developed a gold standard dataset of 1,000 sentences and 14,323 words for four Javanese NLP tasks: tokenization, part-of-speech (POS) tagging, morphological feature tagging, and dependency parsing. The dataset is in the CoNLL-U format and conforms to the Universal Dependencies (UD) annotation guidelines. Native Javanese speakers served as the annotators. The sentences were taken from grammar books, Wikipedia, and online newspapers. To evaluate the dataset's quality, we built models for tokenization, POS tagging, morphological feature tagging, and dependency parsing using UDPipe and evaluated them with 10-fold cross-validation. For the tokenization task, the model achieved F1 scores of 99.53%, 72.01%, 97.11%, and 95.90% for segmenting tokens, multiword tokens (MWT), syntactic words, and sentences, respectively. For tagging from gold tokenization, the model achieved F1 scores of 87.22% for POS tagging and 86.66% for morphological feature tagging. Finally, for dependency parsing from gold tokenization with gold tags, the model achieved an Unlabeled Attachment Score (UAS) of 77.08% and a Labeled Attachment Score (LAS) of 71.21%.
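To make the two parsing metrics concrete, the following minimal sketch shows how UAS and LAS could be computed from CoNLL-U files when the tokenization is gold, so that gold and predicted words align one to one. It is not the evaluation script used in this work. The function names `read_conllu_arcs` and `attachment_scores` are hypothetical, and the three-word sentence is a toy example, not taken from the dataset.

```python
def read_conllu_arcs(text):
    """Return one (head, deprel) pair per syntactic word in CoNLL-U text."""
    arcs = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        # Skip multiword token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if not cols[0].isdigit():
            continue
        # CoNLL-U columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
        arcs.append((int(cols[6]), cols[7]))
    return arcs


def attachment_scores(gold, pred):
    """Return (UAS, LAS) as fractions, assuming identical tokenization."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and predicted arcs must align and be non-empty")
    n = len(gold)
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / n  # head only
    las = sum(g == p for g, p in zip(gold, pred)) / n        # head and label
    return uas, las


def to_conllu(rows):
    """Join lists of column values into tab-separated CoNLL-U lines."""
    return "\n".join("\t".join(r) for r in rows)


# Toy sentence (illustrative only): "Aku mangan sega" ("I eat rice").
GOLD = to_conllu([
    ["1", "Aku", "_", "PRON", "_", "_", "2", "nsubj", "_", "_"],
    ["2", "mangan", "_", "VERB", "_", "_", "0", "root", "_", "_"],
    ["3", "sega", "_", "NOUN", "_", "_", "2", "obj", "_", "_"],
])
# Hypothetical parser output: all heads correct, one label wrong.
PRED = to_conllu([
    ["1", "Aku", "_", "PRON", "_", "_", "2", "nsubj", "_", "_"],
    ["2", "mangan", "_", "VERB", "_", "_", "0", "root", "_", "_"],
    ["3", "sega", "_", "NOUN", "_", "_", "2", "nmod", "_", "_"],
])

uas, las = attachment_scores(read_conllu_arcs(GOLD), read_conllu_arcs(PRED))
print(f"UAS = {uas:.2%}, LAS = {las:.2%}")
```

In this toy case every word is attached to the correct head, but one dependency label is wrong, so LAS is lower than UAS. When parsing from raw text, predicted words must first be aligned with gold words, which this sketch does not attempt.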

Keywords:

Annotation Guidelines; Dependency Parsing; Low-Resource Language; Morphological Feature Tagging; POS Tagging; Tokenization; Universal Dependencies

How to Cite

Alfina, I., Yuliawati, A., Tanaya, D., Dinakaramani, A., & Zeman, D. (2024). A Gold Standard Dataset for Javanese Tokenization, POS Tagging, Morphological Feature Tagging, and Dependency Parsing. Forum for Linguistic Studies, 6(5), 131–148. https://doi.org/10.30564/fls.v6i5.6957