A Multimodal Approach to Language Identification in Sotho-Tswana Musical Videos
DOI: https://doi.org/10.30564/fls.v7i1.7623

Abstract
Language plays a crucial role in Sotho-Tswana musical videos, as it helps determine sentiment and genre. The Sotho-Tswana languages, spoken across parts of Southern Africa, are used to compose many indigenous songs. However, speakers of one Sotho-Tswana language may not understand the others. Given the widespread availability of these musical videos on social media platforms, users need recommendations that account for the language used in each video. While traditional language identification in music has focused on audio, information for identifying the singing language can also be embedded in other modalities, such as the visual and textual channels. This study employs a multimodal approach to identify the singing language in Sotho-Tswana musical videos, focusing on three modalities: visual, audio, and textual (lyrics). A multimodal dataset of Sotho-Tswana musical videos is used to train deep learning and language models for each modality. After this independent training, a decision-level (late) fusion method combines the outputs of the three modality-specific models. The results demonstrate that the multimodal approach outperforms single-modality methods, such as those relying solely on lyrics or textual information.
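The decision-level (late) fusion described above can be sketched as follows. This is a minimal, hypothetical illustration, not the authors' implementation: the language labels, per-modality probability scores, and equal weighting are all assumptions made for the example. Each modality's independently trained classifier is assumed to emit a probability distribution over candidate languages, and fusion averages those distributions before taking the argmax.

```python
# Candidate languages (illustrative; the paper's label set may differ).
LANGUAGES = ["Sesotho", "Setswana", "Sepedi", "other"]

# Hypothetical outputs of three independently trained models
# (e.g. a visual CNN, an audio network, and a lyrics/text model)
# for a single musical video.
p_visual = [0.10, 0.55, 0.25, 0.10]
p_audio  = [0.20, 0.60, 0.15, 0.05]
p_text   = [0.05, 0.70, 0.20, 0.05]

def late_fusion(prob_vectors, weights=None):
    """Decision-level fusion: weighted average of per-modality
    probability vectors over the same class set."""
    if weights is None:
        weights = [1.0] * len(prob_vectors)  # equal trust in each modality
    total = sum(weights)
    n_classes = len(prob_vectors[0])
    return [
        sum(w * p[i] for w, p in zip(weights, prob_vectors)) / total
        for i in range(n_classes)
    ]

fused = late_fusion([p_visual, p_audio, p_text])
predicted = LANGUAGES[max(range(len(fused)), key=fused.__getitem__)]
```

Because fusion happens on the models' decisions rather than their features, each modality can use whatever architecture suits it best, and a weaker modality can be down-weighted without retraining the others.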
Keywords:
Singing Language Identification; Multimodal; Audio Modality; Visual Modality; Deep Learning
Copyright © 2025 Osondu Everestus Oguike, Mpho Primus
This is an open access article under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) License.