A Novel Framework for Text-Image Pair to Video Generation in Music Anime Douga (MAD) Production

Authors

  • Ziqian Luo

    Oracle, Seattle, WA 98101, USA

  • Feiyang Chen

    Coupang, Mountain View, CA 94043, USA

  • Xiaoyang Chen

    The Ohio State University, Columbus, OH 43210, USA

  • Xueting Pan

    Oracle, Seattle, WA 98101, USA

DOI:

https://doi.org/10.30564/aia.v6i1.6848
Received: 25 June 2024 | Accepted: 18 July 2024 | Published Online: 31 July 2024

Abstract

The rapid growth of digital media has driven advancements in multimedia generation, notably in Music Anime Douga (MAD), which blends animation with music. Creating MADs currently requires extensive manual labor, particularly for designing critical frames. Existing methods like GANs and transformers excel at text-to-video synthesis but lack the precision needed for artistic control in MADs. They often neglect the crucial hand-drawn frames that form the visual foundation of these videos. This paper introduces a novel framework for generating high-quality videos from text-image pairs, addressing this gap. Our multi-modal system interprets narrative and visual inputs, generating seamless video outputs by integrating text-to-video and image-to-video synthesis. This approach enhances artistic control, preserving the creator's intent while streamlining the production process. Our framework democratizes MAD production, encouraging broader artistic participation and innovation. We provide a comprehensive review of existing research, detail our model's architecture, and validate its effectiveness through experiments. This study lays the groundwork for future advancements in AI-assisted MAD creation.
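The abstract describes, at a high level, a multi-modal pipeline that fuses a narrative (text) input with a hand-drawn key frame (image) input and produces a video. The paper's own implementation is not reproduced here; the sketch below is only an illustrative skeleton of such a text-image-pair interface, and every name in it (TextImageToVideo, the embedding sizes, the 16-frame 64x64 output) is an assumption for illustration, not the authors' architecture.

    # Illustrative sketch only: all class names, dimensions, and shapes below are
    # hypothetical stand-ins for the text-to-video / image-to-video fusion the
    # abstract describes; they are not taken from the paper.
    import torch
    import torch.nn as nn

    class TextImageToVideo(nn.Module):
        """Minimal text-image-pair -> video generator skeleton."""
        def __init__(self, text_dim=512, image_dim=512, hidden=512, num_frames=16):
            super().__init__()
            self.num_frames = num_frames
            # Fuse the narrative (text) and key-frame (image) embeddings.
            self.fuse = nn.Linear(text_dim + image_dim, hidden)
            # Predict one latent per output frame from the fused condition.
            self.frame_head = nn.Linear(hidden, num_frames * hidden)
            # Decode each frame latent into a toy 64x64 RGB frame.
            self.decode = nn.Linear(hidden, 3 * 64 * 64)

        def forward(self, text_emb, image_emb):
            cond = torch.relu(self.fuse(torch.cat([text_emb, image_emb], dim=-1)))
            latents = self.frame_head(cond).view(-1, self.num_frames, cond.shape[-1])
            frames = self.decode(latents).view(-1, self.num_frames, 3, 64, 64)
            return torch.sigmoid(frames)  # (batch, frames, C, H, W) in [0, 1]

    # Usage: random tensors stand in for real text/image encoder outputs.
    model = TextImageToVideo()
    video = model(torch.randn(1, 512), torch.randn(1, 512))
    print(video.shape)  # torch.Size([1, 16, 3, 64, 64])

In a full system, the random embeddings above would come from dedicated text and image encoders, and the per-frame decoder would be replaced by the framework's text-to-video and image-to-video synthesis stages.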

Keywords:

Multimodal; Image-text to video generation; Multimedia generation; Music Anime Douga; AI-assisted animation

How to Cite

Luo, Z., Chen, F., Chen, X., & Pan, X. (2024). A Novel Framework for Text-Image Pair to Video Generation in Music Anime Douga (MAD) Production. Artificial Intelligence Advances, 6(1), 25–33. https://doi.org/10.30564/aia.v6i1.6848

Issue

Vol. 6, No. 1 (2024)

Article Type

Article