Automatic classification of emotions in speech: methods and data

Abstract

The subject of this study is the data and methods used for automatic emotion recognition in spoken language. The task has gained considerable attention in recent years, driven primarily by the emergence of large labeled datasets and advances in machine learning models. Speech utterances are usually classified into six archetypal emotions: anger, fear, surprise, joy, disgust, and sadness. Most modern classification methods rely on transformer models pre-trained with self-supervised learning, in particular Wav2vec 2.0, HuBERT, and WavLM, which are examined in this paper. On the data side, English and Russian emotional speech corpora are analyzed, in particular the Dusha and RESD datasets. As the method, an experiment was conducted comparing the results of the Wav2vec 2.0, HuBERT, and WavLM models applied to the relatively recently collected Russian emotional speech datasets Dusha and RESD. The main purpose of the work is to assess the availability and applicability of existing data and approaches to speech emotion recognition for the Russian language, for which relatively little research has been conducted so far. The best result on the Dusha dataset was achieved by the WavLM model, with an accuracy of 0.8782. WavLM also gave the best result on the RESD dataset, reaching an accuracy of 0.81 after preliminary training on Dusha. These high classification results, due primarily to the quality and size of the collected Dusha dataset, indicate that this area is promising for further development for the Russian language.
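
The experimental pipeline described above can be illustrated with a minimal sketch, assuming the Hugging Face transformers and datasets libraries. The checkpoint name, the Hub identifier for Dusha, the column and split names, and the hyperparameters below are illustrative assumptions, not the authors' actual configuration.

# A rough sketch of fine-tuning a self-supervised speech transformer (here WavLM,
# but the same code applies to wav2vec 2.0 or HuBERT checkpoints) for emotion
# classification, evaluated with accuracy.
# NOTE: dataset id, column/split names and hyperparameters are assumptions.

import numpy as np
from datasets import Audio, load_dataset
from transformers import (AutoFeatureExtractor, AutoModelForAudioClassification,
                          Trainer, TrainingArguments)

MODEL_NAME = "microsoft/wavlm-base-plus"   # or "facebook/wav2vec2-base", "facebook/hubert-base-ls960"
DATASET_ID = "xbgoose/dusha"               # hypothetical Hub id for the Dusha corpus

ds = load_dataset(DATASET_ID)
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))   # encoders expect 16 kHz audio
num_labels = ds["train"].features["emotion"].num_classes    # assumes a ClassLabel column "emotion"

extractor = AutoFeatureExtractor.from_pretrained(MODEL_NAME)
model = AutoModelForAudioClassification.from_pretrained(MODEL_NAME, num_labels=num_labels)

def preprocess(batch):
    # Turn raw waveforms into fixed-length model inputs (pad/truncate to 4 s).
    audio = [a["array"] for a in batch["audio"]]
    out = extractor(audio, sampling_rate=16_000, padding="max_length",
                    max_length=16_000 * 4, truncation=True)
    out["labels"] = batch["emotion"]
    return out

ds = ds.map(preprocess, batched=True, remove_columns=ds["train"].column_names)

def accuracy(eval_pred):
    # Fraction of utterances whose predicted emotion matches the label.
    preds = np.argmax(eval_pred.predictions, axis=-1)
    return {"accuracy": float((preds == eval_pred.label_ids).mean())}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ser_wavlm", per_device_train_batch_size=8,
                           num_train_epochs=3, learning_rate=1e-4),
    train_dataset=ds["train"],
    eval_dataset=ds["test"],
    compute_metrics=accuracy,
)
trainer.train()
print(trainer.evaluate())   # reports accuracy on the held-out split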

References

  1. Schneider, S., Baevski, A., Collobert, R., Auli, M. wav2vec: Unsupervised Pre-Training for Speech Recognition // ArXiv (Cornell University). 2019.
  2. Cowie, R., Douglas-Cowie, E., Tsapatsoulis, N., Votsis, G., Kollias, S., Fellenz, W., Taylor, J. G. Emotion recognition in human-computer interaction // IEEE Signal Processing Magazine. 2001. V. 18. No. 1. Pp. 32–80.
  3. Kondratenko, V., Sokolov, A., Karpov, N., Kutuzov, O., Savushkin, N., Minkin, F. Large Raw Emotional Dataset with Aggregation Mechanism // ArXiv (Cornell University). 2022.
  4. Busso, C., Bulut, M., Lee, C.-C., Kazemzadeh, A., Mower, E., Kim, S., Chang, J. N., Lee, S., Narayanan, S. S. IEMOCAP: interactive emotional dyadic motion capture database // Language Resources and Evaluation. 2008. V. 42. No. 4. Pp. 335–359.
  5. Wagner, J., Triantafyllopoulos, A., Wierstorf, H., Schmitt, M., Burkhardt, F., Eyben, F., Schuller, B. W. Dawn of the Transformer Era in Speech Emotion Recognition: Closing the Valence Gap // IEEE Transactions on Pattern Analysis and Machine Intelligence. 2023. V. 45. No. 9. Pp. 10745–10759.
  6. Kossaifi, J., Walecki, R., Panagakis, Y., Shen, J., Schmitt, M., Ringeval, F., Han, J., Pandit, V., Toisoul, A., Schuller, B., Star, K., Hajiyev, E., Pantic, M. SEWA DB: A Rich Database for Audio-Visual Emotion and Sentiment Research in the Wild // IEEE Transactions on Pattern Analysis and Machine Intelligence. 2021. V. 43. No. 3. Pp. 1022–1040.
  7. Mohamad Nezami, O., Jamshid Lou, P., Karami, M. ShEMO: a large-scale validated database for Persian speech emotion detection // Language Resources and Evaluation. 2018. V. 53. No. 3. Pp. 1–16.
  8. Engberg, I. S., Hansen, A. V., Andersen, O. K., Dalsgaard, P. Design, recording and verification of a Danish emotional speech database // EUROSPEECH. 1997. V. 4. Pp. 1695–1698.
  9. Hozjan, V., Kačič, Z. Context-Independent Multilingual Emotion Recognition from Speech Signals // International Journal of Speech Technology. 2003. V. 6. Pp. 311–320.
  10. Lotfian, R., Busso, C. Building Naturalistic Emotionally Balanced Speech Corpus by Retrieving Emotional Speech from Existing Podcast Recordings // IEEE Transactions on Affective Computing. 2019. V. 10. No. 4. Pp. 471–483.
  11. Grimm, M., Kroschel, K., Narayanan, S. The Vera am Mittag German audio-visual emotional speech database // International Conference on Multimedia and Expo. 2008. Pp. 865–868.
  12. Livingstone, S. R., Russo, F. A. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English // PLOS ONE. 2018. V. 13. No. 5.
  13. Lubenets, I., Davidchuk, N., Amentes, A. Aniemore. GitHub. 2022. URL: https://github.com/aniemore/Aniemore
  14. Andrew, A. M. An Introduction to Support Vector Machines and Other Kernel‐based Learning Methods // Kybernetes. 2001. V. 30. No. 1. Pp. 103–115.
  15. Ho, T. K. Random decision forests // Proceedings of 3rd international conference on document analysis and recognition. 1995. V. 1. Pp. 278–282.
  16. Ali, S., Tanweer, S., Khalid, S., Rao, N. Mel Frequency Cepstral Coefficient: A Review // ICIDSSD. 2021.
  17. Zheng, W. Q., Yu, J. S., Zou, Y. X. An experimental study of speech emotion recognition based on deep convolutional neural networks // 2015 International Conference on Affective Computing and Intelligent Interaction (ACII). 2015. Pp. 827–831.
  18. Hochreiter, S., Schmidhuber, J. Long short-term memory // Neural computation. 1997. V. 9. No. 8. Pp. 1735–1780.
  19. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., Polosukhin, I. Attention Is All You Need // ArXiv (Cornell University). 2017.
  20. Devlin, J., Chang, M. W., Lee, K., Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding // ArXiv (Cornell University). 2018.
  21. Baevski, A., Zhou, H., Mohamed, A., Auli, M. wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations // ArXiv (Cornell University). 2020.
  22. Hsu, W.-N., Bolte, B., Tsai, Y.-H. H., Lakhotia, K., Salakhutdinov, R., Mohamed, A. HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units // IEEE/ACM Transactions on Audio, Speech, and Language Processing. 2021. V. 29. Pp. 3451–3460.
  23. Chen, S., Wang, C., Chen, Z., Wu, Y., Liu, S., Chen, Z., Li, J., Kanda, N., Yoshioka, T., Xiao, X., Wu, J., Zhou, L., Ren, S., Qian, Y., Qian, Y., Wu, J., Zeng, M., Yu, X., Wei, F. WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing // IEEE Journal of Selected Topics in Signal Processing. 2022. V. 16. No. 6. Pp. 1505–1518.
  24. Jang, E., Gu, S., Poole, B. Categorical Reparametrization with Gumbel-Softmax // ArXiv (Cornell University). 2016.
  25. Krizhevsky, A., Sutskever, I., Hinton, G. E. ImageNet Classification with Deep Convolutional Neural Networks // Communications of the ACM. 2012. V. 60. No. 6. Pp. 84–90.
  26. Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., Bengio, Y. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation // ArXiv (Cornell University). 2014.
  27. Yang, S., Chi, P. H., Chuang, Y. S., Lai, C. I. J., Lakhotia, K., Lin, Y. Y., Liu, A. T., Shi, J., Chang, X., Lin, G. T., Huang, T. H., Tseng, W. C., Lee, K., Liu, D. R., Huang, Z., Dong, S., Li, S. W., Watanabe, S., Mohamed, A., Lee, H. SUPERB: Speech processing Universal PERformance Benchmark // ArXiv (Cornell University). 2021.
  28. Chen, W., Xing, X., Xu, X., Pang, J., Du, L. SpeechFormer++: A Hierarchical Efficient Framework for Paralinguistic Speech Processing // IEEE/ACM Transactions on Audio, Speech, and Language Processing. 2023. V. 31. Pp. 775–788.
