Speech Emotion Recognition Using a Combination of Transformer and Convolutional Neural Networks
Subject Areas : Renewable energy
Yousef Pourebrahim 1 , Farbod Razzazi 2 , Hossein Sameti 3
1 - Department of Electrical and Computer Engineering, Science and Research Branch, Islamic Azad University, Tehran, Iran
2 - Department of Electrical and Computer Engineering, Science and Research Branch, Islamic Azad University, Tehran, Iran
3 - Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
Keywords: Classification, Emotion recognition, Deep neural networks, Speech signal processing
Abstract :
Speech emotion recognition has attracted the attention of many researchers in recent years because of its wide range of applications. With the advance of deep neural network training methods and their widespread use in various applications, this paper investigates a new combination of convolutional and transformer networks for speech emotion recognition that is easier to implement than existing methods and performs well. To this end, a basic convolutional neural network and a basic transformer are introduced, and a new model is then built from them in which the output of the basic convolutional network is the input of the basic transformer network. The results show that transformer neural networks recognize some emotional categories better than the convolutional neural network-based method. The paper also shows that simple neural networks used in combination can perform better in recognizing emotions from speech. On the RAVDESS dataset, the combination of a convolutional neural network and a transformer, called the convolutional-transformer (CTF), achieved an accuracy of 80.94%, whereas a simple convolutional neural network achieved an accuracy of about 72.7%. Combining simple neural networks can not only increase recognition accuracy but also reduce training time and the need for labeled training samples.
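The data flow described in the abstract, where the output of a convolutional network becomes the input of a transformer, can be illustrated with a minimal forward-pass sketch. This is not the authors' implementation: the layer sizes, the single attention head, the mean pooling, the random weights, and the names `conv1d_relu`, `self_attention`, `ctf_forward`, and `init_params` are all assumptions made for illustration. The input is assumed to be a sequence of 40-dimensional frame-level features, and the eight output classes correspond to the eight emotion labels of RAVDESS speech.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d_relu(x, w, b):
    """Valid 1-D convolution over time followed by ReLU.
    x: (T, C_in), w: (K, C_in, C_out), b: (C_out,) -> (T - K + 1, C_out)."""
    k = w.shape[0]
    # Stack sliding windows of K consecutive frames: (T - K + 1, K, C_in).
    windows = np.stack([x[i:i + k] for i in range(x.shape[0] - k + 1)])
    out = np.einsum("tkc,kco->to", windows, w) + b
    return np.maximum(out, 0.0)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(h, wq, wk, wv):
    """Single-head scaled dot-product self-attention with a residual link."""
    q, k, v = h @ wq, h @ wk, h @ wv
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (T', T')
    return h + weights @ v

def init_params(n_in=40, d_model=32, kernel=5, n_classes=8, scale=0.1):
    # Hypothetical sizes; random weights stand in for trained ones.
    return {
        "conv_w": rng.normal(0.0, scale, (kernel, n_in, d_model)),
        "conv_b": np.zeros(d_model),
        "wq": rng.normal(0.0, scale, (d_model, d_model)),
        "wk": rng.normal(0.0, scale, (d_model, d_model)),
        "wv": rng.normal(0.0, scale, (d_model, d_model)),
        "out_w": rng.normal(0.0, scale, (d_model, n_classes)),
        "out_b": np.zeros(n_classes),
    }

def ctf_forward(features, params):
    # 1) Convolutional stage extracts local patterns from the frames.
    h = conv1d_relu(features, params["conv_w"], params["conv_b"])
    # 2) The convolutional output is fed to the attention (transformer) stage.
    h = self_attention(h, params["wq"], params["wk"], params["wv"])
    # 3) Average over time and classify into emotion categories.
    pooled = h.mean(axis=0)
    return softmax(pooled @ params["out_w"] + params["out_b"])

features = rng.normal(size=(100, 40))  # assumed: 100 frames x 40 coefficients
probs = ctf_forward(features, init_params())
print(probs.shape)
```

A full transformer encoder would add multi-head attention, layer normalization, a feed-forward sublayer, and positional information. The sketch keeps only the attention step so that the chaining of the two stages stays visible.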