Acoustic echo suppression method based on a recurrent neural network and a clustering algorithm

Free access

This article addresses acoustic echo suppression with a neural network that estimates the ideal binary mask (IBM) from features extracted from the mixture of near-end and far-end signals. The novelty of the proposed method is the use of a clustering algorithm in addition to a bidirectional recurrent neural network (BLSTM). To assess the EM, Mean-Shift, and k-Means clustering algorithms, the models were trained and tested on the TIMIT database. For each model, the ERLE, PESQ, and STOI metrics were computed to characterize its quality. At a signal-to-echo ratio of 10 dB, adding the EM or Mean-Shift clustering algorithm proved ineffective compared with BLSTM alone. At a signal-to-echo ratio of 6 dB, BLSTM+Mean-Shift gave a slight improvement in PESQ over BLSTM alone. The experiments showed that combining the BLSTM network with the k-Means algorithm is more effective than the plain BLSTM for echo suppression in double-talk scenarios. At a signal-to-echo ratio of 10 dB, the STOI metric, which characterizes speech intelligibility, improved by 7%, and the PESQ metric, which characterizes the quality of the restored speech, improved by 18.8%.

Keywords: ideal binary mask, near-end signal, far-end signal, bidirectional recurrent neural network, clustering, double-talk

Short URL: https://sciup.org/147238110

IDR: 147238110   |   UDC: 004.032.26   |   DOI: 10.14529/cmse220204

Method of an acoustic echo suppression based on recurrent neural network and clustering

The article addresses acoustic echo suppression using a neural network that estimates an ideal binary mask (IBM) from features extracted from a mixture of near-end and far-end signals. The novelty of the proposed method lies in using a clustering algorithm in addition to the bidirectional recurrent neural network (BLSTM). To evaluate the EM, Mean-Shift, and k-Means clustering algorithms, the models were trained and tested on the TIMIT database. For each model, the ERLE, PESQ, and STOI metrics were calculated to characterize its quality. At a signal-to-echo ratio of 10 dB, the EM and Mean-Shift clustering algorithms proved ineffective compared with BLSTM alone. At a signal-to-echo ratio of 6 dB, BLSTM+Mean-Shift yielded a marginal improvement in PESQ over BLSTM alone. The experiments show that BLSTM combined with the k-Means algorithm is more effective than a pure BLSTM for echo suppression in double-talk scenarios. At a signal-to-echo ratio of 10 dB, the STOI metric, which characterizes speech intelligibility, improved by 7%, and the PESQ metric, which characterizes the quality of speech restoration, improved by 18.8%.
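
The quantities named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it covers only the standard definitions of the signal-to-echo ratio, the ideal binary mask, and ERLE, and leaves out the BLSTM and the clustering stage. The function names, the 320-sample frame with a 160-sample hop, the 0 dB mask threshold `lc_db`, and the white-noise stand-ins for speech are all assumptions made for illustration.

```python
import numpy as np

def scale_echo_to_ser(near, echo, ser_db):
    """Scale `echo` so that 10*log10(P_near / P_echo) equals `ser_db`."""
    p_near = np.mean(near ** 2)
    p_echo = np.mean(echo ** 2)
    gain = np.sqrt(p_near / (p_echo * 10.0 ** (ser_db / 10.0)))
    return gain * echo

def stft_mag(x, frame=320, hop=160):
    """Magnitude spectrogram with a Hann window; shape (frames, bins)."""
    window = np.hanning(frame)
    n_frames = 1 + (len(x) - frame) // hop
    frames = np.stack(
        [x[i * hop:i * hop + frame] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1))

def ideal_binary_mask(near_mag, echo_mag, lc_db=0.0, eps=1e-12):
    """Return 1 where the near-end-to-echo ratio of a time-frequency
    unit exceeds `lc_db` (assumed threshold), and 0 elsewhere."""
    local_ratio_db = 20.0 * np.log10((near_mag + eps) / (echo_mag + eps))
    return (local_ratio_db > lc_db).astype(np.float32)

def erle_db(mic, residual, eps=1e-12):
    """Echo return loss enhancement: 10*log10(E[mic^2] / E[residual^2])."""
    num = np.mean(mic ** 2) + eps
    den = np.mean(residual ** 2) + eps
    return 10.0 * np.log10(num / den)

# Synthetic stand-ins for one second of audio at 16 kHz (not real speech).
rng = np.random.default_rng(0)
near = rng.standard_normal(16000)
echo = scale_echo_to_ser(near, rng.standard_normal(16000), ser_db=10.0)

# Achieved signal-to-echo ratio and the oracle mask for this mixture.
ser = 10.0 * np.log10(np.mean(near ** 2) / np.mean(echo ** 2))
mask = ideal_binary_mask(stft_mag(near), stft_mag(echo))
```

In a mask-based suppressor of this kind, the network would be trained to predict `mask` from features of the microphone and far-end signals, and the predicted mask would then be applied to the microphone spectrogram. Here the mask is computed from the clean components directly, which is possible only when both are known, as in simulated training data.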
