Speech Enhancement Based on a Two-Branch Nested U-Net Architecture Using TS-Conformer
Authors: Hanna Deepa Mallolu, Sunny Dayal Vanambathina
Journal: International Journal of Image, Graphics and Signal Processing @ijigsp
Issue: 3, vol. 18, 2026.
Open access
Transformers capture long-range dependencies well through self-attention, but they face limitations in speech processing tasks. In particular, they can lack the inductive biases needed to efficiently model the local, fine-grained temporal and spectral structures that are critical for speech perception, which leads to suboptimal handling of fine details. To address this issue, this paper introduces a speech enhancement (SE) network, NUNet-Conformer, that builds on a two-branch nested U-Net framework integrated with a two-stage conformer (TS-Conformer). The nested U-Net employs dual decoding branches for simultaneous spectral mapping and mask estimation, enabling complementary learning of speech characteristics. The TS-Conformer models temporal and frequency dependencies sequentially to improve contextual representation while maintaining local continuity. In addition, a complex feature extraction unit (CFEU-i) is incorporated to enhance multi-scale feature learning in the complex domain. By combining hierarchical feature extraction with sequential spectro-temporal modeling, the proposed method suppresses noise while preserving speech quality. Experimental results show that the proposed NUNet-Conformer outperforms recent SE approaches in terms of Signal-to-Distortion Ratio (SDR), Short-Time Objective Intelligibility (STOI), and Perceptual Evaluation of Speech Quality (PESQ).
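The sequential time-then-frequency idea behind the TS-Conformer can be illustrated with a minimal sketch. It assumes a spectrogram stored as a list of T frames with F bins each, and it uses a hypothetical 3-point moving average (`smooth`) as a stand-in for a conformer block. The names `smooth` and `two_stage` are illustrative assumptions, not the paper's implementation; only the order of the two stages follows the abstract.

```python
# Illustrative sketch only: `smooth` is a hypothetical placeholder for a
# conformer block; it is NOT the TS-Conformer described in the paper.

def smooth(seq):
    """Placeholder sequence model: 3-point moving average with edge clamping."""
    n = len(seq)
    out = []
    for i in range(n):
        left = seq[max(i - 1, 0)]        # clamp at the start of the sequence
        right = seq[min(i + 1, n - 1)]   # clamp at the end of the sequence
        out.append((left + seq[i] + right) / 3.0)
    return out


def two_stage(spec, seq_model=smooth):
    """Apply seq_model along time for each frequency bin, then along
    frequency for each time frame. `spec` is a list of T frames,
    each a list of F values; the output has the same shape."""
    T, F = len(spec), len(spec[0])
    # Stage 1 (time axis): each frequency bin is a length-T sequence.
    cols = [seq_model([spec[t][f] for t in range(T)]) for f in range(F)]
    stage1 = [[cols[f][t] for f in range(F)] for t in range(T)]
    # Stage 2 (frequency axis): each time frame is a length-F sequence.
    return [seq_model(frame) for frame in stage1]


# A single time-frequency peak in a 3x3 toy spectrogram.
spec = [[0.0, 0.0, 0.0],
        [0.0, 9.0, 0.0],
        [0.0, 0.0, 0.0]]
out = two_stage(spec)
```

In this toy case the first stage spreads the peak along the time axis and the second spreads it along the frequency axis, so context from both axes reaches every cell while the tensor shape is unchanged. A real model would replace `smooth` with learned attention and convolution modules.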
Keywords: Speech Enhancement, TS-Conformer, Nested U-Net, Two-Branch Decoding
Short URL: https://sciup.org/15020411
IDR: 15020411 | DOI: 10.5815/ijigsp.2026.03.07