Speech Enhancement Based on a Two-Branch Nested U-Net Architecture Using TS-Conformer

Hanna Deepa Mallolu; Sunny Dayal Vanambathina

doi:10.5815/ijigsp.2026.03.07


Journal: International Journal of Image, Graphics and Signal Processing @ijigsp

Issue: Vol. 18, No. 3, 2026

Free access

Transformers capture long-range dependencies well through self-attention, but they face limitations in speech processing tasks. In particular, they can lack the inherent inductive biases needed to efficiently model the local, fine-grained temporal and spectral structures that are critical for speech perception, so fine details are handled suboptimally. To address this issue, this paper introduces a speech enhancement (SE) network built on a two-branch nested U-Net framework integrated with a two-stage conformer (TS-Conformer). The nested U-Net employs dual decoding branches for simultaneous spectral mapping and mask estimation, enabling complementary learning of speech characteristics. The TS-Conformer models temporal and then frequency dependencies in sequence, improving contextual representation while maintaining local continuity. In addition, a complex feature extraction unit (CFEU-i) is incorporated to enhance multi-scale feature learning in the complex domain. By combining hierarchical feature extraction with sequential spectro-temporal modeling, the proposed method suppresses noise while preserving speech quality. Experimental results demonstrate that the proposed NUNet-Conformer outperforms recent SE approaches in terms of Signal-to-Distortion Ratio (SDR), Short-Time Objective Intelligibility (STOI), and Perceptual Evaluation of Speech Quality (PESQ).
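The sequential time-then-frequency modeling described above can be illustrated with a minimal NumPy sketch. This is not the authors' TS-Conformer: it omits the convolution and feed-forward modules of a conformer and uses weight-free self-attention, purely to show how a (channels, time, frequency) feature map might be reshaped so that attention runs first along the time axis and then along the frequency axis. The function names `self_attention` and `two_stage_block`, and the tensor layout, are assumptions made for illustration.

```python
import numpy as np

def self_attention(x):
    """Weight-free scaled dot-product self-attention over axis -2.

    x has shape (..., length, dim); no learned projections are used
    (an assumption that keeps the sketch short).
    """
    d = x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(d)   # (..., length, length)
    scores -= scores.max(axis=-1, keepdims=True)       # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over keys
    return weights @ x                                 # (..., length, dim)

def two_stage_block(feats):
    """Attend along time, then along frequency.

    feats: array of shape (channels, time, freq) -- hypothetical layout.
    Returns an array of the same shape.
    """
    # Stage 1: treat each frequency bin as a sequence over time.
    t_in = feats.transpose(2, 1, 0)          # (freq, time, channels)
    t_out = t_in + self_attention(t_in)      # residual connection
    # Stage 2: treat each time frame as a sequence over frequency.
    f_in = t_out.transpose(1, 0, 2)          # (time, freq, channels)
    f_out = f_in + self_attention(f_in)      # residual connection
    return f_out.transpose(2, 0, 1)          # back to (channels, time, freq)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((4, 6, 5))       # 4 channels, 6 frames, 5 bins
    print(two_stage_block(x).shape)
```

Running the two stages in sequence rather than attending over the full time-frequency grid keeps each attention matrix small (time × time, then frequency × frequency), which is one plausible motivation for a two-stage design.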

Keywords: Speech Enhancement, TS-Conformer, Nested U-Net, Two-Branch Decoding

Short URL: https://sciup.org/15020411

IDR: 15020411 | DOI: 10.5815/ijigsp.2026.03.07