Abstract: Current variants of the Convolutional Recurrent Network (CRN) built on single-masking or single-mapping encoder-decoder structures often extract limited features, capture global characteristics poorly, and carry large parameter counts. To address these issues, this paper proposes an efficient single-channel speech enhancement network that combines a multi-feature aggregation convolution module, leveraging joint complex-spectrum masking and mapping, with an efficient Transformer-based attention mechanism. In the encoder-decoder layers, a Dual-branch Gated Cooperative Unit (DGCU) is designed to interact with and aggregate multi-level complex spectral features, addressing the problem of singular feature extraction. The intermediate layer incorporates a Channel-Time-Frequency Attention Fusion Module that focuses on spatial and time-frequency local detail features of speech. Ablation and comparative experiments on the THCHS30 dataset demonstrate that the network achieves lightweight efficiency with the lowest parameter count and relatively low computational cost. It improves PESQ by 10.5%~50.6% under matched noise conditions and by 16.3%~94.5% under mismatched noise conditions. Both objective and subjective metrics outperform those of the other comparison models, demonstrating superior noise-reduction performance and network generalization capability.