Atss-Net: Target Speaker Separation via Attention-based Neural Network

This webpage provides listening examples for our proposed Atss-Net and the baseline VoiceFilter [1]. Compared with the CNN-LSTM architecture, Atss-Net computes the correlations between features in parallel and extracts richer features with shallower layers. We include sample audio demonstrating Atss-Net's promising performance in speech enhancement.
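To illustrate the "correlations between features in parallel" claim, below is a minimal sketch of scaled dot-product self-attention producing a soft time-frequency mask from a magnitude spectrogram. This is an illustrative NumPy sketch only, not the paper's exact architecture; the function name and projection matrices are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_mask(spec, W_q, W_k, W_v):
    """Toy self-attention over spectrogram frames (hypothetical sketch).

    spec: (T, F) magnitude spectrogram; W_q, W_k, W_v: (F, F) projections.
    Returns a (T, F) soft mask in (0, 1) to apply to the mixture spectrogram.
    """
    Q, K, V = spec @ W_q, spec @ W_k, spec @ W_v
    d = Q.shape[-1]
    # Every frame attends to every other frame at once: the (T, T) score
    # matrix is computed in a single matrix product, i.e. in parallel,
    # unlike the sequential recurrence of an LSTM.
    att = softmax(Q @ K.T / np.sqrt(d))
    return 1.0 / (1.0 + np.exp(-(att @ V)))  # sigmoid -> soft T-F mask

# Usage: the separated estimate is mask * spec (element-wise masking).
```

A single layer already relates all frame pairs, which is why an attention stack can stay shallower than a recurrent one while still modeling long-range correlations.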

Audio Samples from the Test Set: Two-Speaker Mixtures

Type indicates the genders of the two speakers in the mixture (F = female, M = male).

| Type | Mixture Input | VoiceFilter [1] | Proposed Atss-Net | Target Speech |
| F-F  | (audio)       | (audio)         | (audio)           | (audio)       |
| M-M  | (audio)       | (audio)         | (audio)           | (audio)       |
| M-F  | (audio)       | (audio)         | (audio)           | (audio)       |
| M-F  | (audio)       | (audio)         | (audio)           | (audio)       |
| M-F  | (audio)       | (audio)         | (audio)           | (audio)       |

Speech Enhancement

Types:

| Type | Female Input | Proposed Atss-Net | Male Input | Proposed Atss-Net |
| C-W  | (audio)      | (audio)           | (audio)    | (audio)           |
| F-M  | (audio)      | (audio)           | (audio)    | (audio)           |
| P-M  | (audio)      | (audio)           | (audio)    | (audio)           |
| R-M  | (audio)      | (audio)           | (audio)    | (audio)           |
| F-T  | (audio)      | (audio)           | (audio)    | (audio)           |

Reference

[1] Q. Wang, H. Muckenhirn, K. Wilson, et al., "VoiceFilter: Targeted voice separation by speaker-conditioned spectrogram masking," in Proc. Interspeech, 2019, pp. 2728-2732.