Atss-Net: Target Speaker Separation via Attention-based Neural Network

Tingle Li Qingjian Lin Yuanyuan Bao Ming Li
Interspeech 2020

This webpage provides listening examples for our proposed Atss-Net and the baseline VoiceFilter [1]. Compared with the CNN-LSTM architecture of VoiceFilter, Atss-Net computes the correlations between features in parallel and extracts richer features with shallower layers. We also include audio samples demonstrating Atss-Net's promising performance in speech enhancement.
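The "parallel correlation" the paragraph above refers to is the defining property of an attention layer: every time frame attends to every other frame in a single matrix product, rather than through a sequential recurrence as in an LSTM. The sketch below shows generic scaled dot-product self-attention with toy dimensions; it is an illustration of the mechanism, not the authors' actual Atss-Net implementation.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V.

    The (T, T) score matrix holds all pairwise frame-to-frame
    correlations, computed in one matrix product, for all frames at once.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                 # (T, T) correlations
    scores -= scores.max(axis=-1, keepdims=True)  # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Toy spectrogram-like features: T=4 time frames, d=8 feature dims
# (illustrative sizes only).
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))
out, attn = scaled_dot_product_attention(X, X, X)
print(out.shape)   # (4, 8): one attended feature vector per frame
```

In self-attention, queries, keys, and values all come from the same feature sequence, so each output frame is a weighted mix of every input frame.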

Audio Samples from the Test Set for Two-Speaker Mixtures

Each card corresponds to a mixture type: F-F (female-female), M-M (male-male), or M-F (male-female). The tracks are the mixture input, the VoiceFilter baseline, the proposed Atss-Net, and the target speech.

F-F: Female-Female Mixture

Mixture Input
VoiceFilter
Atss-Net
Target

M-M: Male-Male Mixture

Mixture Input
VoiceFilter
Atss-Net
Target

M-F (1): Male-Female Mixture

Mixture Input
VoiceFilter
Atss-Net
Target

M-F (2): Male-Female Mixture

Mixture Input
VoiceFilter
Atss-Net
Target

M-F (3): Male-Female Mixture

Mixture Input
VoiceFilter
Atss-Net
Target

Speech Enhancement

Speech enhancement results under various background noise types. Each card shows the noisy input and the Atss-Net output for a female and a male speaker under the same noise condition.

Chinese Woman

Female Input
Female Atss-Net
Male Input
Male Atss-Net

Foreign Man

Female Input
Female Atss-Net
Male Input
Male Atss-Net

Pure Music

Female Input
Female Atss-Net
Male Input
Male Atss-Net

Rap Music

Female Input
Female Atss-Net
Male Input
Male Atss-Net

Far-field TV

Female Input
Female Atss-Net
Male Input
Male Atss-Net

Reference

  [1] Q. Wang, H. Muckenhirn, K. Wilson, et al., "VoiceFilter: Targeted voice separation by speaker-conditioned spectrogram masking," in Interspeech, 2019, pp. 2728-2732.