This webpage provides listening examples for our proposed Atss-Net and the baseline VoiceFilter [1]. Atss-Net computes the correlation between features in parallel and uses shallower layers to extract more features than the CNN-LSTM architecture. We also include audio samples demonstrating Atss-Net's promising performance in speech enhancement.
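This page does not spell out how the correlation between features is computed. Assuming it refers to scaled dot-product self-attention, the following minimal pure-Python sketch shows how every frame can be scored against every other frame independently, which is what makes the computation parallelizable. The names `softmax` and `self_attention`, the identity query/key/value projections, and the toy input are illustrative assumptions, not the Atss-Net implementation.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(frames):
    """Scaled dot-product self-attention with identity projections (assumption).

    frames: list of T feature vectors, each of length d.
    Returns (weights, out): T x T attention weights and T x d outputs.
    """
    d = len(frames[0])
    scale = math.sqrt(d)
    # Correlation (dot product) of every frame with every other frame.
    # Each row depends only on the input, not on other rows, so all rows
    # could be computed in parallel, unlike a recurrent (LSTM) pass.
    scores = [
        [sum(a * b for a, b in zip(q, k)) / scale for k in frames]
        for q in frames
    ]
    weights = [softmax(r) for r in scores]
    # Each output frame is a weighted sum of all input frames.
    out = [
        [sum(w * v[j] for w, v in zip(wrow, frames)) for j in range(d)]
        for wrow in weights
    ]
    return weights, out


# Toy input: three 2-dimensional "frames" (hypothetical values).
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, out = self_attention(frames)
```

In this toy input, the first frame attends equally to itself and to the third frame, and less to the second, because its dot product with the second frame is zero.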
Audio Samples from the Test Set for Two-Speaker Mixtures
Each card shows a mixture type: F-F (female-female), M-M (male-male), and M-F (male-female). Tracks are Mixture Input, VoiceFilter baseline, proposed Atss-Net, and Target Speech.
F-F: Female-Female Mixture
M-M: Male-Male Mixture
M-F (1): Male-Female Mixture
M-F (2): Male-Female Mixture
M-F (3): Male-Female Mixture
Speech Enhancement
Speech enhancement results for various types of background noise. Each card shows the noisy input and the Atss-Net output for both a female and a male speaker under the same noise condition.
Chinese Woman
Foreign Man
Pure Music
Rap Music
Far-field TV
Reference
[1] Q. Wang, H. Muckenhirn, K. Wilson, et al., "VoiceFilter: Targeted voice separation by speaker-conditioned spectrogram masking," in Interspeech, 2019, pp. 2728-2732.