Atss-Net: Target Speaker Separation via Attention-based Neural Network

This webpage presents listening examples for our proposed Atss-Net and the baseline VoiceFilter [1]. Compared with the CNN-LSTM architecture, the attention mechanism allows the network to compute the correlation between features in parallel and to extract richer features with shallower layers. We also provide samples showing that Atss-Net achieves promising performance in speech enhancement.
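To make the attention-based masking idea concrete, the following is a minimal sketch, not the authors' implementation: self-attention over spectrogram frames, conditioned on a target-speaker embedding, predicts a soft separation mask. The layer sizes, the use of a standard Transformer encoder, and the concatenation-based speaker conditioning are illustrative assumptions.

    # Minimal sketch (illustrative assumptions, not the Atss-Net reference code):
    # attention over all frames in parallel, conditioned on a speaker embedding,
    # producing a soft mask applied to the mixture magnitude spectrogram.
    import torch
    import torch.nn as nn

    class AttentionMaskEstimator(nn.Module):
        def __init__(self, n_freq=257, d_emb=256, d_model=256, n_heads=4, n_layers=2):
            super().__init__()
            self.proj = nn.Linear(n_freq + d_emb, d_model)
            layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                               batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
            self.mask = nn.Sequential(nn.Linear(d_model, n_freq), nn.Sigmoid())

        def forward(self, mix_mag, spk_emb):
            # mix_mag: (batch, time, n_freq) mixture magnitude spectrogram
            # spk_emb: (batch, d_emb) target-speaker embedding
            t = mix_mag.size(1)
            cond = spk_emb.unsqueeze(1).expand(-1, t, -1)      # broadcast over time
            x = self.proj(torch.cat([mix_mag, cond], dim=-1))  # condition each frame
            x = self.encoder(x)                                # attention across all frames in parallel
            m = self.mask(x)                                   # soft mask in [0, 1]
            return m * mix_mag                                 # estimated target spectrogram

    # Example usage with random tensors
    model = AttentionMaskEstimator()
    est = model(torch.randn(2, 100, 257).abs(), torch.randn(2, 256))
    print(est.shape)  # torch.Size([2, 100, 257])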

Audio samples from the test set for two-speaker mixtures

Type:

  1. F-F: female-female mixture speech
  2. M-M: male-male mixture speech
  3. M-F: male-female mixture speech
[Table of audio players on the original page: columns are Mixture Input, VoiceFilter [1], Proposed Atss-Net, and Target Speech; rows are one F-F example, one M-M example, and three M-F examples.]

Speech Enhancement

Type:

  1. C-W: mixed with music sung by a Chinese female vocalist
  2. F-M: mixed with music sung by a foreign male vocalist
  3. P-M: mixed with pure (instrumental) music
  4. R-M: mixed with rap music
  5. F-T: mixed with far-field TV audio
[Table of audio players on the original page: for each noise type (C-W, F-M, P-M, R-M, F-T), columns are Female Input, Proposed Atss-Net output, Male Input, and Proposed Atss-Net output.]

Reference

[1] Q. Wang, H. Muckenhirn, K. Wilson, et al., "VoiceFilter: Targeted voice separation by speaker-conditioned spectrogram masking," in Proc. Interspeech, 2019, pp. 2728-2732.