Atss-Net: Target Speaker Separation via Attention-based Neural Network

This webpage presents listening examples for our proposed Atss-Net and the baseline VoiceFilter [1]. Compared with the CNN-LSTM architecture, the attention mechanism allows the network to compute the correlation between features in parallel and to extract richer features with shallower layers. We also provide samples showing that Atss-Net achieves promising performance in speech enhancement.
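To make the attention-based masking idea concrete, the following is a minimal sketch, not the authors' implementation: self-attention over spectrogram frames, conditioned on a target-speaker embedding, predicts a soft separation mask. The layer sizes, the use of a standard Transformer encoder, and the concatenation-based speaker conditioning are illustrative assumptions.

    # Minimal sketch (illustrative assumptions, not the Atss-Net reference code):
    # attention over all frames in parallel, conditioned on a speaker embedding,
    # producing a soft mask applied to the mixture magnitude spectrogram.
    import torch
    import torch.nn as nn

    class AttentionMaskEstimator(nn.Module):
        def __init__(self, n_freq=257, d_emb=256, d_model=256, n_heads=4, n_layers=2):
            super().__init__()
            self.proj = nn.Linear(n_freq + d_emb, d_model)
            layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                               batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
            self.mask = nn.Sequential(nn.Linear(d_model, n_freq), nn.Sigmoid())

        def forward(self, mix_mag, spk_emb):
            # mix_mag: (batch, time, n_freq) mixture magnitude spectrogram
            # spk_emb: (batch, d_emb) target-speaker embedding
            t = mix_mag.size(1)
            cond = spk_emb.unsqueeze(1).expand(-1, t, -1)      # broadcast over time
            x = self.proj(torch.cat([mix_mag, cond], dim=-1))  # condition each frame
            x = self.encoder(x)                                # attention across all frames in parallel
            m = self.mask(x)                                   # soft mask in [0, 1]
            return m * mix_mag                                 # estimated target spectrogram

    # Example usage with random tensors
    model = AttentionMaskEstimator()
    est = model(torch.randn(2, 100, 257).abs(), torch.randn(2, 256))
    print(est.shape)  # torch.Size([2, 100, 257])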

Audio samples from the test set for two-speaker mixtures

Type:

  1. F-F: female-female mixture speech
  2. M-M: male-male mixture speech
  3. M-F: male-female mixture speech
[Table of audio players on the original page: columns are Mixture Input, VoiceFilter [1], Proposed Atss-Net, and Target Speech; rows are one F-F example, one M-M example, and three M-F examples.]

Speech Enhancement

Type:

  1. C-W: mixed with music sung by a Chinese female vocalist
  2. F-M: mixed with music sung by a foreign male vocalist
  3. P-M: mixed with pure (instrumental) music
  4. R-M: mixed with rap music
  5. F-T: mixed with far-field TV audio
[Table of audio players on the original page: for each noise type (C-W, F-M, P-M, R-M, F-T), columns are Female Input, Proposed Atss-Net output, Male Input, and Proposed Atss-Net output.]

Reference

[1] Q. Wang, H. Muckenhirn, K. Wilson, et al., "VoiceFilter: Targeted voice separation by speaker-conditioned spectrogram masking," in Proc. Interspeech, 2019, pp. 2728-2732.