Self-Supervised Audio-Visual Soundscape Stylization

ECCV 2024


Audio-visual soundscape stylization. We learn through self-supervision to manipulate input speech so that it sounds as though it were recorded within a given scene.

Abstract

Speech sounds convey a great deal of information about the scenes in which they occur, resulting in a variety of effects ranging from reverberation to additional ambient sounds. In this paper, we manipulate input speech to make it sound as though it were recorded within a different scene, given an audio-visual conditional example recorded from that scene. Our model learns through self-supervision, taking advantage of the fact that natural video contains recurring sound events and textures. We extract an audio clip from a video and apply speech enhancement. We then train a latent diffusion model to recover the original speech, using another audio-visual clip taken from elsewhere in the video as a conditional hint. Through this process, the model learns to transfer the conditional example's sound properties to the input speech. We show that our model can be successfully trained using unlabeled, in-the-wild videos, and that an additional visual signal can improve its sound prediction abilities.
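The self-supervised training pairs described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `make_training_pair` and its toy `enhance` gate are hypothetical stand-ins for the real pretrained speech enhancer and data pipeline, and the visual stream is omitted for brevity.

```python
import numpy as np

def make_training_pair(audio, sr=16000, clip_len=2.0, rng=None):
    """Sample an input clip and a conditional clip from the same video's
    audio track, mimicking the self-supervised setup: the model is trained
    to map the enhanced (cleaned) clip back to the original, using the
    second clip as a hint about the scene's acoustics.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    n = int(clip_len * sr)
    assert len(audio) >= 2 * n, "video too short for two disjoint clips"
    # Pick two non-overlapping windows: one becomes the input,
    # the other the conditional hint from elsewhere in the video.
    start_a = rng.integers(0, len(audio) - 2 * n)
    start_b = rng.integers(start_a + n, len(audio) - n + 1)
    target = audio[start_a:start_a + n]     # original (in-scene) speech
    condition = audio[start_b:start_b + n]  # hint from another time step

    def enhance(x):
        # Toy "speech enhancement": zero out low-energy background samples.
        # A real system would use a pretrained enhancement model here.
        gate = np.abs(x) > 0.1 * np.abs(x).max()
        return x * gate

    clean_input = enhance(target)  # model must map clean_input -> target
    return clean_input, target, condition
```

A generative model trained on such triples implicitly learns to re-apply the scene's reverberation and ambient sounds to clean speech.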


Generalization to Out-of-Domain Video



Generalization to Non-Speech Input



Stylization Results with In-Domain Examples



Comparison to Uni-Modal Models


Soundscape Stylization by Conditional Speech De-Enhancement

We observe that the background noises and acoustic properties within a video tend to exhibit temporal coherence, especially when sound events occur repeatedly. Moreover, similar sound events often share semantically similar visual appearances. By providing the model with a conditional example from another time step in the video, the model can implicitly estimate the scene properties and transfer them to the input audio (e.g., the reverb and ambient background sounds). At test time, we provide a clip taken from a different scene as conditioning, forcing the model to match the style of the desired scene.
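At test time, stylization amounts to conditional sampling from the trained diffusion model. The sketch below is a toy illustration of that sampling loop, not the paper's system: `denoise_fn` and `dummy_denoiser` are hypothetical stand-ins for the trained latent diffusion model, and the linear noise schedule is a simplification of a real DDPM schedule.

```python
import numpy as np

def stylize(clean_speech, condition, denoise_fn, steps=50, rng=None):
    """Toy diffusion-style sampling loop: starting from Gaussian noise,
    iteratively denoise while conditioning on the input speech and a clip
    from the target scene. `denoise_fn(x_t, noise_level, clean, cond)`
    should return an estimate of the clean sample x_0.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    x = rng.normal(size=clean_speech.shape)  # start from pure noise
    for t in reversed(range(1, steps + 1)):
        noise_level = t / steps              # crude linear schedule
        x0_hat = denoise_fn(x, noise_level, clean_speech, condition)
        # Re-noise the estimate down to the next (lower) noise level;
        # at the final step this reduces to x = x0_hat.
        next_level = (t - 1) / steps
        x = (1 - next_level) * x0_hat + next_level * rng.normal(size=x.shape)
    return x

def dummy_denoiser(x_t, noise_level, clean, cond):
    # Hypothetical stand-in for the trained model: keep the speech content
    # and add a residual scaled by the conditional clip's energy.
    return clean + 0.1 * cond.std() * np.tanh(x_t)
```

Swapping in a conditional clip from a different scene at this stage is what lets the same trained model restyle speech for any target soundscape.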


Dataset



Paper and Supplementary Material


Tingle Li, Renhao Wang, Po-Yao Huang, Andrew Owens, Gopala Anumanchipalli

Self-Supervised Audio-Visual Soundscape Stylization

ECCV 2024

BibTeX

@inproceedings{li2024self,
  title={Self-Supervised Audio-Visual Soundscape Stylization},
  author={Li, Tingle and Wang, Renhao and Huang, Po-Yao and Owens, Andrew and Anumanchipalli, Gopala},
  booktitle={ECCV},
  year={2024}
}

Acknowledgements

We thank Alexei A. Efros, Justin Salamon, Bryan Russell, Hao-Wen Dong, and Ziyang Chen for their helpful discussions, and Baihe Huang for proofreading the paper. This work was funded in part by the Society of Hellman Fellows and a Sony Research Award.