Self-Supervised Audio-Visual Soundscape Stylization

ECCV 2024


Audio-visual soundscape stylization. We learn through self-supervision to manipulate input speech so that it sounds as though it were recorded within a given scene.

Abstract

Speech sounds convey a great deal of information about the scenes in which they occur, resulting in a variety of effects ranging from reverberation to additional ambient sounds. In this paper, we manipulate input speech to make it sound as though it were recorded within a different scene, given an audio-visual conditional example recorded from that scene. Our model learns through self-supervision, taking advantage of the fact that natural video contains recurring sound events and textures. We extract an audio clip from a video and apply speech enhancement. We then train a latent diffusion model to recover the original speech, using another audio-visual clip taken from elsewhere in the video as a conditional hint. Through this process, the model learns to transfer the conditional example's sound properties to the input speech. We show that our model can be successfully trained using unlabeled, in-the-wild videos, and that an additional visual signal can improve its sound prediction abilities.
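The self-supervised training pairs described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `make_training_pair` and its toy `enhance` gate are hypothetical stand-ins for the real pretrained speech enhancer and data pipeline, and the visual stream is omitted for brevity.

```python
import numpy as np

def make_training_pair(audio, sr=16000, clip_len=2.0, rng=None):
    """Sample an input clip and a conditional clip from the same video's
    audio track, mimicking the self-supervised setup: the model is trained
    to map the enhanced (cleaned) clip back to the original, using the
    second clip as a hint about the scene's acoustics.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    n = int(clip_len * sr)
    assert len(audio) >= 2 * n, "video too short for two disjoint clips"
    # Pick two non-overlapping windows: one becomes the input,
    # the other the conditional hint from elsewhere in the video.
    start_a = rng.integers(0, len(audio) - 2 * n)
    start_b = rng.integers(start_a + n, len(audio) - n + 1)
    target = audio[start_a:start_a + n]     # original (in-scene) speech
    condition = audio[start_b:start_b + n]  # hint from another time step

    def enhance(x):
        # Toy "speech enhancement": zero out low-energy background samples.
        # A real system would use a pretrained enhancement model here.
        gate = np.abs(x) > 0.1 * np.abs(x).max()
        return x * gate

    clean_input = enhance(target)  # model must map clean_input -> target
    return clean_input, target, condition
```

A generative model trained on such triples implicitly learns to re-apply the scene's reverberation and ambient sounds to clean speech.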


Generalization to Out-of-Domain Video



Generalization to Non-Speech Input



Stylization Results with In-Domain Examples



Comparison to Uni-Modal Models


Soundscape Stylization by Conditional Speech De-Enhancement

We observe that the background noises and acoustic properties within a video tend to exhibit temporal coherence, especially when sound events occur repeatedly. Moreover, similar sound events often share semantically similar visual appearances. By providing the model with a conditional example from another time step in the video, the model can implicitly estimate the scene properties and transfer them to the input audio (e.g., the reverb and ambient background sounds). At test time, we provide a clip taken from a different scene as conditioning, forcing the model to match the style of the desired scene.
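At test time, stylization amounts to conditional sampling from the trained diffusion model. The sketch below is a toy illustration of that sampling loop, not the paper's system: `denoise_fn` and `dummy_denoiser` are hypothetical stand-ins for the trained latent diffusion model, and the linear noise schedule is a simplification of a real DDPM schedule.

```python
import numpy as np

def stylize(clean_speech, condition, denoise_fn, steps=50, rng=None):
    """Toy diffusion-style sampling loop: starting from Gaussian noise,
    iteratively denoise while conditioning on the input speech and a clip
    from the target scene. `denoise_fn(x_t, noise_level, clean, cond)`
    should return an estimate of the clean sample x_0.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    x = rng.normal(size=clean_speech.shape)  # start from pure noise
    for t in reversed(range(1, steps + 1)):
        noise_level = t / steps              # crude linear schedule
        x0_hat = denoise_fn(x, noise_level, clean_speech, condition)
        # Re-noise the estimate down to the next (lower) noise level;
        # at the final step this reduces to x = x0_hat.
        next_level = (t - 1) / steps
        x = (1 - next_level) * x0_hat + next_level * rng.normal(size=x.shape)
    return x

def dummy_denoiser(x_t, noise_level, clean, cond):
    # Hypothetical stand-in for the trained model: keep the speech content
    # and add a residual scaled by the conditional clip's energy.
    return clean + 0.1 * cond.std() * np.tanh(x_t)
```

Swapping in a conditional clip from a different scene at this stage is what lets the same trained model restyle speech for any target soundscape.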


Dataset



Paper and Supplementary Material


Tingle Li, Renhao Wang, Po-Yao Huang, Andrew Owens, Gopala Anumanchipalli

Self-Supervised Audio-Visual Soundscape Stylization

ECCV 2024

BibTeX

@inproceedings{li2024self,
  title={Self-Supervised Audio-Visual Soundscape Stylization},
  author={Li, Tingle and Wang, Renhao and Huang, Po-Yao and Owens, Andrew and Anumanchipalli, Gopala},
  booktitle={ECCV},
  year={2024}
}

Acknowledgements

We thank Alexei A. Efros, Justin Salamon, Bryan Russell, Hao-Wen Dong, and Ziyang Chen for their helpful discussions, and Baihe Huang for proofreading the paper. This work was funded in part by the Society of Hellman Fellows and a Sony Research Award.