Self-Supervised Audio-Visual Soundscape Stylization

Tingle Li
Renhao Wang
Po-Yao Huang
Andrew Owens
Gopala Anumanchipalli
[Paper]
[Code]




Audio-visual soundscape stylization. We learn through self-supervision to manipulate input speech so that it sounds as though it were recorded within a given scene.

Abstract

Speech sounds convey a great deal of information about the scenes in which they occur, which shape them through a variety of effects ranging from reverberation to additional ambient sounds. In this paper, we manipulate input speech to sound as though it were recorded within a different scene, given an audio-visual conditional example recorded from that scene. Our model learns through self-supervision, taking advantage of the fact that natural video contains recurring sound events and textures. We extract an audio clip from a video and apply speech enhancement. We then train a latent diffusion model to recover the original speech, using another audio-visual clip taken from elsewhere in the video as a conditional hint. Through this process, the model learns to transfer the conditional example's sound properties to the input speech. We show that our model can be successfully trained using unlabeled, in-the-wild videos, and that an additional visual signal can improve its sound prediction abilities.
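To make the training recipe concrete, here is a minimal PyTorch sketch of the self-supervised objective described above. It is an illustration under simplifying assumptions, not the authors' released code: ToyDenoiser stands in for the latent diffusion backbone, the latents are random placeholders for VAE-encoded spectrograms, the audio-visual hint is a single embedding vector, and the cosine noise schedule is a toy choice.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyDenoiser(nn.Module):
    """Stand-in for the latent diffusion backbone. It sees the noisy latent,
    the latent of the enhanced (clean) speech, an embedding of the
    audio-visual hint clip, and the noise level t."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim * 3 + 1, 256), nn.SiLU(), nn.Linear(256, dim))

    def forward(self, z_noisy, z_enh, av_hint, t):
        return self.net(torch.cat([z_noisy, z_enh, av_hint, t[:, None]], dim=-1))

def training_step(denoiser, z_orig, z_enh, av_hint):
    """One simplified epsilon-prediction diffusion step: noise the latent of
    the ORIGINAL clip (speech plus scene effects) and train the model to
    recover it, given the enhanced speech and a hint taken from elsewhere
    in the same video."""
    t = torch.rand(z_orig.shape[0])                  # noise level in [0, 1]
    alpha = torch.cos(t * torch.pi / 2)[:, None]     # toy cosine schedule
    sigma = torch.sin(t * torch.pi / 2)[:, None]
    eps = torch.randn_like(z_orig)
    z_noisy = alpha * z_orig + sigma * eps
    return F.mse_loss(denoiser(z_noisy, z_enh, av_hint, t), eps)

# Usage with random stand-in latents (in practice these would come from a
# VAE encoder over spectrograms and an audio-visual encoder over the hint):
denoiser = ToyDenoiser()
z_orig, z_enh, av_hint = (torch.randn(8, 64) for _ in range(3))
loss = training_step(denoiser, z_orig, z_enh, av_hint)
loss.backward()
```

Because the hint comes from the same video as the target, the only information the model can exploit from it is the shared scene style (reverberation, ambient texture), which is exactly what the task requires it to transfer.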


Generalization to Out-of-Domain Video







Generalization to Non-Speech Input







Stylization Results with In-Domain Examples




Comparison to Uni-Modal Models



Soundscape Stylization by Conditional Speech De-Enhancement

We observe that the background noises and acoustic properties within a video tend to exhibit temporal coherence, especially when sound events occur repeatedly. Moreover, similar sound events often share semantically similar visual appearances. By providing the model with a conditional example from another time step in the video, the model can implicitly estimate the scene properties and transfer them to the input audio (e.g., the reverb and ambient background sounds). At test time, we instead provide a clip taken from a different scene as conditioning, forcing the model to match the style of the desired scene, as sketched below.
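The following sketch illustrates this test-time swap of the conditioning source, reusing the hypothetical ToyDenoiser placeholder from the training sketch above. The deterministic DDIM-style sampler and the schedule are illustrative assumptions; only the interface matters: clean input speech plus an audio-visual embedding of the desired scene go in, a scene-styled latent comes out.

```python
import torch

@torch.no_grad()
def stylize(denoiser, z_speech, av_hint, steps=50):
    """Hypothetical deterministic (DDIM-like) sampler: start from noise and
    iteratively denoise, conditioned on the input speech latent and the
    audio-visual embedding of the desired scene's conditional clip."""
    z = torch.randn_like(z_speech)
    ts = torch.linspace(0.98, 0.0, steps + 1)   # avoid alpha = 0 at t = 1
    for i in range(steps):
        t, t_next = ts[i], ts[i + 1]
        alpha = torch.cos(t * torch.pi / 2)
        sigma = torch.sin(t * torch.pi / 2)
        eps = denoiser(z, z_speech, av_hint, t.expand(z.shape[0]))
        z0 = (z - sigma * eps) / alpha          # predicted clean latent
        z = (torch.cos(t_next * torch.pi / 2) * z0
             + torch.sin(t_next * torch.pi / 2) * eps)
    return z  # decode back to a waveform with the (assumed) VAE decoder
```

Since training only ever pairs hints with targets from the same video, conditioning on a clip from a different scene at test time is what turns the learned de-enhancement model into a style-transfer model.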


Dataset

Street
Train
Bus
Beach
Mall
Mountain
Market
Boat


[Download Link]


Paper and Supplementary Material


Tingle Li, Renhao Wang, Po-Yao Huang, Andrew Owens, Gopala Anumanchipalli
Self-Supervised Audio-Visual Soundscape Stylization
arXiv preprint

[Bibtex]


Acknowledgements

We thank Alexei A. Efros, Justin Salamon, Bryan Russell, Hao-Wen Dong, and Ziyang Chen for their helpful discussions, and Baihe Huang for proofreading the paper. This work was funded in part by the Society of Hellman Fellows and a Sony Research Award. This template was originally made by Phillip Isola and Richard Zhang for a colorful ECCV project; the code can be found here.