Audio-visual soundscape stylization. We learn through self-supervision to manipulate input speech such that it sounds as though it were recorded within a given scene.
Speech sounds convey a great deal of information about the scenes in which they are recorded, since a scene shapes the signal through effects ranging from reverberation to additional ambient sounds. In this paper, we manipulate input speech to sound as though it were recorded within a different scene, given an audio-visual conditional example recorded from that scene. Our model learns through self-supervision, taking advantage of the fact that natural video contains recurring sound events and textures. We extract an audio clip from a video and apply speech enhancement. We then train a latent diffusion model to recover the original speech, using another audio-visual clip taken from elsewhere in the video as a conditional hint. Through this process, the model learns to transfer the conditional example's sound properties to the input speech. We show that our model can be successfully trained using unlabeled, in-the-wild videos, and that an additional visual signal can improve its sound prediction abilities.
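
The abstract describes the training recipe only at a high level. The sketch below illustrates one way such a conditional latent-diffusion objective could be set up in PyTorch; the module names (AudioLatentEncoder, ConditionEncoder, DenoisingNet), tensor shapes, and the linear DDPM noise schedule are illustrative assumptions rather than the authors' implementation, which would use a pretrained audio autoencoder, a speech-enhancement front end, and a proper conditional U-Net.

```python
# Illustrative sketch only: module names, shapes, and schedule are assumptions,
# not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 1000                                           # number of diffusion steps (assumed)
betas = torch.linspace(1e-4, 0.02, T)              # standard linear DDPM noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

class AudioLatentEncoder(nn.Module):
    """Stand-in for a pretrained audio autoencoder mapping spectrograms to latents."""
    def __init__(self, in_ch=1, latent_ch=8):
        super().__init__()
        self.net = nn.Conv2d(in_ch, latent_ch, kernel_size=4, stride=4)

    def forward(self, spec):
        return self.net(spec)

class ConditionEncoder(nn.Module):
    """Stand-in that fuses the conditional clip's audio and visual features."""
    def __init__(self, audio_dim=128, video_dim=512, out_dim=256):
        super().__init__()
        self.proj = nn.Linear(audio_dim + video_dim, out_dim)

    def forward(self, cond_audio_feat, cond_video_feat):
        return self.proj(torch.cat([cond_audio_feat, cond_video_feat], dim=-1))

class DenoisingNet(nn.Module):
    """Stand-in denoiser; a real model would be a conditional U-Net."""
    def __init__(self, latent_ch=8, cond_dim=256):
        super().__init__()
        in_ch = 2 * latent_ch                      # noisy target latent + clean-speech latent
        self.cond_proj = nn.Linear(cond_dim, in_ch)
        self.net = nn.Conv2d(in_ch, latent_ch, kernel_size=3, padding=1)

    def forward(self, z, t, cond):
        # Broadcast the scene-hint vector over the latent map (t unused in this stand-in).
        return self.net(z + self.cond_proj(cond)[:, :, None, None])

def training_step(original_spec, enhanced_spec, cond_audio_feat, cond_video_feat,
                  vae, cond_enc, denoiser):
    """Self-supervised objective: denoise toward the original (scene-colored)
    speech, given the enhanced (clean) speech plus an audio-visual hint taken
    from elsewhere in the same video."""
    z0 = vae(original_spec)                        # target latent
    z_content = vae(enhanced_spec)                 # content (enhanced speech) latent
    cond = cond_enc(cond_audio_feat, cond_video_feat)
    t = torch.randint(0, T, (z0.shape[0],))
    eps = torch.randn_like(z0)
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    z_t = a.sqrt() * z0 + (1.0 - a).sqrt() * eps   # forward diffusion of the target
    eps_hat = denoiser(torch.cat([z_t, z_content], dim=1), t, cond)
    return F.mse_loss(eps_hat, eps)                # standard epsilon-prediction loss

# Toy run with random tensors to show the shapes fit together.
vae, cond_enc, denoiser = AudioLatentEncoder(), ConditionEncoder(), DenoisingNet()
loss = training_step(torch.randn(2, 1, 64, 64), torch.randn(2, 1, 64, 64),
                     torch.randn(2, 128), torch.randn(2, 512),
                     vae, cond_enc, denoiser)
loss.backward()
```
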
We observe that the background noises and acoustic properties within a video tend to exhibit temporal coherence, especially when sound events occur repeatedly. Moreover, similar sound events often share semantically similar visual appearances. By conditioning on an example from another time step in the video, the model implicitly learns to estimate the scene's properties (e.g., its reverberation and ambient background sounds) and transfer them to the input audio. At test time, we instead provide a clip taken from a different scene as conditioning, forcing the model to match the style of the desired scene.
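
As a concrete illustration of this pairing scheme, the short sketch below shows one way input and conditional clips could be sampled; the clip length, minimum temporal gap, and function names are hypothetical and not taken from the paper.

```python
import random

CLIP_SECONDS = 3.0       # assumed clip length
MIN_GAP_SECONDS = 5.0    # assumed minimum separation between input and hint

def sample_training_pair(video_duration):
    """Training: the input clip and the conditional hint come from different
    time steps of the same video, so the hint shares the scene's reverberation
    and ambient sounds without sharing the exact speech content."""
    assert video_duration >= CLIP_SECONDS + 2 * MIN_GAP_SECONDS
    t_input = random.uniform(0.0, video_duration - CLIP_SECONDS)
    while True:
        t_cond = random.uniform(0.0, video_duration - CLIP_SECONDS)
        if abs(t_cond - t_input) >= MIN_GAP_SECONDS:
            return t_input, t_cond

def make_test_pair(input_speech_clip, target_scene_clip):
    """Test time: the conditional hint is drawn from a different scene, so the
    model must re-style the input speech to match that scene."""
    return {"input": input_speech_clip, "condition": target_scene_clip}
```
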
[Figure panels: example scenes include street, train, bus, beach, mall, mountain, market, and boat.]
Acknowledgements |