Sounding that Object: Interactive Object-Aware Image to Audio Generation

ICML 2025

1 University of California, Berkeley    2 ByteDance

TL;DR: We interactively generate sounds specific to user-selected objects within complex visual scenes.

Teaser Image

Interactive object-aware audio generation. We generate sound aligned with specific visual objects in complex scenes. Users can select one or more objects in the scene using segmentation masks, and our model generates audio corresponding to the selected objects. Here, we show a busy street with multiple sound sources (left). After training, our model generates object-specific audio (right), such as crowd noise for people, engine sounds for cars, and blended audio when multiple objects are selected.

Abstract

Generating accurate sounds for complex audio-visual scenes is challenging, especially in the presence of multiple objects and sound sources. In this paper, we propose an interactive object-aware audio generation model that grounds sound generation in user-selected visual objects within images. Our method integrates object-centric learning into a conditional latent diffusion model, which learns to associate image regions with their corresponding sounds through multi-modal attention. At test time, our model employs image segmentation to allow users to interactively generate sounds at the object level. We theoretically validate that our attention mechanism functionally approximates test-time segmentation masks, ensuring the generated audio aligns with selected objects. Quantitative and qualitative evaluations show that our model outperforms baselines, achieving better alignment between objects and their associated sounds.

Compositional Sound Generation

Sound Adaptation to Visual Texture Changes

Generation Results with In-Domain Examples

Model Architecture

We encode the reference spectrogram via a pre-trained latent encoder. An image and a text prompt are processed by separate encoders, and their embeddings are fused using an attention mechanism to highlight relevant objects. We then feed these conditioning features and the noisy latent into a latent diffusion model to generate the object-specific audio. Finally, the latent decoder reconstructs the spectrogram, and a pre-trained HiFi-GAN vocoder generates the final audio waveform. At test time, we replace the learned attention with a user-provided segmentation mask, and the latent encoder for the reference spectrogram is not used.
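
To make the conditioning path concrete, below is a minimal PyTorch sketch of the idea described above, not the released implementation: text and image-patch embeddings are fused with multi-head attention, and at test time the learned attention over patches is swapped for a mask derived from the user's segmentation. The class name ObjectAwareConditioner, the tensor shapes, the 14x14 patch grid, and the way the fused tokens are handed to the diffusion backbone are all illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F


class ObjectAwareConditioner(nn.Module):
    """Fuses text-prompt and image-patch embeddings so the conditioning
    signal emphasizes the image regions relevant to the prompt."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, text_tokens, image_tokens, patch_mask=None):
        # text_tokens:  (B, T, D) prompt embeddings (used as queries)
        # image_tokens: (B, P, D) image-patch embeddings (keys / values)
        # patch_mask:   (B, P) optional 0/1 mask over patches; at test time it
        #               is derived from the user's segmentation mask and takes
        #               the place of the learned attention over patches.
        if patch_mask is not None:
            # Keep only the patches inside the selected object(s).
            gated = image_tokens * patch_mask.unsqueeze(-1)
            return torch.cat([text_tokens, gated], dim=1)
        # Mask-free path: let attention pick out the relevant patches.
        attended, _ = self.attn(text_tokens, image_tokens, image_tokens)
        return torch.cat([text_tokens, attended], dim=1)


if __name__ == "__main__":
    B, T, P, D = 1, 8, 196, 512           # dummy sizes: 14x14 patch grid
    conditioner = ObjectAwareConditioner(D)
    text_tokens = torch.randn(B, T, D)
    image_tokens = torch.randn(B, P, D)

    # Test time: downsample a user-drawn segmentation mask (H, W) to the
    # patch grid and flatten it into a per-patch selection mask.
    seg_mask = torch.zeros(224, 224)
    seg_mask[64:160, 64:160] = 1.0        # pretend the user selected this region
    patch_mask = F.interpolate(seg_mask[None, None], size=(14, 14)).flatten(1)
    cond_tokens = conditioner(text_tokens, image_tokens, (patch_mask > 0.5).float())

    # cond_tokens would then condition the latent diffusion model (e.g., via
    # its cross-attention layers) at each denoising step.
    print(cond_tokens.shape)  # torch.Size([1, 204, 512])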

BibTeX

@inproceedings{li2025sounding,
  title     = {Sounding that Object: Interactive Object-Aware Image to Audio Generation},
  author    = {Li, Tingle and Huang, Baihe and Zhuang, Xiaobin and Jia, Dongya and Chen, Jiawei and Wang, Yuping and Chen, Zhuo and Anumanchipalli, Gopala and Wang, Yuxuan},
  booktitle = {ICML},
  year      = {2025},
}