AV-NeRF: Learning Neural Fields for Real-World Audio-Visual Scene Synthesis

Abstract

Human perception of the complex world relies on a comprehensive analysis of multi-modal signals, and the co-occurrence of audio and video signals provides humans with rich cues. This paper focuses on novel audio-visual scene synthesis in the real world. Given a video recording of an audio-visual scene, the task is to synthesize new videos with spatial audio along arbitrary novel camera trajectories in that scene. Directly using a NeRF-based model for audio synthesis is insufficient because it lacks prior knowledge and acoustic supervision. To tackle these challenges, we first propose an acoustic-aware audio generation module that integrates our prior knowledge of audio propagation into NeRF, in which we associate audio generation with the 3D geometry of the visual environment. In addition, we propose a coordinate transformation module that expresses a viewing direction relative to the sound source. Such a direction transformation helps the model learn sound-source-centric acoustic fields. Moreover, we utilize a head-related impulse response function to synthesize pseudo binaural audio for data augmentation that strengthens training. We qualitatively and quantitatively demonstrate the advantage of our model on real-world audio-visual scenes. We refer interested readers to our video results for convincing comparisons.
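
The two geometric ideas named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a simplified 2D setting, and the function names (`relative_direction`, `convolve`, `pseudo_binaural`) and the toy impulse responses are hypothetical, chosen only for illustration. The first function expresses where a sound source lies relative to the camera's heading; the second convolves a mono signal with a left and a right impulse response to obtain a two-channel signal, which is the generic way head-related impulse responses are applied.

```python
import math


def relative_direction(camera_xy, camera_yaw, source_xy):
    """Return (distance, angle) of the sound source as seen from the camera.

    Assumption: a 2D scene with yaw measured counter-clockwise in radians.
    The angle is wrapped to [-pi, pi); 0 means the source is straight ahead
    and a positive value means it lies to the camera's left.
    """
    dx = source_xy[0] - camera_xy[0]
    dy = source_xy[1] - camera_xy[1]
    distance = math.hypot(dx, dy)
    bearing = math.atan2(dy, dx)  # world-frame direction to the source
    angle = (bearing - camera_yaw + math.pi) % (2 * math.pi) - math.pi
    return distance, angle


def convolve(signal, impulse_response):
    """Plain full-length discrete convolution of two lists of floats."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def pseudo_binaural(mono, hrir_left, hrir_right):
    """Filter a mono signal with per-ear impulse responses.

    The impulse responses passed in are placeholders supplied by the
    caller; measured head-related impulse responses would be used in practice.
    """
    return convolve(mono, hrir_left), convolve(mono, hrir_right)


if __name__ == "__main__":
    # Camera at the origin facing +x, source one unit to its left (+y).
    dist, ang = relative_direction((0.0, 0.0), 0.0, (0.0, 1.0))
    print(dist, ang)

    # Toy impulse responses: the right ear hears a delayed, quieter copy.
    left, right = pseudo_binaural([1.0, 0.0], [1.0], [0.0, 0.5])
    print(left, right)
```

Expressing the direction relative to the source rather than in world coordinates means that two camera poses with the same geometric relation to the source map to the same input, which is the intuition behind learning a source-centric field.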

arXiv information

Authors	Susan Liang, Chao Huang, Yapeng Tian, Anurag Kumar, Chenliang Xu
Published	2023-02-07 17:38:18+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
