Aligning Audio-Visual Joint Representations with an Agentic Workflow

Abstract

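Visual content and its accompanying audio signal naturally form a joint representation that benefits audio-visual (AV) applications.
Although prior work has developed a variety of AV representation learning frameworks, the importance of AV data alignment for achieving high-quality representations is usually overlooked.
We observe that audio signals may contain background noise interference.
In addition, the audio and video streams may be out of sync.
This loose data alignment limits representation quality and degrades application performance.
In this paper, we propose to improve AV joint representations from a data-centric perspective by aligning audio signals to the visual data.
Our alignment is carried out in an agentic workflow controlled by an LLM-based assistant named AVAgent.
For each input AV data pair, AVAgent uses a multimodal LLM to convert the audio and visual data into language descriptions separately (i.e., tool use).
AVAgent then reasons about whether the paired data is well aligned and, if needed, plans edits to the audio signal (i.e., planning).
Audio editing is executed through predefined actions that filter noise or augment the data.
Moreover, a VLM evaluates how well the modified audio signal matches the visual content and provides feedback to AVAgent (i.e., reflection).
The tool use, planning, and reflection steps run cyclically, forming an agentic workflow in which audio signals are gradually aligned to the visual content.
As a result, existing methods can directly use the AV data aligned by the agentic workflow to improve AV joint representations.
Experimental results demonstrate state-of-the-art performance of the proposed approach against previous baselines on diverse downstream tasks.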
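
The cycle of tool use, planning, editing, and reflection described in the abstract can be illustrated with a toy loop. This is only a minimal sketch under loud assumptions, not the authors' implementation: audio and video are replaced by plain lists of text labels, and `describe`, `plan`, `filter_noise`, `reflect`, and `align` are hypothetical names that stand in for the multimodal LLM captioner, the AVAgent planner, a predefined audio action, the VLM scorer, and the overall workflow.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Toy stand-ins (assumption): audio is a list of sound labels,
# video is a list of labels for what is visible on screen.
Labels = List[str]


def describe(items: Labels) -> str:
    """Tool use: stand-in for a multimodal LLM that captions one modality."""
    return ", ".join(sorted(set(items)))


def plan(audio_desc: str, visual_desc: str) -> Optional[str]:
    """Planning: name an edit action, or return None if the pair looks aligned."""
    audio = set(filter(None, audio_desc.split(", ")))
    visual = set(filter(None, visual_desc.split(", ")))
    return "filter_noise" if audio - visual else None


def filter_noise(audio: Labels, visual: Labels) -> Labels:
    """Action: drop the first sound label that has no visual counterpart."""
    for i, label in enumerate(audio):
        if label not in visual:
            return audio[:i] + audio[i + 1:]
    return audio


# Hypothetical registry of predefined audio-editing actions.
ACTIONS: Dict[str, Callable[[Labels, Labels], Labels]] = {
    "filter_noise": filter_noise,
}


def reflect(audio: Labels, visual: Labels) -> float:
    """Reflection: stand-in for a VLM score, the share of sounds that are visible."""
    if not audio:
        return 0.0
    return sum(label in visual for label in audio) / len(audio)


def align(audio: Labels, visual: Labels, max_rounds: int = 5) -> Tuple[Labels, List[float]]:
    """Run the describe -> plan -> edit -> reflect cycle until nothing improves."""
    history = [reflect(audio, visual)]
    for _ in range(max_rounds):
        action = plan(describe(audio), describe(visual))
        if action is None:
            break  # planner judges the pair to be aligned
        candidate = ACTIONS[action](audio, visual)
        score = reflect(candidate, visual)
        if score <= history[-1]:
            break  # feedback says the edit did not help, so stop
        audio = candidate
        history.append(score)
    return audio, history


aligned, history = align(["dog bark", "traffic", "wind"], ["dog bark"])
print(aligned, history)
```

In this toy run each round removes one unmatched sound label, so the reflection score rises step by step until the planner finds nothing left to edit. A real system would operate on waveforms and frames, and its actions and scoring would differ from these placeholders.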

Abstract (original)

Visual content and accompanied audio signals naturally formulate a joint representation to improve audio-visual (AV) related applications. While studies develop various AV representation learning frameworks, the importance of AV data alignment is usually undermined for achieving high-quality representation. We observe that an audio signal may contain background noise interference. Also, non-synchronization may appear between audio and video streams. These non-strict data alignment limits representation quality and downgrade application performance. In this paper, we propose to improve AV joint representations from a data-centric perspective by aligning audio signals to visual data. Our alignment is conducted in an agentic workflow controlled by an LLM-based assistant named AVAgent. For each input AV data pair, our AVAgent uses a multi-modal LLM to convert audio and visual data into language descriptions separately (i.e., tool use). Then, AVAgent reasons whether this paired data is aligned well and plans to edit the audio signal if needed (i.e., planning). The audio editing is executed by predefined actions that filter noise or augment data. Moreover, we use a VLM to evaluate how modified audio signals match the visual content and provide feedback to AVAgent (i.e., reflection). The tool use, planning, and reflection steps operate cyclically to become an agentic workflow where audio signals are gradually aligned to visual content. To this end, existing methods can directly leverage the aligned AV data via our agentic workflow to improve AV joint representations. The experimental results comprehensively demonstrate the state-of-the-art performance of the proposed approach against previous baselines in diverse downstream tasks.

arXiv information

Authors	Shentong Mo, Yibing Song
Published	2024-10-31 04:20:22+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
