Question-Aware Gaussian Experts for Audio-Visual Question Answering

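Summary

Audio-Visual Question Answering (AVQA) requires not only question-based multimodal reasoning but also precise temporal grounding that captures subtle dynamics for accurate prediction.
However, existing methods mostly use question information implicitly, which limits their focus on question-specific details.
In addition, most studies rely on uniform frame sampling, which can miss key frames relevant to the question.
Recent Top-K frame selection methods aim to address this, but because they select discrete frames they still overlook fine-grained temporal details.
This paper proposes QA-TIGER, a novel framework that explicitly incorporates question information and models continuous temporal dynamics.
The key idea is to use Gaussian-based modeling to adaptively focus on both consecutive and non-consecutive frames based on the question, while explicitly injecting question information and applying progressive refinement.
A Mixture of Experts (MoE) is used to flexibly implement multiple Gaussian models, activating temporal experts tailored to the question.
Extensive experiments on multiple AVQA benchmarks show that QA-TIGER consistently achieves state-of-the-art performance.
Code is available at https://aim-skku.github.io/QA-TIGER/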

Summary (original)

Audio-Visual Question Answering (AVQA) requires not only question-based multimodal reasoning but also precise temporal grounding to capture subtle dynamics for accurate prediction. However, existing methods mainly use question information implicitly, limiting focus on question-specific details. Furthermore, most studies rely on uniform frame sampling, which can miss key question-relevant frames. Although recent Top-K frame selection methods aim to address this, their discrete nature still overlooks fine-grained temporal details. This paper proposes QA-TIGER, a novel framework that explicitly incorporates question information and models continuous temporal dynamics. Our key idea is to use Gaussian-based modeling to adaptively focus on both consecutive and non-consecutive frames based on the question, while explicitly injecting question information and applying progressive refinement. We leverage a Mixture of Experts (MoE) to flexibly implement multiple Gaussian models, activating temporal experts specifically tailored to the question. Extensive experiments on multiple AVQA benchmarks show that QA-TIGER consistently achieves state-of-the-art performance. Code is available at https://aim-skku.github.io/QA-TIGER/
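The idea of weighting frames with a question-dependent mixture of Gaussians can be illustrated with a minimal sketch. This is not the QA-TIGER implementation: the function names, the normalized time axis, and the hand-set centers, widths, and router scores are all assumptions made for illustration. In a real model, those values would presumably be predicted from question-conditioned audio-visual features.

```python
import math


def gaussian_weights(num_frames, center, width):
    """Normalized Gaussian weights over frames.

    `center` and `width` live on a normalized time axis in [0, 1]
    (an assumption of this sketch, not taken from the paper).
    """
    positions = [(t + 0.5) / num_frames for t in range(num_frames)]
    raw = [math.exp(-0.5 * ((p - center) / width) ** 2) for p in positions]
    total = sum(raw)
    return [r / total for r in raw]


def softmax(scores):
    """Numerically stable softmax over a list of router scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_temporal_weights(num_frames, centers, widths, router_scores):
    """Combine one Gaussian per expert, gated by softmax router scores."""
    gates = softmax(router_scores)
    mixed = [0.0] * num_frames
    for gate, center, width in zip(gates, centers, widths):
        for t, g in enumerate(gaussian_weights(num_frames, center, width)):
            mixed[t] += gate * g
    return mixed


def pool_features(frame_features, weights):
    """Weighted sum of per-frame feature vectors."""
    dim = len(frame_features[0])
    return [
        sum(w * feat[d] for w, feat in zip(weights, frame_features))
        for d in range(dim)
    ]


# Hypothetical example: two experts attend to two separate moments,
# so the combined weights cover non-consecutive frames.
weights = mixture_temporal_weights(
    num_frames=8,
    centers=[0.2, 0.8],
    widths=[0.05, 0.1],
    router_scores=[1.0, 1.0],
)
```

Because each Gaussian is normalized and the gates sum to one, the combined weights form a distribution over all frames. Unlike Top-K selection, every frame keeps a nonzero, smoothly varying weight.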

arXiv information

Authors	Hongyeob Kim, Inyoung Jung, Dayoon Suh, Youjia Zhang, Sangmin Lee, Sungeun Hong
Published	2025-03-07 09:27:33+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
