CREMA: Multimodal Compositional Video Reasoning via Efficient Modular Adaptation and Fusion

Abstract

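Multimodal compositional reasoning approaches have made remarkable progress, but their flexibility and efficiency remain limited because they process fixed modality inputs while updating many model parameters.
This paper addresses these key challenges and proposes CREMA, an efficient, modular modality-fusion framework for injecting new modalities into video reasoning.
First, we leverage existing pre-trained models to augment a given video with multiple informative modalities (such as optical flow, 3D point clouds, and audio) without additional human annotation.
Next, we introduce a query transformer equipped with multiple parameter-efficient modules associated with each accessible modality.
It projects diverse modality features into the LLM token embedding space, allowing the model to integrate different data types for response generation.
Furthermore, we propose a fusion module designed to compress multimodal queries, preserving computational efficiency in the LLM while combining additional modalities.
We validate our method on video-3D, video-audio, and video-language reasoning tasks, achieving better or equivalent performance against strong multimodal LLMs such as BLIP-2, 3D-LLM, and SeViLA while using 96% fewer trainable parameters.
We provide extensive analyses of CREMA, including the impact of each modality on reasoning domains, the design of the fusion module, and example visualizations.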

Abstract (original)

Despite impressive advancements in multimodal compositional reasoning approaches, they are still limited in their flexibility and efficiency by processing fixed modality inputs while updating a lot of model parameters. This paper tackles these critical challenges and proposes CREMA, an efficient and modular modality-fusion framework for injecting any new modality into video reasoning. We first augment multiple informative modalities (such as optical flow, 3D point cloud, audio) from given videos without extra human annotation by leveraging existing pre-trained models. Next, we introduce a query transformer with multiple parameter-efficient modules associated with each accessible modality. It projects diverse modality features to the LLM token embedding space, allowing the model to integrate different data types for response generation. Furthermore, we propose a fusion module designed to compress multimodal queries, maintaining computational efficiency in the LLM while combining additional modalities. We validate our method on video-3D, video-audio, and video-language reasoning tasks and achieve better/equivalent performance against strong multimodal LLMs, including BLIP-2, 3D-LLM, and SeViLA while using 96% fewer trainable parameters. We provide extensive analyses of CREMA, including the impact of each modality on reasoning domains, the design of the fusion module, and example visualizations.
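
The abstract says the fusion module compresses multimodal queries so that the LLM's cost does not grow as modalities are added. The sketch below only illustrates that idea: it compares plain concatenation of per-modality query tokens with a fusion step that keeps the token count fixed. The function names, the toy vectors, and the use of element-wise averaging are all assumptions made for illustration; the abstract does not specify the actual fusion operator.

```python
# Hypothetical sketch: the names and the mean-pooling fusion are assumptions,
# not the operator described in the CREMA paper.
from typing import Dict, List

Vector = List[float]
QuerySet = List[Vector]  # one modality's query tokens: K vectors of size D


def concat_queries(per_modality: Dict[str, QuerySet]) -> QuerySet:
    """Baseline: concatenate all query tokens; length grows with modalities."""
    fused: QuerySet = []
    for name in sorted(per_modality):
        fused.extend(per_modality[name])
    return fused


def fuse_queries(per_modality: Dict[str, QuerySet]) -> QuerySet:
    """Compress to K tokens by averaging the i-th query across modalities."""
    sets = list(per_modality.values())
    if not sets or not sets[0]:
        raise ValueError("need at least one modality with at least one query")
    k, d = len(sets[0]), len(sets[0][0])
    if any(len(s) != k or any(len(v) != d for v in s) for s in sets):
        raise ValueError("all modalities must provide K queries of size D")
    return [
        [sum(s[i][j] for s in sets) / len(sets) for j in range(d)]
        for i in range(k)
    ]


# Toy inputs: three modalities, K = 2 queries each, D = 2 dimensions.
queries = {
    "video": [[1.0, 0.0], [0.0, 1.0]],
    "audio": [[3.0, 0.0], [0.0, 3.0]],
    "flow": [[2.0, 6.0], [6.0, 2.0]],
}
concatenated = concat_queries(queries)
fused = fuse_queries(queries)
print(len(concatenated), len(fused))
```

With concatenation the LLM would receive one block of query tokens per modality, whereas the fused version stays at K tokens regardless of how many modalities are supplied. A learned fusion would replace the fixed average used here.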

arXiv information

Authors	Shoubin Yu, Jaehong Yoon, Mohit Bansal
Published	2024-02-08 18:27:22+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
