RAVEN: Multitask Retrieval Augmented Vision-Language Learning

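Summary

Scaling large language models to encode all of the world's knowledge in their parameters is unsustainable and has raised resource barriers.
Retrieval-augmented generation (RAG) offers a potential solution, but its application to vision-language models (VLMs) remains underexplored.
Existing methods focus on models designed for a single task.
They are further limited by the need for resource-intensive pretraining, additional parameter requirements, unaddressed modality prioritization, and the lack of a clear benefit over non-retrieval baselines.
This paper introduces RAVEN, a multitask retrieval-augmented VLM framework that enhances a base VLM through efficient, task-specific fine-tuning.
By integrating retrieval-augmented samples without any additional retrieval-specific parameters, we show that the model acquires retrieval properties that are effective across multiple tasks.
Our results, together with extensive ablations across retrieved modalities for image captioning and VQA, show significant improvements over non-retrieved baselines: +1 CIDEr on MSCOCO, +4 CIDEr on NoCaps, and nearly +3% accuracy on specific VQA question types.
This underscores the efficacy of applying RAG approaches to VLMs and marks a step toward more efficient and accessible multimodal learning.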

Summary (original)

The scaling of large language models to encode all the world’s knowledge in model parameters is unsustainable and has exacerbated resource barriers. Retrieval-Augmented Generation (RAG) presents a potential solution, yet its application to vision-language models (VLMs) is underexplored. Existing methods focus on models designed for single tasks. Furthermore, they’re limited by the need for resource-intensive pretraining, additional parameter requirements, unaddressed modality prioritization and lack of clear benefit over non-retrieval baselines. This paper introduces RAVEN, a multitask retrieval-augmented VLM framework that enhances base VLMs through efficient, task-specific fine-tuning. By integrating retrieval-augmented samples without the need for additional retrieval-specific parameters, we show that the model acquires retrieval properties that are effective across multiple tasks. Our results and extensive ablations across retrieved modalities for the image captioning and VQA tasks indicate significant performance improvements compared to non-retrieved baselines: +1 CIDEr on MSCOCO, +4 CIDEr on NoCaps and nearly +3% accuracy on specific VQA question types. This underscores the efficacy of applying RAG approaches to VLMs, marking a stride toward more efficient and accessible multimodal learning.
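
The abstract describes augmenting inputs with retrieved samples without adding retrieval-specific parameters, but it does not say how retrieval or prompt construction is done. The following is only a minimal sketch of that general idea, not RAVEN's actual pipeline: the toy embeddings, the cosine-similarity retriever, the `context:` prefix, and the function names `retrieve_top_k` and `build_augmented_prompt` are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_top_k(query_vec, memory, k=2):
    """Return the captions of the k memory entries most similar to the query.

    `memory` is a list of (embedding, caption) pairs; this is a hypothetical
    stand-in for an external retrieval store.
    """
    ranked = sorted(memory, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return [caption for _, caption in ranked[:k]]


def build_augmented_prompt(task_prompt, retrieved_captions):
    """Prepend retrieved captions to the task prompt as plain text.

    The augmentation happens purely in the input, so the model itself needs
    no extra retrieval-specific parameters (the "context:" format is assumed).
    """
    context = " ".join(f"context: {c}" for c in retrieved_captions)
    return f"{context} {task_prompt}".strip()


# Toy memory with made-up 3-dimensional embeddings.
memory = [
    ([1.0, 0.0, 0.0], "a dog running on a beach"),
    ([0.9, 0.1, 0.0], "a puppy playing in the sand"),
    ([0.0, 1.0, 0.0], "a plate of pasta on a table"),
]
query = [1.0, 0.05, 0.0]  # stand-in for the embedding of the query image

captions = retrieve_top_k(query, memory, k=2)
prompt = build_augmented_prompt("what does the image describe?", captions)
print(prompt)
```

In this sketch the augmented prompt would be passed to the base VLM together with the image during fine-tuning and inference; a real system would replace the toy vectors with embeddings from an image-text encoder and the list scan with an approximate nearest-neighbor index.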

arXiv information

Authors	Varun Nagaraj Rao, Siddharth Choudhary, Aditya Deshpande, Ravi Kumar Satzoda, Srikar Appalaraju
Published	2024-06-27 13:08:35+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
