Browse and Concentrate: Comprehending Multimodal Content via prior-LLM Context Fusion

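Summary

With the rise of large language models (LLMs), multimodal large language models (MLLMs), which combine an LLM with pre-trained vision models, have recently shown strong performance across a wide range of vision-language tasks.
However, they fall short when the context contains multiple images.
The main reason is that the visual features of each image are encoded separately by frozen encoders before being fed to the LLM backbone, so they carry no awareness of the other images or of the multimodal instruction.
We call this problem prior-LLM modality isolation and propose a two-phase paradigm, browse-and-concentrate, that enables in-depth fusion of the multimodal context before the features are fed to the LLM.
The paradigm first "browses" the inputs to gather essential insights, then revisits them to "concentrate" on crucial details guided by those insights, yielding a more comprehensive understanding of the multimodal inputs.
We also develop training strategies aimed specifically at improving the understanding of multi-image inputs.
Our method substantially improves performance on 7 multi-image scenarios, raising average accuracy by 2.13% and 7.60% over strong MLLM baselines with 3B and 11B LLMs, respectively.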
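
As a rough illustration of the difference between isolated encoding and a two-pass "browse, then concentrate" scheme, the following toy sketch may help. It is not the authors' architecture: the function names (`encode_isolated`, `browse`, `concentrate`, `cross_attend`) are hypothetical, random arrays stand in for real encoder features, and plain untrained dot-product attention stands in for whatever learned fusion modules the paper uses.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attend(queries, context):
    """Single-head scaled dot-product attention of queries over context."""
    scale = np.sqrt(queries.shape[-1])
    weights = softmax(queries @ context.T / scale, axis=-1)
    return weights @ context


def encode_isolated(images, instruction):
    """Baseline: each image's features ignore the other images and the instruction."""
    return [img.copy() for img in images]


def browse(images, instruction):
    """Pass 1 (assumed form): skim all inputs, producing one coarse insight vector per image."""
    everything = np.concatenate(list(images) + [instruction], axis=0)
    # The mean feature of each image queries the full multimodal context.
    return [cross_attend(img.mean(axis=0, keepdims=True), everything) for img in images]


def concentrate(images, insights):
    """Pass 2 (assumed form): revisit each image, guided by the insights from all images."""
    guide = np.concatenate(insights, axis=0)
    # Residual update: original tokens plus what they read from the insights.
    return [img + cross_attend(img, guide) for img in images]


rng = np.random.default_rng(0)
dim = 8
image_a = rng.normal(size=(4, dim))      # 4 visual tokens for image A
image_b = rng.normal(size=(4, dim))      # 4 visual tokens for image B
instruction = rng.normal(size=(3, dim))  # 3 instruction tokens

fused = concentrate([image_a, image_b], browse([image_a, image_b], instruction))
print(fused[0].shape)
```

In this sketch the fused features of image A depend on image B and on the instruction, whereas the output of `encode_isolated` does not; that dependence is the property the summary describes as missing under prior-LLM modality isolation.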

Summary (original)

With the rise of Large Language Models (LLMs), Multimodal Large Language Models (MLLMs) that incorporate LLMs with pre-trained vision models have recently demonstrated impressive performance across diverse vision-language tasks. However, they fall short in comprehending context involving multiple images. A primary reason for this shortcoming is that the visual features for each image are encoded individually by frozen encoders before being fed into the LLM backbone, lacking awareness of other images and the multimodal instructions. We term this issue prior-LLM modality isolation and propose a two-phase paradigm, browse-and-concentrate, to enable in-depth multimodal context fusion prior to feeding the features into LLMs. This paradigm initially ‘browses’ through the inputs for essential insights, and then revisits the inputs to ‘concentrate’ on crucial details, guided by these insights, to achieve a more comprehensive understanding of the multimodal inputs. Additionally, we develop training strategies specifically to enhance the understanding of multi-image inputs. Our method markedly boosts the performance on 7 multi-image scenarios, yielding average accuracy gains of 2.13% and 7.60% against strong MLLM baselines with 3B and 11B LLMs, respectively.

arXiv information

Authors	Ziyue Wang, Chi Chen, Yiqi Zhu, Fuwen Luo, Peng Li, Ming Yan, Ji Zhang, Fei Huang, Maosong Sun, Yang Liu
Published	2024-02-19 14:59:07+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
