OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation

Summary

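The standard practice for developing contemporary MLLMs is to feed features from one or more vision encoders into the LLM and train with natural language supervision.
This work posits an overlooked opportunity: optimizing the intermediate LLM representations through a vision objective. In other words, natural language supervision alone is sub-optimal for the MLLM's visual understanding ability.
To that end, the authors propose OLA-VLM, which they describe as the first approach to distill knowledge from a set of target visual representations into the LLM's hidden representations.
First, they formulate the objective during the MLLM pretraining stage as a coupled optimization of predictive visual embedding and next text-token prediction.
Second, they investigate MLLMs trained solely with natural language supervision and identify a positive correlation between the quality of the visual representations inside these models and their downstream performance.
Moreover, probing OLA-VLM shows improved representation quality, which they attribute to the embedding optimization.
Third, they demonstrate that OLA-VLM outperforms the single-encoder and multi-encoder baselines, which they take as evidence that their approach is superior to explicitly feeding the corresponding features to the LLM.
In particular, OLA-VLM improves performance by an average margin of up to 2.5% on various benchmarks, with a notable improvement of 8.7% on the Depth task in CV-Bench.
The code is open-sourced at https://github.com/SHI-Labs/OLA-VLM.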
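The coupled objective mentioned in the summary can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of cosine distance as the embedding loss, and the `weight` parameter are all assumptions made for illustration, since the abstract only states that predictive visual embedding and next text-token prediction are optimized together.

```python
import math


def next_token_loss(logits, target_index):
    """Cross-entropy of a single next-token prediction from raw logits."""
    peak = max(logits)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target_index]


def embedding_loss(predicted, target):
    """Cosine distance between a predicted and a target visual embedding.

    Assumption: the abstract does not name the embedding loss; cosine
    distance is used here only as a simple stand-in.
    """
    dot = sum(p * t for p, t in zip(predicted, target))
    norm_p = math.sqrt(sum(p * p for p in predicted))
    norm_t = math.sqrt(sum(t * t for t in target))
    return 1.0 - dot / (norm_p * norm_t)


def coupled_loss(logits, target_index, predicted, target, weight=0.5):
    """Text loss plus a weighted auxiliary embedding loss (hypothetical weighting)."""
    text_term = next_token_loss(logits, target_index)
    visual_term = embedding_loss(predicted, target)
    return text_term + weight * visual_term
```

In this sketch, `predicted` stands for an embedding read out from an intermediate LLM layer and `target` for the corresponding output of a target vision encoder; how the paper actually produces and matches these embeddings is not described in the abstract.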

Abstract (original)

The standard practice for developing contemporary MLLMs is to feed features from vision encoder(s) into the LLM and train with natural language supervision. In this work, we posit an overlooked opportunity to optimize the intermediate LLM representations through a vision perspective (objective), i.e., solely natural language supervision is sub-optimal for the MLLM’s visual understanding ability. To that end, we propose OLA-VLM, the first approach distilling knowledge into the LLM’s hidden representations from a set of target visual representations. Firstly, we formulate the objective during the pretraining stage in MLLMs as a coupled optimization of predictive visual embedding and next text-token prediction. Secondly, we investigate MLLMs trained solely with natural language supervision and identify a positive correlation between the quality of visual representations within these models and their downstream performance. Moreover, upon probing our OLA-VLM, we observe improved representation quality owing to the embedding optimization. Thirdly, we demonstrate that our OLA-VLM outperforms the single and multi-encoder baselines, proving our approach’s superiority over explicitly feeding the corresponding features to the LLM. Particularly, OLA-VLM boosts performance by an average margin of up to 2.5% on various benchmarks, with a notable improvement of 8.7% on the Depth task in CV-Bench. Our code is open-sourced at https://github.com/SHI-Labs/OLA-VLM .

arXiv information

Authors	Jitesh Jain, Zhengyuan Yang, Humphrey Shi, Jianfeng Gao, Jianwei Yang
Published	2024-12-12 18:55:18+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
