ImageChain: Advancing Sequential Image-to-Text Reasoning in Multimodal Large Language Models

要約

画像のシーケンス上の推論は、マルチモーダルの大手言語モデル（MLLMS）にとって課題のままです。
最近のモデルは、トレーニング前にマルチイメージデータを組み込んでいますが、シーケンシャル構造を認識するのに苦労しており、多くの場合画像を独立して扱います。
このワークでは、視覚シーケンスをマルチターン会話としてモデル化することにより、画像データ上のシーケンシャルな推論機能を備えたMLLMを強化するフレームワークであるImageChainを紹介します。
ImageChainでは、画像は対応するテキストの説明とインターリーブして、時間的依存関係と物語の進行を明示的にキャプチャする制御された対話を形成します。
私たちの方法は、次のシーンの説明のタスクを最適化します。ここで、モデルは、前の視覚的およびテキストのキューに基づいて、今後のシーンのコンテキスト認識の説明を生成します。
私たちのアプローチは、次のシーンの説明タスクのパフォーマンスを向上させることを実証します – SIMRateで3.7％から19％への平均改善を達成します。
さらに、ImageChainは、コミックからロボット工学までのアプリケーションで、堅牢なゼロショットのドメイン外のパフォーマンスを実現します。
広範な実験では、マルチモーダルのマルチターン会話デザインでの命令調整が、静的画像の理解と一時的に認識される推論のギャップを埋めるための鍵であることを検証します。

要約(オリジナル)

Reasoning over sequences of images remains a challenge for multimodal large language models (MLLMs). While recent models incorporate multi-image data during pre-training, they still struggle to recognize sequential structures, often treating images independently. This work introduces ImageChain, a framework that enhances MLLMs with sequential reasoning capabilities over image data by modeling visual sequences as a multi-turn conversation. In ImageChain, images are interleaved with corresponding textual descriptions to form a controlled dialogue that explicitly captures temporal dependencies and narrative progression. Our method optimizes for the task of next-scene description, where the model generates a context-aware description of an upcoming scene based on preceding visual and textual cues. We demonstrate that our approach improves performance on the next-scene description task — achieving an average improvement from 3.7% to 19% in SimRate, a metric that quantifies semantic similarity to human-annotated ground truths. Moreover, ImageChain achieves robust zero-shot out-of-domain performance in applications ranging from comics to robotics. Extensive experiments validate that instruction-tuning in a multimodal, multi-turn conversation design is key to bridging the gap between static image understanding and temporally-aware reasoning.

arxiv情報

著者	Danae Sánchez Villegas,Ingo Ziegler,Desmond Elliott
発行日	2025-02-26 18:55:06+00:00
arxivサイト	arxiv_id(pdf)

提供元, 利用サービス

arxiv.jp, Google

ImageChain: Advancing Sequential Image-to-Text Reasoning in Multimodal Large Language Models

要約

要約(オリジナル)

arxiv情報

提供元, 利用サービス

最近の投稿

最近のコメント

アーカイブ

カテゴリー