Mitigating Visual Forgetting via Take-along Visual Conditioning for Multi-modal Long CoT Reasoning

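Summary

Recent advances in large language models (LLMs) have demonstrated enhanced reasoning capabilities, evolving from Chain-of-Thought (CoT) prompting to advanced, product-oriented solutions such as OpenAI o1.
While re-implementing this model, we noticed that in multimodal tasks requiring visual input (e.g., geometry problems), multimodal LLMs (MLLMs) struggle to maintain focus on the visual information.
To investigate this, we ablate the image input during long-chain reasoning.
Specifically, we truncate the reasoning process midway and then re-complete it with the input image removed.
On MathVista's test-hard subset we observe only a ~2% drop in accuracy, which shows that the model's textual output dominates the subsequent reasoning process.
Motivated by this, we propose Take-along Visual Conditioning (TVC), a strategy that shifts the image input to critical reasoning stages and compresses redundant visual tokens via dynamic pruning.
This methodology helps the model retain attention to the visual components throughout reasoning.
Our approach achieves state-of-the-art performance on average across five mathematical reasoning benchmarks (+3.4% vs. the previous SOTA), demonstrating the effectiveness of TVC in enhancing multimodal reasoning systems.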
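
The ablation described in the summary (truncate the reasoning midway, then re-complete it with the image removed) can be sketched as follows. This is only an illustrative sketch, not the authors' code: the `Prompt` container, the function names, the step-based truncation, and the default 50% cut point are all assumptions made for this example, and the actual model call is left out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Prompt:
    """Hypothetical container for what would be fed back to an MLLM."""
    question: str
    image: Optional[str]          # image path or ID; None means the image was removed
    partial_reasoning: List[str]  # reasoning steps kept after truncation


def truncate_reasoning(steps: List[str], fraction: float) -> List[str]:
    """Keep the first `fraction` of the reasoning steps (rounded down)."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    return steps[: int(len(steps) * fraction)]


def build_ablation_prompt(
    question: str,
    image: str,
    steps: List[str],
    fraction: float = 0.5,   # assumed cut point; the summary only says "midway"
    keep_image: bool = False,
) -> Prompt:
    """Build the input for re-completing a truncated reasoning chain.

    With keep_image=False the image is dropped, so any continuation can
    rely only on the question and the text generated so far.
    """
    kept = truncate_reasoning(steps, fraction)
    return Prompt(question, image if keep_image else None, kept)


# Hypothetical example data, for illustration only.
steps = [
    "Read the triangle's side lengths from the figure.",
    "Apply the Pythagorean theorem.",
    "Compute the hypotenuse.",
    "State the answer.",
]
prompt = build_ablation_prompt("What is the hypotenuse?", "triangle.png", steps)
print(prompt.image)                   # None
print(len(prompt.partial_reasoning))  # 2
```

Comparing accuracy between continuations built with `keep_image=True` and `keep_image=False` is one way such an ablation could be measured; a small gap would indicate that the continuation depends mostly on the text already generated.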

Summary (original)

Recent advancements in Large Language Models (LLMs) have demonstrated enhanced reasoning capabilities, evolving from Chain-of-Thought (CoT) prompting to advanced, product-oriented solutions like OpenAI o1. During our re-implementation of this model, we noticed that in multimodal tasks requiring visual input (e.g., geometry problems), Multimodal LLMs (MLLMs) struggle to maintain focus on the visual information, in other words, MLLMs suffer from a gradual decline in attention to visual information as reasoning progresses, causing text-over-relied outputs. To investigate this, we ablate image inputs during long-chain reasoning. Concretely, we truncate the reasoning process midway, then re-complete the reasoning process with the input image removed. We observe only a ~2% accuracy drop on MathVista’s test-hard subset, revealing the model’s textual outputs dominate the following reasoning process. Motivated by this, we propose Take-along Visual Conditioning (TVC), a strategy that shifts image input to critical reasoning stages and compresses redundant visual tokens via dynamic pruning. This methodology helps the model retain attention to the visual components throughout the reasoning. Our approach achieves state-of-the-art performance on average across five mathematical reasoning benchmarks (+3.4% vs previous sota), demonstrating the effectiveness of TVC in enhancing multimodal reasoning systems.

arXiv information

Authors	Hai-Long Sun, Zhun Sun, Houwen Peng, Han-Jia Ye
Published	2025-03-17 16:45:12+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
