ViC-Bench: Benchmarking Visual-Interleaved Chain-of-Thought Capability in MLLMs with Free-Style Intermediate State Representations

Summary

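Visual-Interleaved Chain-of-Thought (VI-CoT) enables MLLMs to continually update their understanding and decisions based on step-wise intermediate visual states (IVS), much as a human would. It has shown impressive success across a variety of tasks, which has in turn driven advances in related benchmarks.
Despite this promising progress, current benchmarks provide models with relatively fixed IVS rather than free-style IVS.
More importantly, existing benchmarks neglect to systematically explore the factors through which IVS affect untamed reasoning performance.
To address these gaps, we introduce a specialized benchmark called ViC-Bench, consisting of four representative tasks: maze navigation, jigsaw puzzle, embodied long-horizon planning, and complex counting.
To systematically examine VI-CoT capability, we propose a thorough evaluation suite that incorporates a progressive three-stage strategy with targeted new metrics.
In addition, we establish an Incremental Prompting Information Injection (IPII) strategy to explore the prompting factors for VI-CoT through ablation.
We conduct extensive evaluations of 18 advanced MLLMs, revealing key insights into their VI-CoT capability.
The proposed benchmark is publicly available on Hugging Face.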

Summary (original)

Visual-Interleaved Chain-of-Thought (VI-CoT) enables MLLMs to continually update their understanding and decisions based on step-wise intermediate visual states (IVS), much like a human would, which demonstrates impressive success in various tasks, thereby leading to emerged advancements in related benchmarks. Despite promising progress, current benchmarks provide models with relatively fixed IVS, rather than free-style IVS, which might forcibly distort the original thinking trajectories, failing to evaluate their intrinsic reasoning capabilities. More importantly, existing benchmarks neglect to systematically explore the impact factors that IVS would impart to untamed reasoning performance. To tackle the above gaps, we introduce a specialized benchmark termed ViC-Bench, consisting of four representative tasks: maze navigation, jigsaw puzzle, embodied long-horizon planning, and complex counting, where each task has a dedicated free-style IVS generation pipeline supporting function calls. To systematically examine VI-CoT capability, we propose a thorough evaluation suite incorporating a progressive three-stage strategy with targeted new metrics. Besides, we establish an Incremental Prompting Information Injection (IPII) strategy to ablatively explore the prompting factors for VI-CoT. We extensively conduct evaluations for 18 advanced MLLMs, revealing key insights into their VI-CoT capability. Our proposed benchmark is publicly open at Huggingface.

arXiv information

Authors	Xuecheng Wu, Jiaxing Liu, Danlei Huang, Xiaoyu Li, Yifan Wang, Chen Chen, Liya Ma, Xuezhi Cao, Junxiao Xue
Published	2025-06-12 17:01:00+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
