Visual Scratchpads: Enabling Global Reasoning in Vision

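Summary

Modern vision models have achieved remarkable success on benchmarks where local features provide critical information about the target.
There is now growing interest in solving tasks that require more global reasoning, where local features offer no significant information.
These tasks recall the connectivity tasks discussed by Minsky and Papert in 1969, which exposed the limitations of the perceptron model and contributed to the first AI winter.
In this paper, we revisit such tasks by introducing four global visual benchmarks involving pathfinding and mazes. We show the following.
(1) Today's large vision models largely surpass the expressivity limitations of the early models, yet they still struggle with learning efficiency.
We put forward the notion of "globality degree" to understand this limitation.
(2) The picture changes and global reasoning becomes feasible once "visual scratchpads" are introduced.
Like the text scratchpads and chain-of-thought used in language models, visual scratchpads help break global tasks down into simpler ones.
(3) Finally, some scratchpads are better than others. In particular, "inductive scratchpads", which take each step relying on less information, afford better out-of-distribution generalization and succeed at smaller model sizes.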
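As a rough illustration of why connectivity is a global task and how step-by-step intermediate frames can reduce it to local updates, here is a minimal toy sketch. It is not the paper's benchmark, model, or scratchpad format: the grid encoding, the two example mazes, and the names `flood_fill_frames` and `is_connected` are all assumptions made for this example. The two mazes differ in a single cell far from the start and the goal, so no small patch around either endpoint reveals the answer, whereas each frame follows from the previous one by inspecting only the 4-neighbours of cells already reached.

```python
def flood_fill_frames(grid, start):
    """Yield successive frames: the set of cells reached so far.

    grid  : list of equal-length strings, '#' is a wall, '.' is free.
    start : (row, col) of a free cell.
    Each new frame is built from the previous one by looking only at the
    4-neighbours of the most recently reached cells (a local update).
    """
    rows, cols = len(grid), len(grid[0])
    reached = {start}
    frontier = [start]
    yield set(reached)
    while frontier:
        next_frontier = []
        for r, c in frontier:
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                inside = 0 <= nr < rows and 0 <= nc < cols
                if inside and grid[nr][nc] != "#" and (nr, nc) not in reached:
                    reached.add((nr, nc))
                    next_frontier.append((nr, nc))
        frontier = next_frontier
        if frontier:  # emit a frame only when something new was reached
            yield set(reached)


def is_connected(grid, start, goal):
    """Return (answer, number_of_frames) for the toy connectivity task."""
    frames = list(flood_fill_frames(grid, start))
    return goal in frames[-1], len(frames)


# Hypothetical mazes: identical except for the cell at row 2, column 2.
MAZE_CLOSED = ["..#.",
               "..#.",
               "..#."]
MAZE_OPEN = ["..#.",
             "..#.",
             "...."]

if __name__ == "__main__":
    for name, maze in (("closed", MAZE_CLOSED), ("open", MAZE_OPEN)):
        answer, steps = is_connected(maze, (0, 0), (0, 3))
        print(name, answer, steps)
```

In this sketch the final answer is read off the last frame, while every intermediate frame depends only on its predecessor and the maze; this is meant only to convey the flavour of decomposing a global question into simple steps.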

Summary (original)

Modern vision models have achieved remarkable success in benchmarks where local features provide critical information about the target. There is now a growing interest in solving tasks that require more global reasoning, where local features offer no significant information. These tasks are reminiscent of the connectivity tasks discussed by Minsky and Papert in 1969, which exposed the limitations of the perceptron model and contributed to the first AI winter. In this paper, we revisit such tasks by introducing four global visual benchmarks involving path findings and mazes. We show that: (1) although today’s large vision models largely surpass the expressivity limitations of the early models, they still struggle with the learning efficiency; we put forward the ‘globality degree’ notion to understand this limitation; (2) we then demonstrate that the picture changes and global reasoning becomes feasible with the introduction of ‘visual scratchpads’; similarly to the text scratchpads and chain-of-thoughts used in language models, visual scratchpads help break down global tasks into simpler ones; (3) we finally show that some scratchpads are better than others, in particular, ‘inductive scratchpads’ that take steps relying on less information afford better out-of-distribution generalization and succeed for smaller model sizes.

arXiv information

Authors	Aryo Lotfi, Enrico Fini, Samy Bengio, Moin Nabi, Emmanuel Abbe
Published	2024-10-10 17:44:13+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
