Dual Thinking and Logical Processing — Are Multi-modal Large Language Models Closing the Gap with Human Vision ?

Summary

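The dual thinking framework considers fast, intuitive processing and slower, logical processing.
Perceiving dual thinking in vision requires images for which intuitive and logical processing yield different inferences.
The authors introduce an adversarial dataset that provides evidence for the dual thinking framework in human vision and also helps in studying the qualitative behavior of deep learning models.
The evidence underscores the importance of shape in identifying instances in human vision.
Their psychophysical studies show that multiple inferences occur in rapid succession, and error analysis shows that stopping visual processing early can cause relevant information to be missed.
The study shows that segmentation models lack an understanding of sub-structures, as indicated by errors in the position and number of sub-components.
In addition, the similarity between errors made by these models and errors made in intuitive human processing indicates that the models address only the intuitive side of human vision.
In contrast, multi-modal LLMs, including open-source models, show tremendous progress on the errors made in intuitive processing.
These models perform better on images that require logical reasoning and show recognition of sub-components.
However, that improvement does not yet match the improvement achieved on errors in intuitive processing.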

Abstract (original)

The dual thinking framework considers fast, intuitive processing and slower, logical processing. The perception of dual thinking in vision requires images where inferences from intuitive and logical processing differ. We introduce an adversarial dataset to provide evidence for the dual thinking framework in human vision, which also aids in studying the qualitative behavior of deep learning models. The evidence underscores the importance of shape in identifying instances in human vision. Our psychophysical studies show the presence of multiple inferences in rapid succession, and analysis of errors shows the early stopping of visual processing can result in missing relevant information. Our study shows that segmentation models lack an understanding of sub-structures, as indicated by errors related to the position and number of sub-components. Additionally, the similarity in errors made by models and intuitive human processing indicates that models only address intuitive thinking in human vision. In contrast, multi-modal LLMs, including open-source models, demonstrate tremendous progress on errors made in intuitive processing. The models have improved performance on images that require logical reasoning and show recognition of sub-components. However, they have not matched the performance improvements made on errors in intuitive processing.

arXiv information

Authors	Kailas Dayanandan, Nikhil Kumar, Anand Sinha, Brejesh Lall
Published	2025-01-30 14:37:55+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
