Decomposing Complex Visual Comprehension into Atomic Visual Skills for Vision Language Models

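Summary

Recent vision-language models (VLMs) have demonstrated impressive multimodal comprehension and reasoning capabilities, yet they often struggle with trivially simple visual tasks.
In this work, we focus on the domain of basic 2D Euclidean geometry and systematically categorize the fundamental, indivisible visual perception skills, which we call atomic visual skills.
We then introduce the Atomic Visual Skills Dataset (AVSD) for evaluating VLMs on these atomic visual skills.
Using AVSD, we benchmark state-of-the-art VLMs and find that they struggle with these tasks, even though the tasks are trivial for adult humans.
Our findings highlight the need for purpose-built datasets to train and evaluate VLMs on atomic, rather than composite, visual perception tasks.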

Summary (original)

Recent Vision-Language Models (VLMs) have demonstrated impressive multimodal comprehension and reasoning capabilities, yet they often struggle with trivially simple visual tasks. In this work, we focus on the domain of basic 2D Euclidean geometry and systematically categorize the fundamental, indivisible visual perception skills, which we refer to as atomic visual skills. We then introduce the Atomic Visual Skills Dataset (AVSD) for evaluating VLMs on the atomic visual skills. Using AVSD, we benchmark state-of-the-art VLMs and find that they struggle with these tasks, despite being trivial for adult humans. Our findings highlight the need for purpose-built datasets to train and evaluate VLMs on atomic, rather than composite, visual perception tasks.

arXiv information

Authors	Hyunsik Chae, Seungwoo Yoon, Jaden Park, Chloe Yewon Chun, Yongin Cho, Mu Cai, Yong Jae Lee, Ernest K. Ryu
Published	2025-05-26 14:09:24+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
