Evaluating Multiview Object Consistency in Humans and Image Models

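Summary

We introduce a benchmark that directly evaluates the alignment between human observers and vision models on a 3D shape inference task.
We use an experimental design from the cognitive sciences that requires zero-shot visual inference about object shape: given a set of images, participants identify which ones contain the same or different objects despite considerable viewpoint variation.
We draw on a diverse range of images that includes common objects (e.g., chairs) as well as abstract shapes (i.e., procedurally generated "nonsense" objects).
After constructing more than 2,000 unique image sets, we administer these tasks to human participants and collect 35,000 trials of behavioral data from more than 500 participants.
These data include explicit choice behavior as well as intermediate measures such as reaction time and gaze data.
We then evaluate the performance of common vision models (e.g., DINOv2, MAE, CLIP).
We find that humans outperform all models by a wide margin.
Using a multi-scale evaluation approach, we identify underlying similarities and differences between models and humans: although human and model performance are correlated, humans allocate more time and processing to challenging trials.
All images, data, and code are available via the project page.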

Summary (original)

We introduce a benchmark to directly evaluate the alignment between human observers and vision models on a 3D shape inference task. We leverage an experimental design from the cognitive sciences which requires zero-shot visual inferences about object shape: given a set of images, participants identify which contain the same/different objects, despite considerable viewpoint variation. We draw from a diverse range of images that include common objects (e.g., chairs) as well as abstract shapes (i.e., procedurally generated 'nonsense' objects). After constructing over 2000 unique image sets, we administer these tasks to human participants, collecting 35K trials of behavioral data from over 500 participants. This includes explicit choice behaviors as well as intermediate measures, such as reaction time and gaze data. We then evaluate the performance of common vision models (e.g., DINOv2, MAE, CLIP). We find that humans outperform all models by a wide margin. Using a multi-scale evaluation approach, we identify underlying similarities and differences between models and humans: while human-model performance is correlated, humans allocate more time/processing on challenging trials. All images, data, and code can be accessed via our project page.
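
The abstract does not say how a vision model's answer is read out on a same/different trial. As a minimal sketch, assume each image in a trial has already been mapped to an embedding vector by some model, and assume the model's "choice" is the image whose embedding is least similar, on average, to the others. The function names `cosine` and `odd_one_out` and the toy vectors below are hypothetical and are not taken from the paper or its code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def odd_one_out(embeddings):
    """Return the index of the embedding least similar, on average, to the rest.

    Assumed readout rule for illustration only; the source does not
    describe the rule used in the benchmark.
    """
    n = len(embeddings)
    mean_sim = []
    for i in range(n):
        sims = [cosine(embeddings[i], embeddings[j]) for j in range(n) if j != i]
        mean_sim.append(sum(sims) / len(sims))
    return min(range(n), key=mean_sim.__getitem__)


# Toy trial with made-up embeddings: the first two vectors stand in for two
# views of one object, the third for a different object.
trial = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1], [0.0, 1.0, 0.3]]
choice = odd_one_out(trial)
print(choice)
```

Under this assumed rule, a model's accuracy is the fraction of trials on which its choice matches the image that truly shows a different object, which can then be compared with human accuracy on the same trials.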

arXiv information

Authors	Tyler Bonnen, Stephanie Fu, Yutong Bai, Thomas O’Connell, Yoni Friedman, Nancy Kanwisher, Joshua B. Tenenbaum, Alexei A. Efros
Published	2024-09-09 17:59:13+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
