Probing the Mid-level Vision Capabilities of Self-Supervised Learning

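Summary

Mid-level vision capabilities, such as generic object localization and 3D geometric understanding, are not only fundamental to human vision but also important for many real-world applications of computer vision.
These abilities emerge with minimal supervision in the early stages of human visual development.
Despite their importance, current self-supervised learning (SSL) approaches are designed and evaluated mainly for high-level recognition tasks, and their mid-level vision capabilities remain largely unexamined.
This study introduces a suite of benchmark protocols for systematically evaluating mid-level vision capabilities and presents a comprehensive, controlled evaluation of 22 prominent SSL models across 8 mid-level vision tasks.
The experiments reveal a weak correlation between performance on mid-level tasks and performance on high-level tasks.
We also identify several SSL methods whose performance is highly imbalanced between mid-level and high-level capabilities, as well as several that excel at both.
In addition, we investigate key factors that contribute to mid-level vision performance, such as pretraining objectives and network architectures.
Our study offers a holistic and timely view of what SSL models have learned, complementing existing work that focuses mainly on high-level vision tasks.
We hope our findings will guide future SSL research toward benchmarking models on mid-level vision tasks as well as high-level ones.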

Abstract (original)

Mid-level vision capabilities – such as generic object localization and 3D geometric understanding – are not only fundamental to human vision but are also crucial for many real-world applications of computer vision. These abilities emerge with minimal supervision during the early stages of human visual development. Despite their significance, current self-supervised learning (SSL) approaches are primarily designed and evaluated for high-level recognition tasks, leaving their mid-level vision capabilities largely unexamined. In this study, we introduce a suite of benchmark protocols to systematically assess mid-level vision capabilities and present a comprehensive, controlled evaluation of 22 prominent SSL models across 8 mid-level vision tasks. Our experiments reveal a weak correlation between mid-level and high-level task performance. We also identify several SSL methods with highly imbalanced performance across mid-level and high-level capabilities, as well as some that excel in both. Additionally, we investigate key factors contributing to mid-level vision performance, such as pretraining objectives and network architectures. Our study provides a holistic and timely view of what SSL models have learned, complementing existing research that primarily focuses on high-level vision tasks. We hope our findings guide future SSL research to benchmark models not only on high-level vision tasks but on mid-level as well.
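The abstract reports a weak correlation between mid-level and high-level task performance but does not say here how that correlation is measured. As a minimal sketch, assuming each model has one aggregate mid-level score and one high-level score, the relationship could be summarized with Pearson and Spearman coefficients. The function names, the model labels, and every number below are hypothetical placeholders for illustration, not results from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks of the values; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to the end of the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


if __name__ == "__main__":
    # Hypothetical placeholder scores (NOT from the paper): one entry per
    # model, with an aggregate mid-level and an aggregate high-level score.
    scores = {
        "model_a": (0.61, 0.74),
        "model_b": (0.55, 0.81),
        "model_c": (0.70, 0.69),
        "model_d": (0.48, 0.77),
    }
    mid = [m for m, _ in scores.values()]
    high = [h for _, h in scores.values()]
    print(f"Pearson:  {pearson(mid, high):.3f}")
    print(f"Spearman: {spearman(mid, high):.3f}")
```

Spearman is included alongside Pearson because a rank-based coefficient is less sensitive to the differing scales of task metrics; whether the authors use either coefficient is an assumption.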

arXiv information

Authors	Xuweiyi Chen, Markus Marks, Zezhou Cheng
Published	2024-12-16 18:55:09+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
