Perceive, Ground, Reason, and Act: A Benchmark for General-purpose Visual Representation

Summary

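Unlike the human visual system, current computer vision models have not yet achieved general-purpose visual understanding.
Existing efforts to build a general vision model are limited in the scope of tasks they assess and offer no overarching framework for performing those tasks holistically.
We present a new comprehensive benchmark, General-purpose Visual Understanding Evaluation (G-VUE), which covers the full spectrum of visual cognitive abilities through four functional domains: Perceive, Ground, Reason, and Act.
The four domains are embodied in 11 carefully curated tasks, ranging from 3D reconstruction to visual reasoning and manipulation.
Along with the benchmark, we provide a general encoder-decoder framework that allows arbitrary visual representations to be evaluated on all 11 tasks.
Using this framework, we evaluate a variety of pre-trained visual representations and observe that (1) Transformer-based visual backbones generally outperform CNN-based backbones on G-VUE, and (2) visual representations from vision-language pre-training are superior to those from vision-only pre-training across visual tasks.
With G-VUE, we provide a holistic evaluation standard to motivate research toward building general-purpose visual systems by obtaining more general-purpose visual representations.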

Summary (original)

Current computer vision models, unlike the human visual system, cannot yet achieve general-purpose visual understanding. Existing efforts to create a general vision model are limited in the scope of assessed tasks and offer no overarching framework to perform them holistically. We present a new comprehensive benchmark, General-purpose Visual Understanding Evaluation (G-VUE), covering the full spectrum of visual cognitive abilities with four functional domains: Perceive, Ground, Reason, and Act. The four domains are embodied in 11 carefully curated tasks, from 3D reconstruction to visual reasoning and manipulation. Along with the benchmark, we provide a general encoder-decoder framework to allow for the evaluation of arbitrary visual representation on all 11 tasks. We evaluate various pre-trained visual representations with our framework and observe that (1) Transformer-based visual backbone generally outperforms CNN-based backbone on G-VUE, (2) visual representations from vision-language pre-training are superior to those with vision-only pre-training across visual tasks. With G-VUE, we provide a holistic evaluation standard to motivate research toward building general-purpose visual systems via obtaining more general-purpose visual representations.

arXiv information

Authors	Jiangyong Huang, William Yicheng Zhu, Baoxiong Jia, Zan Wang, Xiaojian Ma, Qing Li, Siyuan Huang
Published	2022-11-28 15:06:07+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
