Can Vision-Language Models Think from a First-Person Perspective?

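Summary

Vision-language models (VLMs) have recently shown promising results on traditional downstream tasks.
Evaluation studies have emerged to assess their capabilities, but most focus on the third-person perspective, and only a few address specific tasks from the first-person perspective.
However, the ability of VLMs to "think" from a first-person perspective, an important property for advancing autonomous agents and robotics, remains largely unexplored.
To fill this research gap, we introduce EgoThink, a new visual question-answering benchmark covering six core capabilities with twelve detailed dimensions.
The benchmark is built from clips selected from egocentric videos and contains manually annotated question-answer pairs that include first-person information.
To assess VLMs comprehensively, we evaluate eighteen popular VLMs on EgoThink.
In addition, given the open-ended format of the answers, we use GPT-4 as an automatic judge to compute single-answer grading.
Experimental results show that although GPT-4V leads in many dimensions, all evaluated VLMs still have considerable room for improvement on first-person perspective tasks.
Meanwhile, increasing the number of trainable parameters has the largest effect on model performance on EgoThink.
In conclusion, EgoThink is a valuable addition to existing evaluation benchmarks for VLMs and provides an indispensable resource for future research in embodied artificial intelligence and robotics.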

Summary (original)

Vision-language models (VLMs) have recently shown promising results in traditional downstream tasks. Evaluation studies have emerged to assess their abilities, with the majority focusing on the third-person perspective, and only a few addressing specific tasks from the first-person perspective. However, the capability of VLMs to ‘think’ from a first-person perspective, a crucial attribute for advancing autonomous agents and robotics, remains largely unexplored. To bridge this research gap, we introduce EgoThink, a novel visual question-answering benchmark that encompasses six core capabilities with twelve detailed dimensions. The benchmark is constructed using selected clips from egocentric videos, with manually annotated question-answer pairs containing first-person information. To comprehensively assess VLMs, we evaluate eighteen popular VLMs on EgoThink. Moreover, given the open-ended format of the answers, we use GPT-4 as the automatic judge to compute single-answer grading. Experimental results indicate that although GPT-4V leads in numerous dimensions, all evaluated VLMs still possess considerable potential for improvement in first-person perspective tasks. Meanwhile, enlarging the number of trainable parameters has the most significant impact on model performance on EgoThink. In conclusion, EgoThink serves as a valuable addition to existing evaluation benchmarks for VLMs, providing an indispensable resource for future research in the realm of embodied artificial intelligence and robotics.
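
To make the single-answer grading step more concrete, here is a minimal sketch, not the authors' implementation. It assumes a judge that replies with free text containing a numeric score; the prompt wording, the 0 to 1 scale, the `Score:` tag, the sample fields, and every function name below are assumptions made for illustration. The judge is passed in as a plain function, so the sketch runs without calling any model API.

```python
import re
from collections import defaultdict
from typing import Callable, Dict, List, Optional

# Hypothetical prompt template; the benchmark's actual judge prompt is not given here.
JUDGE_TEMPLATE = (
    "You are grading an answer to a first-person visual question.\n"
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Model answer: {answer}\n"
    "Reply with 'Score: <number between 0 and 1>'."
)


def build_judge_prompt(question: str, reference: str, answer: str) -> str:
    """Fill the (assumed) template for one question-answer pair."""
    return JUDGE_TEMPLATE.format(
        question=question, reference=reference, answer=answer
    )


def parse_score(verdict: str) -> Optional[float]:
    """Extract a score in [0, 1] from the judge's reply, or None if absent."""
    match = re.search(r"Score:\s*([0-9]*\.?[0-9]+)", verdict, re.IGNORECASE)
    if match is None:
        return None
    score = float(match.group(1))
    return score if 0.0 <= score <= 1.0 else None


def grade(
    samples: List[Dict[str, str]], judge: Callable[[str], str]
) -> Dict[str, float]:
    """Grade each answer on its own and average the scores per dimension."""
    per_dimension: Dict[str, List[float]] = defaultdict(list)
    for sample in samples:
        prompt = build_judge_prompt(
            sample["question"], sample["reference"], sample["answer"]
        )
        score = parse_score(judge(prompt))
        if score is not None:  # skip replies the parser cannot read
            per_dimension[sample["dimension"]].append(score)
    return {dim: sum(vals) / len(vals) for dim, vals in per_dimension.items()}
```

In practice `judge` would wrap a call to the judging model. Because each answer is scored independently against its reference, the per-dimension averages can be compared directly across the evaluated VLMs.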

arXiv information

Authors	Sijie Cheng, Zhicheng Guo, Jingwen Wu, Kechen Fang, Peng Li, Huaping Liu, Yang Liu
Published	2023-11-27 07:44:25+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
