AlanaVLM: A Multimodal Embodied AI Foundation Model for Egocentric Video Understanding

Abstract

AI personal assistants deployed via robots or wearables require embodied understanding to collaborate with humans effectively. However, current Vision-Language Models (VLMs) primarily focus on third-person view videos, neglecting the richness of egocentric perceptual experience. To address this gap, we propose three key contributions. First, we introduce the Egocentric Video Understanding Dataset (EVUD) for training VLMs on video captioning and question answering tasks specific to egocentric videos. Second, we present AlanaVLM, a 7B parameter VLM trained using parameter-efficient methods on EVUD. Finally, we evaluate AlanaVLM’s capabilities on OpenEQA, a challenging benchmark for embodied video question answering. Our model achieves state-of-the-art performance, outperforming open-source models including strong Socratic models using GPT-4 as a planner by 3.6%. Additionally, we outperform Claude 3 and Gemini Pro Vision 1.0 and showcase competitive results compared to Gemini Pro 1.5 and GPT-4V, even surpassing the latter in spatial reasoning. This research paves the way for building efficient VLMs that can be deployed in robots or wearables, leveraging embodied video understanding to collaborate seamlessly with humans in everyday tasks, contributing to the next generation of Embodied AI.

arXiv information

Authors	Alessandro Suglia, Claudio Greco, Katie Baker, Jose L. Part, Ioannis Papaioannou, Arash Eshghi, Ioannis Konstas, Oliver Lemon
Published	2024-06-21 09:53:41+00:00

Source and services used

arxiv.jp, Google
