Pretrained Language Models as Visual Planners for Human Assistance

Summary

[Title] Pretrained Language Models as Visual Planners for Human Assistance

[Summary]
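– To make progress toward multi-modal AI assistants that can guide users through complex multi-step goals, the paper proposes the task of Visual Planning for Assistance (VPA).
– Given a goal briefly described in natural language (e.g., 'make a shelf') and a video of the user's progress so far, VPA aims to produce a plan, i.e., a sequence of actions (such as 'sand shelf', 'paint shelf') that achieves the goal.
– This requires handling long video histories and arbitrarily complex dependencies among actions. To address these challenges, VPA is decomposed into video action segmentation and forecasting.
– The forecasting step is formulated as a multi-modal sequence modeling problem, and the paper presents the Visual Language Model based Planner (VLaMP), which uses a pre-trained LM as the sequence model.
– VLaMP is shown to perform significantly better than the baselines on all metrics that evaluate the generated plan. Extensive ablations also isolate the value of language pre-training, visual observations, and goal information.
– The authors plan to release the data, model, and code to enable future research on visual planning for assistance.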
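
The two-stage decomposition described above (segment the video history into actions, then forecast the remaining actions with a sequence model) can be illustrated with a small, self-contained sketch. This is not VLaMP: the pre-trained LM is replaced here by a toy bigram counter, the segmentation stage is replaced by merging runs of per-frame labels, and every name (`Segment`, `segment_video`, `BigramPlanner`, `plan_for_assistance`) as well as the 'cut wood' and 'assemble shelf' actions are assumptions made only for illustration.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass
class Segment:
    """One segmented step of the video history (hypothetical structure)."""
    action: str   # action label assigned to this segment
    start: float  # start time in seconds
    end: float    # end time in seconds


def segment_video(frame_labels: Sequence[str], fps: float = 1.0) -> List[Segment]:
    """Stand-in for video action segmentation: merge runs of equal frame labels."""
    segments: List[Segment] = []
    for i, label in enumerate(frame_labels):
        if segments and segments[-1].action == label:
            segments[-1].end = (i + 1) / fps  # extend the current segment
        else:
            segments.append(Segment(label, i / fps, (i + 1) / fps))
    return segments


class BigramPlanner:
    """Toy goal-conditioned sequence model standing in for a pre-trained LM."""

    def __init__(self) -> None:
        # (goal, previous action) -> counts of the action that followed
        self.counts: Dict[Tuple[str, str], Counter] = defaultdict(Counter)

    def fit(self, demos: Sequence[Tuple[str, Sequence[str]]]) -> None:
        """Count action transitions in (goal, action sequence) demonstrations."""
        for goal, actions in demos:
            prev = "<start>"
            for action in actions:
                self.counts[(goal, prev)][action] += 1
                prev = action

    def forecast(self, goal: str, history: Sequence[str], horizon: int) -> List[str]:
        """Greedily predict up to `horizon` future actions after the history."""
        prev = history[-1] if history else "<start>"
        plan: List[str] = []
        for _ in range(horizon):
            options = self.counts.get((goal, prev))
            if not options:
                break  # no known continuation for this goal and action
            prev = options.most_common(1)[0][0]
            plan.append(prev)
        return plan


def plan_for_assistance(goal: str, frame_labels: Sequence[str],
                        planner: BigramPlanner, horizon: int = 2) -> List[str]:
    """Segment the history, then forecast the next actions toward the goal."""
    history = [seg.action for seg in segment_video(frame_labels)]
    return planner.forecast(goal, history, horizon)
```

With a single demonstration `("make a shelf", ["cut wood", "assemble shelf", "sand shelf", "paint shelf"])` passed to `fit`, calling `plan_for_assistance("make a shelf", ["cut wood", "cut wood", "assemble shelf"], planner)` yields `['sand shelf', 'paint shelf']`. In the paper's setting the sequence model would also consume visual observations, which this sketch leaves out.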

Summary (original)

To make progress towards multi-modal AI assistants which can guide users to achieve complex multi-step goals, we propose the task of Visual Planning for Assistance (VPA). Given a goal briefly described in natural language, e.g., ‘make a shelf’, and a video of the user’s progress so far, the aim of VPA is to obtain a plan, i.e., a sequence of actions such as ‘sand shelf’, ‘paint shelf’, etc., to achieve the goal. This requires assessing the user’s progress from the untrimmed video, and relating it to the requirements of underlying goal, i.e., relevance of actions and ordering dependencies amongst them. Consequently, this requires handling long video history, and arbitrarily complex action dependencies. To address these challenges, we decompose VPA into video action segmentation and forecasting. We formulate the forecasting step as a multi-modal sequence modeling problem and present Visual Language Model based Planner (VLaMP), which leverages pre-trained LMs as the sequence model. We demonstrate that VLaMP performs significantly better than baselines w.r.t all metrics that evaluate the generated plan. Moreover, through extensive ablations, we also isolate the value of language pre-training, visual observations, and goal information on the performance. We will release our data, model, and code to enable future research on visual planning for assistance.

arXiv information

Authors	Dhruvesh Patel, Hamid Eghbalzadeh, Nitin Kamra, Michael Louis Iuzzolino, Unnat Jain, Ruta Desai
Published	2023-04-27 21:31:05+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, OpenAI
