ActionCOMET: A Zero-shot Approach to Learn Image-specific Commonsense Concepts about Actions

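Summary

Humans can observe the various actions other humans perform (in person, or in videos and images) and draw a wide range of inferences about them that go beyond what is visually perceptible.
Such inferences include determining the aspects of the world that make an action possible (e.g., liquid objects can be poured), predicting how the world changes as a result of the action (e.g., potatoes turn golden and crispy after frying), identifying the high-level goals associated with the action (e.g., beating eggs to make an omelet), and reasoning about actions that may precede or follow the current one (e.g., cracking eggs before whisking, or draining pasta after boiling).
Similar reasoning ability is highly desirable in autonomous systems that assist with everyday tasks.
To that end, we propose a multi-modal task for learning the aforementioned concepts about actions performed in images.
We develop a dataset of 8.5k images and 59.3k inferences about actions grounded in those images, collected from an annotated cooking-video dataset.
We propose ActionCOMET, a zero-shot framework for discerning the knowledge present in language models that is specific to the provided visual input.
We present baseline results for ActionCOMET on the collected dataset and compare them with the performance of the best existing VQA approaches.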

Summary (original)

Humans observe various actions being performed by other humans (physically or in videos/images) and can draw a wide range of inferences about it beyond what they can visually perceive. Such inferences include determining the aspects of the world that make action execution possible (e.g. liquid objects can undergo pouring), predicting how the world will change as a result of the action (e.g. potatoes being golden and crispy after frying), high-level goals associated with the action (e.g. beat the eggs to make an omelet) and reasoning about actions that possibly precede or follow the current action (e.g. crack eggs before whisking or draining pasta after boiling). Similar reasoning ability is highly desirable in autonomous systems that would assist us in performing everyday tasks. To that end, we propose a multi-modal task to learn aforementioned concepts about actions being performed in images. We develop a dataset consisting of 8.5k images and 59.3k inferences about actions grounded in those images, collected from an annotated cooking-video dataset. We propose ActionCOMET, a zero-shot framework to discern knowledge present in language models specific to the provided visual input. We present baseline results of ActionCOMET over the collected dataset and compare them with the performance of the best existing VQA approaches.

arXiv information

Authors	Shailaja Keyur Sampat, Yezhou Yang, Chitta Baral
Published	2024-10-17 15:22:57+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
