Synthetic Vision: Training Vision-Language Models to Understand Physics

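Summary

Physical reasoning, which involves interpreting, understanding, and predicting how objects behave in dynamic environments, remains a significant challenge for current Vision-Language Models (VLMs).
In this work, we propose two methods that use simulated data to strengthen the physical reasoning capabilities of VLMs.
First, we fine-tune a pre-trained VLM on question-answer (QA) pairs generated from simulations relevant to physical reasoning tasks.
Second, we introduce Physics Context Builders (PCBs), specialized VLMs fine-tuned to produce scene descriptions enriched with physical properties and processes.
During physical reasoning tasks, these PCBs can be used to supply context to a Large Language Model (LLM), improving its performance.
We evaluate both approaches on multiple benchmarks, including CLEVRER and Falling Tower, a new stability detection QA dataset that contains both simulated and real-world scenes.
We demonstrate that a small QA fine-tuned VLM can significantly outperform larger state-of-the-art foundation models.
We also show that integrating PCBs improves the performance of foundation LLMs on physical reasoning tasks.
Using the real-world scenes in the Falling Tower dataset, we also validate the robustness of both approaches in Sim2Real transfer.
Our results highlight the utility of simulated data for building learning systems capable of advanced physical reasoning.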

Abstract (original)

Physical reasoning, which involves the interpretation, understanding, and prediction of object behavior in dynamic environments, remains a significant challenge for current Vision-Language Models (VLMs). In this work, we propose two methods to enhance VLMs’ physical reasoning capabilities using simulated data. First, we fine-tune a pre-trained VLM using question-answer (QA) pairs generated from simulations relevant to physical reasoning tasks. Second, we introduce Physics Context Builders (PCBs), specialized VLMs fine-tuned to create scene descriptions enriched with physical properties and processes. During physical reasoning tasks, these PCBs can be leveraged as context to assist a Large Language Model (LLM) to improve its performance. We evaluate both of our approaches using multiple benchmarks, including a new stability detection QA dataset called Falling Tower, which includes both simulated and real-world scenes, and CLEVRER. We demonstrate that a small QA fine-tuned VLM can significantly outperform larger state-of-the-art foundational models. We also show that integrating PCBs boosts the performance of foundational LLMs on physical reasoning tasks. Using the real-world scenes from the Falling Tower dataset, we also validate the robustness of both approaches in Sim2Real transfer. Our results highlight the utility that simulated data can have in the creation of learning systems capable of advanced physical reasoning.
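As a rough illustration of the first approach, the sketch below generates stability QA pairs from a toy 2D "simulation" and attaches a textual scene description of the kind a PCB might produce. Everything here is an assumption made for illustration: the `Block` class, the static center-of-mass stability test, the description format, and the prompt layout are hypothetical and are not taken from the paper, whose simulator and data format are not described in the abstract.

```python
from dataclasses import dataclass
import random


@dataclass
class Block:
    x: float      # horizontal center of the block
    width: float  # block width; equal height and density are assumed


def is_stable(tower):
    """Simplified static check (assumption, not the paper's simulator).

    For every interface, the center of mass of all blocks above it must
    lie over the contact region between the two touching blocks.
    """
    for i in range(len(tower) - 1):
        support, top = tower[i], tower[i + 1]
        above = tower[i + 1:]
        mass = sum(b.width for b in above)  # mass proportional to width
        com = sum(b.x * b.width for b in above) / mass
        lo = max(support.x - support.width / 2, top.x - top.width / 2)
        hi = min(support.x + support.width / 2, top.x + top.width / 2)
        if lo >= hi or not (lo <= com <= hi):
            return False
    return True


def describe(tower):
    """Hypothetical stand-in for a PCB-style physical scene description."""
    parts = [
        f"block {i} centered at x={b.x:.2f} with width {b.width:.2f}"
        for i, b in enumerate(tower)
    ]
    return f"A tower of {len(tower)} blocks, bottom to top: " + "; ".join(parts) + "."


def build_prompt(description, question):
    """Place the scene description before the question as LLM context."""
    return f"Scene description: {description}\nQuestion: {question}\nAnswer:"


def make_qa_pair(tower):
    """Create one QA training example with a label from the toy check."""
    question = "Will the tower remain standing?"
    return {
        "prompt": build_prompt(describe(tower), question),
        "question": question,
        "answer": "yes" if is_stable(tower) else "no",
    }


def random_tower(rng, n_blocks=3, max_offset=0.6):
    """Sample a tower of unit-width blocks with random horizontal offsets."""
    return [Block(x=rng.uniform(-max_offset, max_offset), width=1.0)
            for _ in range(n_blocks)]


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(make_qa_pair(random_tower(rng)))
```

In a real pipeline, the labels would come from a physics simulator and the inputs would be rendered images rather than coordinates; the sketch only shows how simulated ground truth can be turned into QA pairs and context strings.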

arXiv information

Authors	Vahid Balazadeh, Mohammadmehdi Ataei, Hyunmin Cheong, Amir Hosein Khasahmadi, Rahul G. Krishnan
Published	2024-12-11 18:40:16+00:00

Source and services used

arxiv.jp, Google
