When Words Outperform Vision: VLMs Can Self-Improve Via Text-Only Training For Human-Centered Decision Making

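Summary

Embodied decision-making is fundamental for AI agents operating in real-world environments.
Visual Language Models (VLMs) have advanced this capability, but they still struggle with complex decisions, particularly in human-centered situations that require deep reasoning about human needs and values.
In this study, we systematically evaluate open-source VLMs on multimodal human-centered decision-making tasks.
We find that LLMs receiving only textual descriptions unexpectedly outperform VLM counterparts of similar scale that process the actual images, suggesting that visual alignment may hinder VLM abilities.
To address this challenge, we propose a novel text-only training approach that uses synthesized textual data.
This method strengthens the language components of VLMs and transfers the learned abilities to multimodal inference, eliminating the need for expensive image-text paired data.
Furthermore, we show that VLMs can achieve substantial performance gains through self-improvement, using training data generated by their LLM counterparts rather than relying on larger teacher models such as GPT-4.
Our findings establish a more efficient and scalable approach to enhancing the human-centered decision-making capabilities of VLMs, and open new avenues for optimizing VLMs through self-improvement mechanisms.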

Summary (original)

Embodied decision-making is fundamental for AI agents operating in real-world environments. While Visual Language Models (VLMs) have advanced this capability, they still struggle with complex decisions, particularly in human-centered situations that require deep reasoning about human needs and values. In this study, we systematically evaluate open-sourced VLMs on multimodal human-centered decision-making tasks. We find that LLMs receiving only textual descriptions unexpectedly outperform their VLM counterparts of similar scale that process actual images, suggesting that visual alignment may hinder VLM abilities. To address this challenge, we propose a novel text-only training approach with synthesized textual data. This method strengthens VLMs’ language components and transfers the learned abilities to multimodal inference, eliminating the need for expensive image-text paired data. Furthermore, we show that VLMs can achieve substantial performance gains through self-improvement, using training data generated by their LLM counterparts rather than relying on larger teacher models like GPT-4. Our findings establish a more efficient and scalable approach to enhancing VLMs’ human-centered decision-making capabilities, opening new avenues for optimizing VLMs through self-improvement mechanisms.

arXiv information

Authors	Zhe Hu, Jing Li, Yu Yin
Published	2025-03-21 09:25:23+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
