Vision Language Models as Values Detectors

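Summary

Large language models that integrate textual and visual inputs have opened up new possibilities for interpreting complex data.
Despite their remarkable ability to generate coherent, contextually relevant text from visual stimuli, how well these models align with human perception when identifying relevant elements in images requires further investigation.
This paper investigates the alignment between state-of-the-art LLMs and human annotators in detecting relevant elements within home environment scenarios.
We created a set of 12 images depicting various domestic scenarios and asked 14 annotators to identify the key element in each image.
We then compared these human responses with the outputs of five different LLMs, including GPT-4o and four LLaVA variants.
Our findings reveal varying degrees of alignment; LLaVA 34B performed best but still scored low.
However, an analysis of the results highlights the models' potential to detect value-laden elements in images, suggesting that, with improved training and refined prompts, LLMs could enhance applications in social robotics, assistive technologies, and human-computer interaction by providing deeper insights and more contextually relevant responses.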

Summary (original)

Large Language Models integrating textual and visual inputs have introduced new possibilities for interpreting complex data. Despite their remarkable ability to generate coherent and contextually relevant text based on visual stimuli, the alignment of these models with human perception in identifying relevant elements in images requires further exploration. This paper investigates the alignment between state-of-the-art LLMs and human annotators in detecting elements of relevance within home environment scenarios. We created a set of twelve images depicting various domestic scenarios and enlisted fourteen annotators to identify the key element in each image. We then compared these human responses with outputs from five different LLMs, including GPT-4o and four LLaVA variants. Our findings reveal a varied degree of alignment, with LLaVA 34B showing the highest performance but still scoring low. However, an analysis of the results highlights the models’ potential to detect value-laden elements in images, suggesting that with improved training and refined prompts, LLMs could enhance applications in social robotics, assistive technologies, and human-computer interaction by providing deeper insights and more contextually relevant responses.

arXiv information

Authors	Giulio Antonio Abbo, Tony Belpaeme
Published	2025-01-07 17:37:57+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
