Are Multimodal Large Language Models Pragmatically Competent Listeners in Simple Reference Resolution Tasks?

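Summary

This paper investigates the linguistic abilities of multimodal large language models (MLLMs) in reference resolution tasks that use simple yet abstract visual stimuli, such as color patches and color grids.
The task may not seem challenging for today's language models, since it is straightforward for human dyads, but the authors consider it a highly relevant probe of the pragmatic capabilities of MLLMs.
The results and analyses suggest that basic pragmatic capabilities, such as context-dependent interpretation of color descriptions, remain a major challenge for state-of-the-art MLLMs.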

Abstract (original)

We investigate the linguistic abilities of multimodal large language models in reference resolution tasks featuring simple yet abstract visual stimuli, such as color patches and color grids. Although the task may not seem challenging for today’s language models, being straightforward for human dyads, we consider it to be a highly relevant probe of the pragmatic capabilities of MLLMs. Our results and analyses indeed suggest that basic pragmatic capabilities, such as context-dependent interpretation of color descriptions, still constitute major challenges for state-of-the-art MLLMs.

arXiv information

Authors	Simeon Junker, Manar Ali, Larissa Koch, Sina Zarrieß, Hendrik Buschmeier
Published	2025-06-13 14:09:48+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
