Visual Contexts Clarify Ambiguous Expressions: A Benchmark Dataset

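Summary

For models to interact effectively with humans in real-world scenarios, the ability to perform complex reasoning across multimodal inputs is essential.
Advances in vision-language models have substantially improved performance on tasks that require processing explicit and direct textual inputs, such as Visual Question Answering (VQA) and Visual Grounding (VG).
However, improving models' ability to understand nuanced and ambiguous forms of communication has received less attention.
This poses a critical challenge, because human language in real-world interactions often conveys hidden intentions whose accurate interpretation depends on context.
To address this gap, we propose VAGUE, a multimodal benchmark consisting of 3.9K indirect human utterances paired with corresponding scenes.
In addition, we contribute a model-based pipeline for generating prompt-solution pairs from input images.
Our work aims to examine more deeply the ability of models to understand indirect communication and to contribute to the development of models capable of more refined, human-like interaction.
Extensive evaluation of multiple VLMs shows that mainstream models still struggle with indirect communication when it requires complex linguistic and visual reasoning.
The code and data are released at https://github.com/Hazel-Heejeong-Nam/VAGUE.git.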

Summary (original)

The ability to perform complex reasoning across multimodal inputs is essential for models to effectively interact with humans in real-world scenarios. Advancements in vision-language models have significantly improved performance on tasks that require processing explicit and direct textual inputs, such as Visual Question Answering (VQA) and Visual Grounding (VG). However, less attention has been given to improving the model capabilities to comprehend nuanced and ambiguous forms of communication. This presents a critical challenge, as human language in real-world interactions often conveys hidden intentions that rely on context for accurate interpretation. To address this gap, we propose VAGUE, a multimodal benchmark comprising 3.9K indirect human utterances paired with corresponding scenes. Additionally, we contribute a model-based pipeline for generating prompt-solution pairs from input images. Our work aims to delve deeper into the ability of models to understand indirect communication and seeks to contribute to the development of models capable of more refined and human-like interactions. Extensive evaluation on multiple VLMs reveals that mainstream models still struggle with indirect communication when required to perform complex linguistic and visual reasoning. We release our code and data at https://github.com/Hazel-Heejeong-Nam/VAGUE.git.

arXiv information

Authors	Heejeong Nam,Jinwoo Ahn
Published	2024-11-21 14:01:42+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
