An Examination of the Compositionality of Large Generative Vision-Language Models

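Abstract

The success of large language models (LLMs) has led to a surge of generative vision-language models (GVLMs) built through multimodal instruction tuning.
This tuning recipe differs substantially from conventional contrastive vision-language learning.
However, because existing evaluation metrics and benchmarks focus mainly on contrastive models such as CLIP, the performance of GVLMs on multimodal compositional reasoning remains largely unexplored.
In this paper, we examine candidate evaluation metrics for GVLMs and hypothesize that generative score methods are suitable for evaluating compositionality.
In addition, current benchmarks tend to prioritize syntactic correctness over semantics.
The morphological bias present in these benchmarks can be exploited by GVLMs, which leads to ineffective evaluations.
To address this, we define a MorphoBias Score that quantifies the morphological bias and propose a novel LLM-based strategy to calibrate it.
We also add a challenging task that evaluates the robustness of GVLMs against their inherent inclination toward syntactic correctness.
We combine the calibrated dataset and the task into a new benchmark, the MOrphological De-biased Benchmark (MODE).
Our study provides the first unbiased benchmark for the compositionality of GVLMs and facilitates future research in this direction.
We will release our code and datasets.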

Abstract (original)

With the success of Large Language Models (LLMs), a surge of Generative Vision-Language Models (GVLMs) has been constructed via multimodal instruction tuning. The tuning recipe deviates substantially from the common contrastive vision-language learning. However, the performance of GVLMs in multimodal compositional reasoning remains largely unexplored, as existing evaluation metrics and benchmarks focus predominantly on assessing contrastive models like CLIP. In this paper, we examine the potential evaluation metrics to assess the GVLMs and hypothesize that generative score methods are suitable for evaluating compositionality. In addition, current benchmarks tend to prioritize syntactic correctness over semantics. The presence of morphological bias in these benchmarks can be exploited by GVLMs, leading to ineffective evaluations. To combat this, we define a MorphoBias Score to quantify the morphological bias and propose a novel LLM-based strategy to calibrate the bias. Moreover, a challenging task is added to evaluate the robustness of GVLMs against an inherent inclination toward syntactic correctness. We include the calibrated dataset and the task in a new benchmark, namely the MOrphological De-biased Benchmark (MODE). Our study provides the first unbiased benchmark for the compositionality of GVLMs, facilitating future research in this direction. We will release our code and datasets.

arXiv information

Authors	Teli Ma, Rong Li, Junwei Liang
Published	2023-08-21 06:50:29+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
