The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio

Summary

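Recent advances in large multimodal models (LMMs) have substantially improved performance across a wide range of tasks, alongside ongoing efforts to integrate additional modalities such as video and audio.
However, most existing LMMs remain vulnerable to hallucinations, that is, discrepancies between the factual multimodal input and the generated text output, which limits their applicability in many real-world scenarios.
This paper presents the first systematic investigation of hallucinations in LMMs involving the three most common modalities: language, visual, and audio.
Our study reveals two key contributors to hallucinations: overreliance on unimodal priors and spurious inter-modality correlations.
To address these challenges, we introduce the benchmark The Curse of Multi-Modalities (CMM), which comprehensively evaluates hallucinations in LMMs and provides a detailed analysis of the underlying issues.
Our findings highlight key vulnerabilities, including imbalances in modality integration and biases from training data, underscoring the need for balanced cross-modal learning and enhanced hallucination mitigation strategies.
Based on our observations and findings, we suggest potential research directions that could enhance the reliability of LMMs.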

Summary (original)

Recent advancements in large multimodal models (LMMs) have significantly enhanced performance across diverse tasks, with ongoing efforts to further integrate additional modalities such as video and audio. However, most existing LMMs remain vulnerable to hallucinations, the discrepancy between the factual multimodal input and the generated textual output, which has limited their applicability in various real-world scenarios. This paper presents the first systematic investigation of hallucinations in LMMs involving the three most common modalities: language, visual, and audio. Our study reveals two key contributors to hallucinations: overreliance on unimodal priors and spurious inter-modality correlations. To address these challenges, we introduce the benchmark The Curse of Multi-Modalities (CMM), which comprehensively evaluates hallucinations in LMMs, providing a detailed analysis of their underlying issues. Our findings highlight key vulnerabilities, including imbalances in modality integration and biases from training data, underscoring the need for balanced cross-modal learning and enhanced hallucination mitigation strategies. Based on our observations and findings, we suggest potential research directions that could enhance the reliability of LMMs.

arXiv information

Authors	Sicong Leng, Yun Xing, Zesen Cheng, Yang Zhou, Hang Zhang, Xin Li, Deli Zhao, Shijian Lu, Chunyan Miao, Lidong Bing
Published	2024-10-16 17:59:02+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
