Can Large Language Models Help Multimodal Language Analysis? MMLA: A Comprehensive Benchmark

Summary

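Multimodal language analysis is a rapidly evolving field that draws on multiple modalities to improve the understanding of the high-level semantics underlying human conversational utterances.
Despite its importance, little research has investigated how well multimodal large language models (MLLMs) comprehend cognitive-level semantics.
In this paper, we introduce MMLA, a comprehensive benchmark designed specifically to address this gap.
MMLA comprises over 61K multimodal utterances drawn from both staged and real-world scenarios, covering six core dimensions of multimodal semantics: intent, emotion, dialogue act, sentiment, speaking style, and communication behavior.
We evaluate eight mainstream branches of LLMs and MLLMs using three methods: zero-shot inference, supervised fine-tuning, and instruction tuning.
Extensive experiments show that even fine-tuned models reach only about 60% to 70% accuracy, underscoring the limitations of current MLLMs in understanding complex human language.
We believe MMLA will serve as a solid foundation for exploring the potential of large language models in multimodal language analysis and will provide valuable resources for advancing the field.
The datasets and code are open-sourced at https://github.com/thuiar/MMLA.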

Abstract (original)

Multimodal language analysis is a rapidly evolving field that leverages multiple modalities to enhance the understanding of high-level semantics underlying human conversational utterances. Despite its significance, little research has investigated the capability of multimodal large language models (MLLMs) to comprehend cognitive-level semantics. In this paper, we introduce MMLA, a comprehensive benchmark specifically designed to address this gap. MMLA comprises over 61K multimodal utterances drawn from both staged and real-world scenarios, covering six core dimensions of multimodal semantics: intent, emotion, dialogue act, sentiment, speaking style, and communication behavior. We evaluate eight mainstream branches of LLMs and MLLMs using three methods: zero-shot inference, supervised fine-tuning, and instruction tuning. Extensive experiments reveal that even fine-tuned models achieve only about 60%~70% accuracy, underscoring the limitations of current MLLMs in understanding complex human language. We believe that MMLA will serve as a solid foundation for exploring the potential of large language models in multimodal language analysis and provide valuable resources to advance this field. The datasets and code are open-sourced at https://github.com/thuiar/MMLA.

arXiv information

Authors	Hanlei Zhang, Zhuohang Li, Yeshuang Zhu, Hua Xu, Peiwu Wang, Haige Zhu, Jie Zhou, Jinchao Zhang
Published	2025-04-24 07:35:03+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
