MME-Unify: A Comprehensive Benchmark for Unified Multimodal Understanding and Generation Models

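Abstract

Existing MLLM benchmarks face significant challenges when evaluating unified MLLMs (U-MLLMs), for two reasons: 1) there are no standardized benchmarks for traditional tasks, which leads to inconsistent comparisons; 2) there are no benchmarks for mixed-modality generation, so multimodal reasoning capabilities cannot be assessed.
We present a comprehensive evaluation framework designed to systematically evaluate U-MLLMs. Our benchmark includes:
1. Standardized traditional task evaluation. We sample from 12 datasets, covering 10 tasks with 30 subtasks, to ensure consistent and fair comparisons across studies.
2. Unified task evaluation. We introduce five novel tasks that test multimodal reasoning, including image editing, commonsense QA with image generation, and geometric reasoning.
3. Comprehensive model benchmarking. We evaluate 12 leading U-MLLMs, such as Janus-Pro, EMU3, VILA-U, and Gemini2-flash, alongside specialized understanding models (e.g., Claude-3.5-Sonnet) and generation models (e.g., DALL-E-3).
Our findings reveal substantial performance gaps in existing U-MLLMs, highlighting the need for more robust models that can handle mixed-modality tasks effectively.
The code and evaluation data are available at https://mme-unify.github.io/.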

Abstract (original)

Existing MLLM benchmarks face significant challenges in evaluating Unified MLLMs (U-MLLMs) due to: 1) lack of standardized benchmarks for traditional tasks, leading to inconsistent comparisons; 2) absence of benchmarks for mixed-modality generation, which fails to assess multimodal reasoning capabilities. We present a comprehensive evaluation framework designed to systematically assess U-MLLMs. Our benchmark includes: 1. Standardized Traditional Task Evaluation. We sample from 12 datasets, covering 10 tasks with 30 subtasks, ensuring consistent and fair comparisons across studies. 2. Unified Task Assessment. We introduce five novel tasks testing multimodal reasoning, including image editing, commonsense QA with image generation, and geometric reasoning. 3. Comprehensive Model Benchmarking. We evaluate 12 leading U-MLLMs, such as Janus-Pro, EMU3, VILA-U, and Gemini2-flash, alongside specialized understanding (e.g., Claude-3.5-Sonnet) and generation models (e.g., DALL-E-3). Our findings reveal substantial performance gaps in existing U-MLLMs, highlighting the need for more robust models capable of handling mixed-modality tasks effectively. The code and evaluation data can be found at https://mme-unify.github.io/.

arXiv information

Authors	Wulin Xie, Yi-Fan Zhang, Chaoyou Fu, Yang Shi, Bingyan Nie, Hongkai Chen, Zhang Zhang, Liang Wang, Tieniu Tan
Published	2025-04-07 16:12:54+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
