Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing

Summary

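Large multi-modality models (LMMs) have made significant progress in visual understanding and generation, but they still face challenges in general visual editing, particularly in following complex instructions, preserving appearance consistency, and supporting flexible input formats.
To address this gap, the authors introduce RISEBench, the first benchmark for evaluating Reasoning-Informed viSual Editing (RISE).
RISEBench focuses on four key reasoning types: temporal, causal, spatial, and logical reasoning.
The authors curate high-quality test cases for each category and propose an evaluation framework that assesses Instruction Reasoning, Appearance Consistency, and Visual Plausibility, using both human judges and an LMM-as-a-judge approach.
In their experiments, GPT-4o-Native significantly outperforms the other open-source and proprietary models, yet even this state-of-the-art system struggles with logical reasoning tasks, which highlights an area that remains underexplored.
As an initial effort, RISEBench aims to provide foundational insights into reasoning-aware visual editing and to catalyze future research.
Although the benchmark is still in its early stages, the authors are committed to continuously expanding and refining it to support more comprehensive, reliable, and scalable evaluation of next-generation multimodal systems.
The code and data will be released at https://github.com/PhoenixZ810/RISEBench.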

Summary (original)

Large Multi-modality Models (LMMs) have made significant progress in visual understanding and generation, but they still face challenges in General Visual Editing, particularly in following complex instructions, preserving appearance consistency, and supporting flexible input formats. To address this gap, we introduce RISEBench, the first benchmark for evaluating Reasoning-Informed viSual Editing (RISE). RISEBench focuses on four key reasoning types: Temporal, Causal, Spatial, and Logical Reasoning. We curate high-quality test cases for each category and propose an evaluation framework that assesses Instruction Reasoning, Appearance Consistency, and Visual Plausibility with both human judges and an LMM-as-a-judge approach. Our experiments reveal that while GPT-4o-Native significantly outperforms other open-source and proprietary models, even this state-of-the-art system struggles with logical reasoning tasks, highlighting an area that remains underexplored. As an initial effort, RISEBench aims to provide foundational insights into reasoning-aware visual editing and to catalyze future research. Though still in its early stages, we are committed to continuously expanding and refining the benchmark to support more comprehensive, reliable, and scalable evaluations of next-generation multimodal systems. Our code and data will be released at https://github.com/PhoenixZ810/RISEBench.

arXiv information

Authors	Xiangyu Zhao, Peiyuan Zhang, Kexian Tang, Hao Li, Zicheng Zhang, Guangtao Zhai, Junchi Yan, Hua Yang, Xue Yang, Haodong Duan
Published	2025-04-08 16:43:44+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
