ScImage: How Good Are Multimodal Large Language Models at Scientific Text-to-Image Generation?

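Summary

Multimodal large language models (LLMs) have demonstrated impressive capabilities in generating high-quality images from text instructions. However, their performance in generating scientific images, an important application for accelerating scientific progress, has not yet been studied sufficiently. This work addresses that gap by introducing ScImage, a benchmark designed to evaluate the multimodal capabilities of LLMs when generating scientific images from text descriptions. ScImage evaluates three key dimensions (spatial, numeric, and attribute understanding) and their combinations. Five models (GPT-4o, Llama, AutomaTikZ, Dall-E, and StableDiffusion) are evaluated using two output modes: code-based output (Python, TikZ) and direct raster image generation. Four input languages are also examined: English, German, Farsi, and Chinese. The evaluation shows that GPT-4o produces output of decent quality for simpler prompts involving a single dimension (spatial, numeric, or attribute understanding), while all models face challenges in this task, especially for more complex prompts.

Summary (original)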

Multimodal large language models (LLMs) have demonstrated impressive capabilities in generating high-quality images from textual instructions. However, their performance in generating scientific images, a critical application for accelerating scientific progress, remains underexplored. In this work, we address this gap by introducing ScImage, a benchmark designed to evaluate the multimodal capabilities of LLMs in generating scientific images from textual descriptions. ScImage assesses three key dimensions of understanding: spatial, numeric, and attribute comprehension, as well as their combinations, focusing on the relationships between scientific objects (e.g., squares, circles). We evaluate five models, GPT-4o, Llama, AutomaTikZ, Dall-E, and StableDiffusion, using two modes of output generation: code-based outputs (Python, TikZ) and direct raster image generation. Additionally, we examine four different input languages: English, German, Farsi, and Chinese. Our evaluation, conducted with 11 scientists across three criteria (correctness, relevance, and scientific accuracy), reveals that while GPT-4o produces outputs of decent quality for simpler prompts involving individual dimensions such as spatial, numeric, or attribute understanding in isolation, all models face challenges in this task, especially for more complex prompts.

arXiv information

Authors	Leixin Zhang, Steffen Eger, Yinjie Cheng, Weihe Zhai, Jonas Belouadi, Christoph Leiter, Simone Paolo Ponzetto, Fahimeh Moafian, Zhixue Zhao
Published	2024-12-03 10:52:06+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
