Boosting Text-To-Image Generation via Multilingual Prompting in Large Multimodal Models

Abstract

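Previous work on strengthening large multimodal models (LMMs) for text-to-image (T2I) generation has focused on enriching the input space of in-context learning (ICL).
This includes providing a few demonstrations and optimizing image descriptions to be more detailed and logical.
However, as demand grows for more complex and flexible image descriptions, improving comprehension of the input text within the ICL paradigm remains an important yet underexplored area.
In this work, we extend this line of research by constructing parallel multilingual prompts that aim to exploit the multilingual capabilities of LMMs.
More specifically, we translate the input text into several languages and provide the model with both the original text and its translations.
Experiments with two LMMs across three benchmarks show that our method, PMT2I, achieves superior performance in general, compositional, and fine-grained evaluations, especially in alignment with human preferences.
In addition, because PMT2I generates more diverse images, it significantly outperforms baseline prompts when combined with reranking methods.
Our code and parallel multilingual data are available at https://github.com/takagi97/PMT2I.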
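
As a rough illustration of the parallel multilingual prompting idea described above, the following Python sketch assembles a single prompt from an original description and its translations. The function name `build_parallel_prompt`, the `Language: text` line layout, and the closing instruction are all assumptions made for illustration; they are not taken from the PMT2I code, which may format its prompts differently. The translations are supplied by the caller, so no translation service is assumed.

```python
from typing import Mapping


def build_parallel_prompt(
    original: str,
    translations: Mapping[str, str],
    source_lang: str = "English",
) -> str:
    """Combine an image description with its translations into one prompt.

    The layout used here (one "Language: text" line per version, followed
    by a short instruction) is a hypothetical template, not the paper's.
    """
    # The original text always comes first.
    lines = [f"{source_lang}: {original.strip()}"]
    # Append each non-empty translation in the order given by the caller.
    for lang, text in translations.items():
        text = text.strip()
        if text:
            lines.append(f"{lang}: {text}")
    lines.append("Generate an image that matches the descriptions above.")
    return "\n".join(lines)


prompt = build_parallel_prompt(
    "A red bicycle leaning against a brick wall",
    {
        "German": "Ein rotes Fahrrad lehnt an einer Backsteinmauer",
        "French": "Un vélo rouge appuyé contre un mur de briques",
    },
)
print(prompt)
```

The resulting string would then be passed to an LMM in place of the single-language description. Which languages to include, and how many, is left to the caller in this sketch.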

Abstract (original)

Previous work on augmenting large multimodal models (LMMs) for text-to-image (T2I) generation has focused on enriching the input space of in-context learning (ICL). This includes providing a few demonstrations and optimizing image descriptions to be more detailed and logical. However, as demand for more complex and flexible image descriptions grows, enhancing comprehension of input text within the ICL paradigm remains a critical yet underexplored area. In this work, we extend this line of research by constructing parallel multilingual prompts aimed at harnessing the multilingual capabilities of LMMs. More specifically, we translate the input text into several languages and provide the models with both the original text and the translations. Experiments on two LMMs across 3 benchmarks show that our method, PMT2I, achieves superior performance in general, compositional, and fine-grained assessments, especially in human preference alignment. Additionally, with its advantage of generating more diverse images, PMT2I significantly outperforms baseline prompts when incorporated with reranking methods. Our code and parallel multilingual data can be found at https://github.com/takagi97/PMT2I.

arXiv information

Authors	Yongyu Mu, Hengyu Li, Junxin Wang, Xiaoxuan Zhou, Chenglong Wang, Yingfeng Luo, Qiaozhi He, Tong Xiao, Guocheng Chen, Jingbo Zhu
Published	2025-01-13 06:41:23+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
