GenArtist: Multimodal LLM as an Agent for Unified Image Generation and Editing

Summary

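Despite the success of existing image generation and editing methods, current models still struggle with complex problems such as intricate text prompts, and because they lack verification and self-correction mechanisms, the images they generate are unreliable.
Moreover, a single model tends to specialize in particular tasks and offers only the corresponding capabilities, so it cannot satisfy every user requirement.
We propose GenArtist, a unified image generation and editing system coordinated by a multimodal large language model (MLLM) agent.
We integrate a comprehensive range of existing models into a tool library and use the agent to select and execute tools.
For a complex problem, the MLLM agent decomposes it into simpler sub-problems and builds a tree structure to systematically plan generation, editing, and self-correction, verifying each step along the way.
By automatically generating missing position-related inputs and incorporating position information, the agent can apply the appropriate tool effectively to each sub-problem.
Experiments demonstrate that GenArtist can perform a variety of generation and editing tasks, achieving state-of-the-art performance and surpassing existing models such as SDXL and DALL-E 3, as shown in Fig. 1. The project page is https://zhenyuw16.github.io/GenArtist_page.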
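
The decompose, plan, verify, and self-correct loop described in the summary can be illustrated with a toy sketch. This is not the GenArtist implementation: the "image" is modeled as a plain set of object names, and `toy_generate`, `toy_add_object`, `verify`, and `run_agent` are hypothetical stand-ins for the real generation tool, editing tool, MLLM verifier, and agent. The sketch only shows how failed verifications could become edit nodes in a planning tree.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple

# Assumption: an "image" is just the set of objects it depicts.
Image = Set[str]


@dataclass
class Node:
    """One step in the planning tree: a sub-problem and the tool chosen for it."""
    sub_problem: str
    tool: str
    children: List["Node"] = field(default_factory=list)


def toy_generate(required: List[str]) -> Image:
    # Hypothetical generation tool that renders only the first two objects,
    # so that the correction path below is exercised.
    return set(required[:2])


def toy_add_object(image: Image, obj: str) -> Image:
    # Hypothetical editing tool that adds one object to the image.
    return image | {obj}


def verify(image: Image, required: List[str]) -> List[str]:
    # Hypothetical verifier: list the required objects missing from the image.
    return [obj for obj in required if obj not in image]


def run_agent(required: List[str], max_rounds: int = 3) -> Tuple[Node, Image]:
    # Root node: the initial generation step for the whole prompt.
    root = Node("generate: " + ", ".join(required), "toy_generate")
    image = toy_generate(required)
    for _ in range(max_rounds):
        missing = verify(image, required)
        if not missing:
            break  # verification passed, no further correction needed
        for obj in missing:
            # Each failed check becomes an edit sub-problem under the root.
            root.children.append(Node(f"add missing object: {obj}", "toy_add_object"))
            image = toy_add_object(image, obj)
    return root, image


tree, final_image = run_agent(["cat", "dog", "ball"])
for child in tree.children:
    print(child.tool, "->", child.sub_problem)
```

In this toy setting the tree has depth one; a fuller planner could attach alternative tools as sibling nodes and backtrack when verification keeps failing.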

Abstract (original)

Despite the success achieved by existing image generation and editing methods, current models still struggle with complex problems including intricate text prompts, and the absence of verification and self-correction mechanisms makes the generated images unreliable. Meanwhile, a single model tends to specialize in particular tasks and possess the corresponding capabilities, making it inadequate for fulfilling all user requirements. We propose GenArtist, a unified image generation and editing system, coordinated by a multimodal large language model (MLLM) agent. We integrate a comprehensive range of existing models into the tool library and utilize the agent for tool selection and execution. For a complex problem, the MLLM agent decomposes it into simpler sub-problems and constructs a tree structure to systematically plan the procedure of generation, editing, and self-correction with step-by-step verification. By automatically generating missing position-related inputs and incorporating position information, the appropriate tool can be effectively employed to address each sub-problem. Experiments demonstrate that GenArtist can perform various generation and editing tasks, achieving state-of-the-art performance and surpassing existing models such as SDXL and DALL-E 3, as can be seen in Fig. 1. Project page is https://zhenyuw16.github.io/GenArtist_page.

arXiv information

Authors	Zhenyu Wang, Aoxue Li, Zhenguo Li, Xihui Liu
Published	2024-10-28 14:08:13+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
