MonetGPT: Solving Puzzles Enhances MLLMs’ Image Retouching Skills

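Abstract

Retouching is an essential post-processing step for raw photographs.
Generative editing guided by text or strokes gives users an accessible new tool, but it can easily alter the identity of the original objects in unacceptable and unpredictable ways.
In contrast, traditional procedural edits, as commonly supported by photo-editing tools (e.g., Gimp, Lightroom), are conservative yet still preferred by professionals.
Unfortunately, professional-quality retouching involves many individual procedural editing operations, which most novices find hard to plan.
In this paper, we ask whether a multimodal large language model (MLLM) can be taught to critique raw photographs, suggest suitable remedies, and finally realize them with a given set of pre-authored procedural image operations.
We demonstrate that MLLMs can first be made aware of the underlying image processing operations by training them to solve specially designed visual puzzles.
Such an operation-aware MLLM can then both plan and propose edit sequences.
To facilitate training, given a set of expert-edited photos, we synthesize a reasoning dataset by procedurally manipulating the expert edits and then grounding a pretrained LLM on the visual adjustments to synthesize reasoning for finetuning.
The proposed retouching operations are, by construction, understandable to users, preserve object details and resolution, and can optionally be overridden.
We evaluate our setup on a variety of test examples and show advantages over existing generative and other procedural alternatives in terms of explainability and identity preservation.
Code, data, models, and supplementary results are available via the project website at https://monetgpt.github.io.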
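
To make the idea of parameterized procedural edits and operation-recognition puzzles concrete, here is a minimal sketch in pure Python. Everything in it is an assumption made for illustration: the three operations, their formulas and value ranges, the function names, and the "which edit was applied?" puzzle format are hypothetical and are not taken from the paper's operation library or puzzle design.

```python
import random

# Hypothetical image format: a list of rows, each pixel an (r, g, b) tuple in [0, 1].

def clamp(x):
    """Keep a channel value inside the displayable range [0, 1]."""
    return max(0.0, min(1.0, x))

def adjust_exposure(img, stops):
    """Scale every channel by 2**stops (positive values brighten)."""
    gain = 2.0 ** stops
    return [[tuple(clamp(c * gain) for c in px) for px in row] for row in img]

def adjust_contrast(img, amount):
    """Stretch (amount > 0) or compress (amount < 0) values around mid-gray 0.5."""
    k = 1.0 + amount
    return [[tuple(clamp(0.5 + (c - 0.5) * k) for c in px) for px in row]
            for row in img]

def adjust_saturation(img, amount):
    """Push each pixel away from (amount > 0) or toward (amount < 0) its gray value."""
    k = 1.0 + amount
    out = []
    for row in img:
        new_row = []
        for r, g, b in row:
            gray = 0.299 * r + 0.587 * g + 0.114 * b  # Rec. 601 luma weights
            new_row.append(tuple(clamp(gray + (c - gray) * k) for c in (r, g, b)))
        out.append(new_row)
    return out

# Assumed operation library; a real retouching tool exposes many more operations.
OPERATIONS = {
    "exposure": adjust_exposure,
    "contrast": adjust_contrast,
    "saturation": adjust_saturation,
}

def apply_plan(img, plan):
    """Apply a list of (operation_name, value) steps in order."""
    for name, value in plan:
        img = OPERATIONS[name](img, value)
    return img

def make_puzzle(img, rng):
    """Build a 'which edit was applied?' puzzle: (before, after, answer)."""
    name = rng.choice(sorted(OPERATIONS))
    value = rng.choice([-0.5, -0.25, 0.25, 0.5])
    answer = {"operation": name, "value": value}
    return img, OPERATIONS[name](img, value), answer

# Usage: a tiny 1x2 "photo", an explicit edit plan, and one puzzle.
photo = [[(0.2, 0.4, 0.6), (0.5, 0.5, 0.5)]]
edited = apply_plan(photo, [("exposure", 0.5), ("saturation", 0.25)])
before, after, answer = make_puzzle(photo, random.Random(0))
```

Because each step is a named operation with an explicit value, a plan in this form can be read, adjusted, or overridden step by step, and it never resynthesizes pixels the way a generative editor does. That is the property the abstract attributes to procedural retouching.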

Abstract (original)

Retouching is an essential task in post-manipulation of raw photographs. Generative editing, guided by text or strokes, provides a new tool accessible to users but can easily change the identity of the original objects in unacceptable and unpredictable ways. In contrast, although traditional procedural edits, as commonly supported by photoediting tools (e.g., Gimp, Lightroom), are conservative, they are still preferred by professionals. Unfortunately, professional quality retouching involves many individual procedural editing operations that are challenging to plan for most novices. In this paper, we ask if a multimodal large language model (MLLM) can be taught to critique raw photographs, suggest suitable remedies, and finally realize them with a given set of pre-authored procedural image operations. We demonstrate that MLLMs can be first made aware of the underlying image processing operations, by training them to solve specially designed visual puzzles. Subsequently, such an operation-aware MLLM can both plan and propose edit sequences. To facilitate training, given a set of expert-edited photos, we synthesize a reasoning dataset by procedurally manipulating the expert edits and then grounding a pretrained LLM on the visual adjustments, to synthesize reasoning for finetuning. The proposed retouching operations are, by construction, understandable by the users, preserve object details and resolution, and can be optionally overridden. We evaluate our setup on a variety of test examples and show advantages, in terms of explainability and identity preservation, over existing generative and other procedural alternatives. Code, data, models, and supplementary results can be found via our project website at https://monetgpt.github.io.

arXiv information

Authors	Niladri Shekhar Dutt, Duygu Ceylan, Niloy J. Mitra
Published	2025-05-09 16:38:27+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
