VisorGPT: Learning Visual Prior via Generative Pre-Training

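Abstract

The various stuff and things in visual data have characteristic traits that deep neural networks can learn and that are represented implicitly in the model as a visual prior, for example the location and shape of objects.
Such a prior can affect many vision tasks.
In conditional image synthesis, for example, spatial conditions that do not conform to the prior can produce visually inaccurate results.
This work aims to learn the visual prior explicitly and to make sampling from it customizable.
Inspired by advances in language modeling, we propose learning the visual prior through generative pre-training, an approach dubbed VisorGPT.
By discretizing the visual locations of objects (bounding boxes, human poses, instance masks, and so on) into sequences, VisorGPT can model the visual prior through likelihood maximization.
In addition, prompt engineering is investigated as a way to unify the various visual locations and to enable customized sampling of sequential outputs from the learned prior.
Experimental results show that VisorGPT can model the visual prior effectively, and that the prior can be used in many vision tasks, such as customizing accurate human poses for conditional image synthesis models like ControlNet.
The code will be released at https://github.com/Sierkinhane/VisorGPT.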

Abstract (original)

Various stuff and things in visual data possess specific traits, which can be learned by deep neural networks and are implicitly represented as the visual prior, e.g., object location and shape, in the model. Such prior potentially impacts many vision tasks. For example, in conditional image synthesis, spatial conditions failing to adhere to the prior can result in visually inaccurate synthetic results. This work aims to explicitly learn the visual prior and enable the customization of sampling. Inspired by advances in language modeling, we propose to learn Visual prior via Generative Pre-Training, dubbed VisorGPT. By discretizing visual locations of objects, e.g., bounding boxes, human pose, and instance masks, into sequences, VisorGPT can model visual prior through likelihood maximization. Besides, prompt engineering is investigated to unify various visual locations and enable customized sampling of sequential outputs from the learned prior. Experimental results demonstrate that VisorGPT can effectively model the visual prior, which can be employed for many vision tasks, such as customizing accurate human pose for conditional image synthesis models like ControlNet. Code will be released at https://github.com/Sierkinhane/VisorGPT.
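
The abstract describes discretizing visual locations such as bounding boxes into sequences that a language model can be trained on. The following is a minimal sketch of that general idea, not the authors' implementation: the `quantize` and `boxes_to_sequence` helpers, the prompt-like prefix, the bracketed token layout, and the default of 512 bins are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical record layout: (category, x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[str, float, float, float, float]


def quantize(value: float, extent: float, num_bins: int) -> int:
    """Map a continuous coordinate in [0, extent] to an integer bin in [0, num_bins - 1]."""
    ratio = min(max(value / extent, 0.0), 1.0)  # clamp out-of-range coordinates
    return min(int(ratio * num_bins), num_bins - 1)


def boxes_to_sequence(
    boxes: List[Box], width: float, height: float, num_bins: int = 512
) -> str:
    """Serialize boxes into one text sequence (assumed format, for illustration only)."""
    # Assumed prompt-like prefix stating the annotation type and instance count.
    parts = [f"box; instances: {len(boxes)};"]
    for category, x0, y0, x1, y1 in boxes:
        coords = [
            quantize(x0, width, num_bins),
            quantize(y0, height, num_bins),
            quantize(x1, width, num_bins),
            quantize(y1, height, num_bins),
        ]
        parts.append(f"{category} [" + " ".join(str(c) for c in coords) + "]")
    return " ".join(parts)


if __name__ == "__main__":
    example = [("person", 0.0, 0.0, 320.0, 480.0)]
    print(boxes_to_sequence(example, width=640.0, height=480.0))
```

Sequences of this kind could then be tokenized and used to train an autoregressive model with a standard next-token likelihood objective; sampling from such a model with a chosen prefix is one way to picture the customized sampling the abstract mentions.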

arXiv information

Authors	Jinheng Xie, Kai Ye, Yudong Li, Yuexiang Li, Kevin Qinghong Lin, Yefeng Zheng, Linlin Shen, Mike Zheng Shou
Published	2023-05-30 15:12:41+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
