CTRL-O: Language-Controllable Object-Centric Visual Representation Learning

Summary

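Object-centric representation learning aims to decompose visual scenes into fixed-size vectors called "slots" or "object files", where each slot captures a distinct object.
Current state-of-the-art object-centric models have shown remarkable success at object discovery across diverse domains, including complex real-world scenes.
However, these models suffer from a key limitation: they lack controllability.
Specifically, current object-centric models learn representations based on their own preconceived notion of objects, with no way for user input to guide which objects are represented.
Introducing controllability into object-centric models could unlock a range of useful capabilities, such as extracting instance-specific representations from a scene.
In this work, we propose a novel approach for user-directed control over slot representations by conditioning slots on language descriptions.
The controllable object-centric representation learning approach, which we call CTRL-O, achieves targeted object-language binding in complex real-world scenes without requiring mask supervision.
Next, we apply these controllable slot representations to two downstream vision-language tasks: text-to-image generation and visual question answering.
The proposed approach enables instance-specific text-to-image generation and also achieves strong performance on visual question answering.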

Summary (original)

Object-centric representation learning aims to decompose visual scenes into fixed-size vectors called ‘slots’ or ‘object files’, where each slot captures a distinct object. Current state-of-the-art object-centric models have shown remarkable success in object discovery in diverse domains, including complex real-world scenes. However, these models suffer from a key limitation: they lack controllability. Specifically, current object-centric models learn representations based on their preconceived understanding of objects, without allowing user input to guide which objects are represented. Introducing controllability into object-centric models could unlock a range of useful capabilities, such as the ability to extract instance-specific representations from a scene. In this work, we propose a novel approach for user-directed control over slot representations by conditioning slots on language descriptions. The proposed ConTRoLlable Object-centric representation learning approach, which we term CTRL-O, achieves targeted object-language binding in complex real-world scenes without requiring mask supervision. Next, we apply these controllable slot representations on two downstream vision language tasks: text-to-image generation and visual question answering. The proposed approach enables instance-specific text-to-image generation and also achieves strong performance on visual question answering.

arXiv information

Authors	Aniket Didolkar, Andrii Zadaianchuk, Rabiul Awal, Maximilian Seitzer, Efstratios Gavves, Aishwarya Agrawal
Published	2025-03-27 17:53:50+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
