Compile Scene Graphs with Reinforcement Learning

Summary

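Next-token prediction is the fundamental principle for training large language models (LLMs), and reinforcement learning (RL) further improves their reasoning performance.
Although LLMs are an effective way to model language, images, video, and other modalities, their use for end-to-end extraction of structured visual representations such as scene graphs remains underexplored.
The task requires the model to accurately produce a set of objects and relationship triplets, rather than generating text token by token.
To achieve this, the authors introduce R1-SGG, a multimodal LLM (M-LLM) that is first trained via supervised fine-tuning (SFT) on a scene graph dataset and then refined with reinforcement learning to strengthen its ability to generate scene graphs end to end.
SFT follows the conventional prompt-response paradigm, whereas RL requires the design of effective reward signals.
Given the structured nature of scene graphs, the authors design a graph-centric reward function that integrates node-level rewards, edge-level rewards, and a format consistency reward.
Their experiments show that rule-based RL substantially improves model performance on the scene graph generation (SGG) task and achieves a zero failure rate, unlike SFT, which struggles to generalize effectively.
The code is available at https://github.com/gpt4vision/R1-SGG.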

Summary (original)

Next token prediction is the fundamental principle for training large language models (LLMs), and reinforcement learning (RL) further enhances their reasoning performance. As an effective way to model language, image, video, and other modalities, the use of LLMs for end-to-end extraction of structured visual representations, such as scene graphs, remains underexplored. It requires the model to accurately produce a set of objects and relationship triplets, rather than generating text token by token. To achieve this, we introduce R1-SGG, a multimodal LLM (M-LLM) initially trained via supervised fine-tuning (SFT) on the scene graph dataset and subsequently refined using reinforcement learning to enhance its ability to generate scene graphs in an end-to-end manner. The SFT follows a conventional prompt-response paradigm, while RL requires the design of effective reward signals. Given the structured nature of scene graphs, we design a graph-centric reward function that integrates node-level rewards, edge-level rewards, and a format consistency reward. Our experiments demonstrate that rule-based RL substantially enhances model performance in the SGG task, achieving a zero failure rate–unlike supervised fine-tuning (SFT), which struggles to generalize effectively. Our code is available at https://github.com/gpt4vision/R1-SGG.
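
The abstract names three reward components (node-level, edge-level, and format consistency) but does not give their formulas. The sketch below is one plausible way such a rule-based reward could be assembled, not the authors' implementation. Everything specific in it is an assumption made for illustration: the JSON schema (`objects` with `id`, `category`, `bbox`; `relationships` with `subject`, `predicate`, `object`), the use of box IoU for node matching, the category-level triplet matching, and the unweighted sum of the three terms.

```python
import json


def box_iou(a, b):
    """IoU of two boxes given as [x1, y1, x2, y2]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def is_valid_graph(graph):
    """Check the (assumed) schema so later steps never hit malformed entries."""
    if not isinstance(graph, dict):
        return False
    objs, rels = graph.get("objects"), graph.get("relationships")
    if not isinstance(objs, list) or not isinstance(rels, list):
        return False
    for o in objs:
        if not (isinstance(o, dict) and isinstance(o.get("id"), str)
                and isinstance(o.get("category"), str)
                and isinstance(o.get("bbox"), list) and len(o["bbox"]) == 4
                and all(isinstance(v, (int, float)) for v in o["bbox"])):
            return False
    keys = ("subject", "predicate", "object")
    return all(isinstance(r, dict) and all(isinstance(r.get(k), str) for k in keys)
               for r in rels)


def node_reward(pred, gt):
    """Mean over ground-truth objects of the best IoU among same-category predictions.

    Simplification: no one-to-one matching is enforced.
    """
    if not gt["objects"]:
        return 0.0
    total = 0.0
    for g in gt["objects"]:
        ious = [box_iou(p["bbox"], g["bbox"])
                for p in pred["objects"] if p["category"] == g["category"]]
        total += max(ious, default=0.0)
    return total / len(gt["objects"])


def triplets(graph):
    """Category-level (subject, predicate, object) triplets; dangling ids are skipped."""
    cat = {o["id"]: o["category"] for o in graph["objects"]}
    return {(cat[r["subject"]], r["predicate"], cat[r["object"]])
            for r in graph["relationships"]
            if r["subject"] in cat and r["object"] in cat}


def edge_reward(pred, gt):
    """Fraction of ground-truth triplets that also appear in the prediction."""
    gt_t = triplets(gt)
    return len(gt_t & triplets(pred)) / len(gt_t) if gt_t else 0.0


def graph_reward(response_text, gt):
    """Format + node + edge reward; an unparseable response scores 0.0."""
    try:
        pred = json.loads(response_text)
    except (json.JSONDecodeError, TypeError):
        return 0.0
    if not is_valid_graph(pred):
        return 0.0
    return 1.0 + node_reward(pred, gt) + edge_reward(pred, gt)
```

Under these assumptions, a response that reproduces the ground-truth graph exactly scores the maximum of 3.0, a well-formed response that finds only some objects and no relationships earns partial credit, and any response that fails to parse scores 0.0. That last case is what ties the format term to the "failure rate" the abstract mentions.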

arXiv information

Authors	Zuyao Chen, Jinlin Wu, Zhen Lei, Marc Pollefeys, Chang Wen Chen
Published	2025-04-18 10:46:22+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
