Localized Symbolic Knowledge Distillation for Visual Commonsense Models

Summary

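Instruction-following vision-language (VL) models provide a flexible interface that supports a wide range of multimodal tasks in a zero-shot manner.
However, interfaces that operate on whole images give users no direct way to "point to" and access specific regions within an image.
This capability matters both for supporting reference-grounded VL benchmarks and for practical applications that require precise within-image reasoning.
We build Localized Visual Commonsense models, which let users specify one or more regions as input.
We train the model by sampling localized commonsense knowledge from a large language model (LLM): specifically, we prompt the LLM to collect commonsense knowledge given a global literal image description and local literal region descriptions that are generated automatically by a set of VL models.
Using a separately trained critic model that selects high-quality examples, we find that training on the localized commonsense corpus successfully distills existing VL models so that they support a reference-as-input interface.
Empirical results and human evaluations in a zero-shot setting show that our distillation method yields VL models that reason more precisely than a baseline that passes a generated referring expression to an LLM.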
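
The data-collection loop described above (describe the image and its regions, prompt an LLM, keep only what a critic accepts) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Region` type, the prompt wording, `build_prompt`, `filter_with_critic`, the 0.5 threshold, and the rule-based `toy_critic` are hypothetical stand-ins, not the authors' prompts, models, or code. In the paper the critic is a separately trained model and the candidates come from an LLM.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Region:
    region_id: int
    description: str  # literal description of one image region


def build_prompt(global_description: str, regions: List[Region]) -> str:
    """Combine the global and per-region descriptions into one LLM prompt."""
    lines = [f"Image: {global_description}"]
    for region in regions:
        lines.append(f"Region [{region.region_id}]: {region.description}")
    # Hypothetical instruction text; the real prompt is not given in the abstract.
    lines.append("Write a commonsense question and answer that refer to regions by ID.")
    return "\n".join(lines)


def filter_with_critic(
    candidates: List[str],
    critic: Callable[[str], float],
    threshold: float = 0.5,  # assumed cutoff, chosen only for this sketch
) -> List[str]:
    """Keep only the candidates that the critic scores at or above the threshold."""
    return [c for c in candidates if critic(c) >= threshold]


def toy_critic(text: str) -> float:
    """Stand-in critic: accepts only text that cites a region ID such as [0]."""
    return 1.0 if re.search(r"\[\d+\]", text) else 0.0


regions = [Region(0, "a man holding an umbrella"), Region(1, "a wet sidewalk")]
prompt = build_prompt("a rainy street at night", regions)

# Hand-written stand-ins for LLM outputs; no model is called here.
candidates = [
    "Q: Why is [0] holding an umbrella? A: It is raining, as [1] suggests.",
    "Q: What is happening? A: Something.",
]
kept = filter_with_critic(candidates, toy_critic)
```

Here `kept` holds the examples that would go into the localized commonsense corpus. Passing the critic as a plain callable keeps the filtering step independent of whichever scoring model is actually used.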

Summary (original)

Instruction following vision-language (VL) models offer a flexible interface that supports a broad range of multimodal tasks in a zero-shot fashion. However, interfaces that operate on full images do not directly enable the user to ‘point to’ and access specific regions within images. This capability is important not only to support reference-grounded VL benchmarks, but also, for practical applications that require precise within-image reasoning. We build Localized Visual Commonsense models, which allow users to specify (multiple) regions as input. We train our model by sampling localized commonsense knowledge from a large language model (LLM): specifically, we prompt an LLM to collect commonsense knowledge given a global literal image description and a local literal region description automatically generated by a set of VL models. With a separately trained critic model that selects high-quality examples, we find that training on the localized commonsense corpus can successfully distill existing VL models to support a reference-as-input interface. Empirical results and human evaluations in a zero-shot setup demonstrate that our distillation method results in more precise VL models of reasoning compared to a baseline of passing a generated referring expression to an LLM.

arXiv information

Authors	Jae Sung Park, Jack Hessel, Khyathi Raghavi Chandu, Paul Pu Liang, Ximing Lu, Peter West, Youngjae Yu, Qiuyuan Huang, Jianfeng Gao, Ali Farhadi, Yejin Choi
Published	2023-12-12 05:48:14+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
