Prioritizing Image-Related Tokens Enhances Vision-Language Pre-Training

Summary

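In standard pre-training of large vision-language models (LVLMs), the model typically maximizes the joint probability of the caption conditioned on the image via next-token prediction (NTP).
However, because only a small subset of caption tokens relates directly to the visual content, this naive NTP unintentionally fits the model to noise and raises the risk of hallucination.
We present PRIOR, a simple vision-language pre-training approach that draws on the importance sampling framework and addresses this issue by prioritizing image-related tokens through differential weighting of the NTP loss.
PRIOR introduces a reference model, a text-only large language model (LLM) trained on the captions without image inputs, and uses the probability it assigns to each token to weight that token during LVLM training.
Intuitively, tokens that relate directly to the visual input are harder to predict without the image and therefore receive lower probabilities from the text-only reference LLM.
During training, we implement a token-specific re-weighting term, based on the importance scores, that adjusts each token's loss.
We implement PRIOR in two distinct settings: LVLMs with visual encoders and LVLMs without visual encoders.
Compared with NTP, we observe average relative improvements of 19% and 8%, respectively, on several vision-language benchmarks.
In addition, PRIOR exhibits superior scaling properties, as shown by significantly higher scaling coefficients, indicating greater potential for performance gains than NTP as compute and data increase.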
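
The token-level re-weighting described in the summary can be illustrated with a small sketch. The abstract does not give the exact weighting formula, so the scoring rule used here, `(1 - p_ref) ** alpha` normalized to an average of 1, is an assumption made purely for illustration. The function name `reweighted_ntp_loss` and the `alpha` parameter are hypothetical and do not come from the paper.

```python
import math


def reweighted_ntp_loss(lvlm_logprobs, ref_probs, alpha=1.0):
    """Weighted next-token-prediction loss for a single caption.

    lvlm_logprobs: log-probability the LVLM assigns to each caption token
                   (conditioned on the image and the preceding tokens).
    ref_probs:     probability the text-only reference LLM assigns to the
                   same tokens (no image input).
    alpha:         hypothetical sharpness knob; NOT taken from the source.
    """
    if not lvlm_logprobs or len(lvlm_logprobs) != len(ref_probs):
        raise ValueError("need one reference probability per caption token")

    # Assumed scoring rule: tokens the reference LLM finds hard to predict
    # (low probability without the image) receive a higher importance score.
    scores = [(1.0 - p) ** alpha for p in ref_probs]
    total = sum(scores)

    n = len(scores)
    if total == 0.0:
        # Every token is fully predictable from text; fall back to plain NTP.
        weights = [1.0] * n
    else:
        # Normalize so the weights average to 1, which keeps the loss on the
        # same scale as the unweighted NTP loss.
        weights = [s * n / total for s in scores]

    loss = -sum(w * lp for w, lp in zip(weights, lvlm_logprobs)) / n
    return loss, weights


# Toy caption: "a red bus on the street" (all numbers are made up).
tokens = ["a", "red", "bus", "on", "the", "street"]
ref_probs = [0.60, 0.05, 0.10, 0.70, 0.80, 0.30]
lvlm_logprobs = [math.log(p) for p in [0.70, 0.40, 0.50, 0.80, 0.90, 0.60]]

loss, weights = reweighted_ntp_loss(lvlm_logprobs, ref_probs)
for tok, w in zip(tokens, weights):
    print(f"{tok:>7s}  weight={w:.2f}")
print(f"weighted loss = {loss:.4f}")
```

In this toy example the visually grounded tokens ("red", "bus") receive larger weights than function words ("on", "the"), which matches the intuition stated in the summary. In a real training loop the per-token log-probabilities would come from the model's per-token cross-entropy rather than from hand-written lists.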

Summary (original)

In standard large vision-language models (LVLMs) pre-training, the model typically maximizes the joint probability of the caption conditioned on the image via next-token prediction (NTP); however, since only a small subset of caption tokens directly relates to the visual content, this naive NTP unintentionally fits the model to noise and increases the risk of hallucination. We present PRIOR, a simple vision-language pre-training approach that addresses this issue by prioritizing image-related tokens through differential weighting in the NTP loss, drawing from the importance sampling framework. PRIOR introduces a reference model, a text-only large language model (LLM) trained on the captions without image inputs, to weight each token based on its probability for LVLMs training. Intuitively, tokens that are directly related to the visual inputs are harder to predict without the image and thus receive lower probabilities from the text-only reference LLM. During training, we implement a token-specific re-weighting term based on the importance scores to adjust each token’s loss. We implement PRIOR in two distinct settings: LVLMs with visual encoders and LVLMs without visual encoders. We observe 19% and 8% average relative improvement, respectively, on several vision-language benchmarks compared to NTP. In addition, PRIOR exhibits superior scaling properties, as demonstrated by significantly higher scaling coefficients, indicating greater potential for performance gains compared to NTP given increasing compute and data.

arXiv information

Authors	Yangyi Chen, Hao Peng, Tong Zhang, Heng Ji
Published	2025-05-13 21:27:52+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
