Token Preference Optimization with Self-Calibrated Visual-Anchored Rewards for Hallucination Mitigation

Abstract

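Direct Preference Optimization (DPO) has proven highly effective at mitigating hallucinations in Large Vision Language Models (LVLMs) by aligning their outputs more closely with human preferences.
Despite recent progress, existing methods have two drawbacks: 1) they lack scalable token-level rewards;
and 2) they neglect visual-anchored tokens.
To address these drawbacks, the authors propose a novel Token Preference Optimization model with self-calibrated rewards (dubbed TPO), which adaptively attends to visual-correlated tokens without fine-grained annotations.
Specifically, they introduce a token-level visual-anchored reward, defined as the difference between the logistic distributions of generated tokens conditioned on the raw image and on the corrupted one.
In addition, to highlight the informative visual-anchored tokens, they propose a visual-aware training objective that enables more accurate token-level optimization.
Extensive experiments demonstrate the state-of-the-art performance of the proposed TPO.
For example, built on top of LLAVA-1.5-7B, TPO yields absolute performance improvements on hallucination benchmarks.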
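
The abstract does not give the exact formula for the visual-anchored reward, so the following is only a minimal sketch under stated assumptions: it reads the "difference of distributions" as the softmax probability of each generated token given the raw image minus its probability given the corrupted image. The function names, the list-of-lists logit layout, and the toy numbers are all hypothetical and are not taken from the paper; the self-calibration step and the training objective are not modeled here.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over the vocabulary at one position."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def visual_anchored_rewards(
    logits_raw: Sequence[Sequence[float]],
    logits_corrupted: Sequence[Sequence[float]],
    token_ids: Sequence[int],
) -> List[float]:
    """Assumed reward: p(token | raw image) - p(token | corrupted image).

    Each logits argument holds one row of vocabulary logits per generated
    position; token_ids holds the token actually generated at each position.
    """
    if not (len(logits_raw) == len(logits_corrupted) == len(token_ids)):
        raise ValueError("all inputs must cover the same number of positions")
    rewards = []
    for raw, corrupted, tok in zip(logits_raw, logits_corrupted, token_ids):
        p_raw = softmax(raw)[tok]
        p_corrupted = softmax(corrupted)[tok]
        rewards.append(p_raw - p_corrupted)
    return rewards


# Toy data (hypothetical): 2 generated positions, vocabulary of 3 tokens.
logits_raw = [[2.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
logits_corrupted = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
token_ids = [0, 2]

rewards = visual_anchored_rewards(logits_raw, logits_corrupted, token_ids)
for position, reward in enumerate(rewards):
    print(f"position {position}: reward {reward:.4f}")
```

In this toy case the first token becomes much more likely when the raw image is available, so it receives a clearly positive reward, while the second token is equally likely under both conditions and receives a reward of zero. Under the assumption above, that is the sense in which a token would count as visually anchored.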

Abstract (original)

Direct Preference Optimization (DPO) has been demonstrated to be highly effective in mitigating hallucinations in Large Vision Language Models (LVLMs) by aligning their outputs more closely with human preferences. Despite the recent progress, existing methods suffer from two drawbacks: 1) Lack of scalable token-level rewards; and 2) Neglect of visual-anchored tokens. To this end, we propose a novel Token Preference Optimization model with self-calibrated rewards (dubbed as TPO), which adaptively attends to visual-correlated tokens without fine-grained annotations. Specifically, we introduce a token-level \emph{visual-anchored} \emph{reward} as the difference of the logistic distributions of generated tokens conditioned on the raw image and the corrupted one. In addition, to highlight the informative visual-anchored tokens, a visual-aware training objective is proposed to enhance more accurate token-level optimization. Extensive experimental results have manifested the state-of-the-art performance of the proposed TPO. For example, by building on top of LLAVA-1.5-7B, our TPO boosts the performance absolute improvement for hallucination benchmarks.

arXiv information

Authors	Jihao Gu, Yingyao Wang, Meng Cao, Pi Bu, Jun Song, Yancheng He, Shilong Li, Bo Zheng
Published	2025-01-02 07:39:20+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
