Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward

Summary

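Preference modeling methods such as direct preference optimization (DPO) have been shown to be effective at strengthening the generalization ability of large language models (LLMs).
In video instruction-following tasks, however, providing informative feedback remains a major challenge, particularly for detecting hallucinations in generated responses.
Earlier work has explored using large multimodal models (LMMs) as reward models to guide preference modeling, but it has not been conclusively established that LMMs can accurately assess the factuality of a generated response against the corresponding video.
This paper introduces a new framework that uses detailed video captions as a proxy for the video content, so that a language model can take this information as supporting evidence when scoring video question answering (QA) predictions.
The approach shows robust alignment with the reward mechanism of the OpenAI GPT-4V model, which takes video frames directly as input.
The authors further show that applying this tailored reward through DPO significantly improves the performance of video LMMs on video QA tasks.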
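
To make the last two steps concrete, here is a minimal sketch of how scored responses could be turned into a preference pair and fed to the standard DPO objective. It is not the authors' implementation: the helper names `build_pair` and `dpo_loss`, the `min_gap` threshold, the `beta` value, and the numeric scores are all assumptions for illustration. The scores stand in for rewards that a language model would assign after reading the detailed caption, and the log-probabilities stand in for values computed by a video LMM and a frozen reference copy.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def build_pair(scored_responses, min_gap=1.0):
    """Pick (chosen, rejected) from [(response, score), ...].

    Hypothetical helper: scores are assumed to come from a language model
    that judged each answer against a detailed video caption.
    Returns None when the best and worst scores are too close to be useful.
    """
    ranked = sorted(scored_responses, key=lambda item: item[1], reverse=True)
    (best, best_score), (worst, worst_score) = ranked[0], ranked[-1]
    if best_score - worst_score < min_gap:
        return None
    return best, worst


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for a single preference pair.

    Inputs are sequence log-probabilities of the chosen and rejected
    responses under the trained policy and the frozen reference model.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) equals softplus(-logits)
    return softplus(-logits)


if __name__ == "__main__":
    # Illustrative scores only; not taken from the paper.
    pair = build_pair([("answer A", 5.0), ("answer B", 2.0), ("answer C", 3.0)])
    print(pair)
    print(dpo_loss(-1.0, -5.0, -2.0, -2.0))
```

When the policy and the reference assign identical log-probabilities, the loss equals log 2, and it falls as the policy shifts probability toward the chosen response relative to the reference.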

Abstract (original)

Preference modeling techniques, such as direct preference optimization (DPO), have been shown to be effective in enhancing the generalization abilities of large language models (LLMs). However, in tasks involving video instruction-following, providing informative feedback, especially for detecting hallucinations in generated responses, remains a significant challenge. Previous studies have explored using large multimodal models (LMMs) as reward models to guide preference modeling, but their ability to accurately assess the factuality of generated responses compared to corresponding videos has not been conclusively established. This paper introduces a novel framework that utilizes detailed video captions as a proxy for video content, enabling language models to incorporate this information as supporting evidence for scoring video Question Answering (QA) predictions. Our approach demonstrates robust alignment with the OpenAI GPT-4V model’s reward mechanism, which directly takes video frames as input. Furthermore, we show that applying this tailored reward through DPO significantly improves the performance of video LMMs on video QA tasks.

arXiv information

Authors	Ruohong Zhang, Liangke Gui, Zhiqing Sun, Yihao Feng, Keyang Xu, Yuanhan Zhang, Di Fu, Chunyuan Li, Alexander Hauptmann, Yonatan Bisk, Yiming Yang
Published	2024-04-02 12:47:49+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
