VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models

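Abstract

Large Video Models (LVMs) built on Large Language Models (LLMs) show promise in video understanding, but they are often misaligned with human intuition and prone to video hallucination.
To address these problems, we introduce VistaDPO, a new framework for Video Hierarchical Spatial-Temporal Direct Preference Optimization.
VistaDPO strengthens text-video preference alignment at three hierarchical levels: i) the instance level, which aligns overall video content with responses;
ii) the temporal level, which aligns the temporal semantics of the video with event descriptions;
and iii) the perceptive level, which aligns spatial objects with language tokens.
Because no dataset exists for fine-grained video-language preference alignment, we construct VistaDPO-7k, a dataset of 7.2K QA pairs annotated with chosen and rejected responses and with spatial-temporal grounding information such as timestamps, keyframes, and bounding boxes.
Extensive experiments on benchmarks covering video hallucination, video QA, and captioning show that VistaDPO significantly improves the performance of existing LVMs and effectively mitigates video-language misalignment and hallucination.
The code and data are available at https://github.com/HaroldChen19/VistaDPO.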
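
As a rough illustration of the kind of record the dataset description implies, the sketch below models one preference-annotated QA pair with its grounding information. The class name, field names, and units are assumptions made for this example; they are not the released VistaDPO-7k schema, which the abstract does not specify.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical schema: every name below is illustrative only and is
# NOT taken from the released VistaDPO-7k data format.
BoundingBox = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


@dataclass
class PreferenceRecord:
    video_id: str
    question: str
    chosen: str      # preferred response
    rejected: str    # dispreferred (e.g. hallucinated) response
    # Temporal grounding: (start_seconds, end_seconds) spans.
    timestamps: List[Tuple[float, float]] = field(default_factory=list)
    # Indices of annotated keyframes.
    keyframes: List[int] = field(default_factory=list)
    # Spatial grounding: boxes on the annotated keyframes.
    boxes: List[BoundingBox] = field(default_factory=list)

    def validate(self) -> None:
        """Raise ValueError if the record is internally inconsistent."""
        if self.chosen == self.rejected:
            raise ValueError("chosen and rejected responses must differ")
        for start, end in self.timestamps:
            if not 0 <= start < end:
                raise ValueError(f"invalid time span: ({start}, {end})")
        for x0, y0, x1, y1 in self.boxes:
            if not (x0 < x1 and y0 < y1):
                raise ValueError(f"invalid box: ({x0}, {y0}, {x1}, {y1})")
```

Keeping the temporal and spatial annotations next to the chosen/rejected pair means a single record carries what each of the three alignment levels would need, though how the authors actually store this is not stated here.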

Abstract (original)

Large Video Models (LVMs) built upon Large Language Models (LLMs) have shown promise in video understanding but often suffer from misalignment with human intuition and video hallucination issues. To address these challenges, we introduce VistaDPO, a novel framework for Video Hierarchical Spatial-Temporal Direct Preference Optimization. VistaDPO enhances text-video preference alignment across three hierarchical levels: i) Instance Level, aligning overall video content with responses; ii) Temporal Level, aligning video temporal semantics with event descriptions; and iii) Perceptive Level, aligning spatial objects with language tokens. Given the lack of datasets for fine-grained video-language preference alignment, we construct VistaDPO-7k, a dataset of 7.2K QA pairs annotated with chosen and rejected responses, along with spatial-temporal grounding information such as timestamps, keyframes, and bounding boxes. Extensive experiments on benchmarks such as Video Hallucination, Video QA, and Captioning performance tasks demonstrate that VistaDPO significantly improves the performance of existing LVMs, effectively mitigating video-language misalignment and hallucination. The code and data are available at https://github.com/HaroldChen19/VistaDPO.
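
For readers unfamiliar with Direct Preference Optimization, the sketch below computes the standard DPO loss for a single chosen/rejected pair from sequence log-probabilities. It shows only the generic objective that the method's name refers to, not VistaDPO's hierarchical instance, temporal, and perceptive terms, which the abstract does not spell out. The function name, argument names, and the default `beta` are assumptions for this example.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed default; not a value from the paper
) -> float:
    """Standard DPO loss for one preference pair.

    Inputs are summed token log-probabilities of each response under
    the trained policy and under a frozen reference model.
    """
    # Implicit rewards: how much more the policy favors each response
    # than the reference model does.
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_reward - rejected_reward)

    # loss = -log(sigmoid(margin)) = log(1 + exp(-margin)),
    # computed in a numerically stable way for either sign of margin.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss falls as the policy raises the chosen response's log-probability relative to the rejected one, measured against the reference model; a policy identical to the reference gives a margin of zero and a loss of log 2.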

arXiv information

Authors	Haojian Huang, Haodong Chen, Shengqiong Wu, Meng Luo, Jinlan Fu, Xinya Du, Hanwang Zhang, Hao Fei
Published	2025-04-17 17:39:41+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
