QuoTA: Query-oriented Token Assignment via CoT Query Decouple for Long Video Comprehension

Summary

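Recent work on long video understanding typically reduces visual redundancy by pruning visual tokens according to the attention distribution.
However, existing methods prune low-response tokens post hoc in the decoder layers and overlook the input-level semantic correlation between visual tokens and the instruction (query).
This paper proposes QuoTA, an ante-hoc, training-free module that extends existing large video-language models (LVLMs) to assign visual tokens based on a query-oriented, frame-level importance assessment.
Query-oriented token selection matters because it aligns visual processing with task-specific requirements, making better use of the token budget while preserving semantically relevant content.
Specifically, (i) QuoTA allocates frame-level importance scores according to query relevance, enabling a one-time visual token assignment before cross-modal interaction in the decoder layers; (ii) the query is decoupled through Chain-of-Thoughts reasoning for more precise LVLM-based frame importance scoring; and (iii) QuoTA is plug-and-play and extends to existing LVLMs.
Extensive experiments show that implementing QuoTA with LLaVA-Video-7B yields an average performance improvement of 3.2% across six benchmarks (including Video-MME and MLVU) while operating within the same visual token budget as the baseline.
The code is open-sourced at https://github.com/MAC-AutoML/QuoTA.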
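
To make the idea of budgeted, score-driven token assignment concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each frame already has a non-negative query-relevance score (how QuoTA computes these scores is not shown here) and that tokens are split in proportion to the scores, with largest-remainder rounding so the total matches the budget exactly. The function name `assign_tokens` and the proportional rule are illustrative assumptions; the abstract does not specify the allocation formula.

```python
from typing import List, Sequence


def assign_tokens(scores: Sequence[float], budget: int) -> List[int]:
    """Split `budget` visual tokens across frames in proportion to `scores`.

    Hypothetical helper: proportional allocation with largest-remainder
    rounding is an assumption made for illustration only.
    """
    if budget < 0:
        raise ValueError("budget must be non-negative")
    if any(s < 0 for s in scores):
        raise ValueError("scores must be non-negative")

    n = len(scores)
    if n == 0:
        return []

    total = sum(scores)
    # Fall back to a uniform split when no frame has any relevance.
    weights = [s / total for s in scores] if total > 0 else [1.0 / n] * n

    ideal = [w * budget for w in weights]   # fractional token shares
    counts = [int(x) for x in ideal]        # floor (values are non-negative)
    leftover = max(0, budget - sum(counts))

    # Give the remaining tokens to the frames with the largest fractional parts.
    order = sorted(range(n), key=lambda i: ideal[i] - counts[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts


if __name__ == "__main__":
    # Three frames; the first is judged most relevant to the query.
    print(assign_tokens([3.0, 1.0, 0.0], 8))
```

Because the assignment happens once, before the decoder sees any visual tokens, the per-frame counts can be fixed up front and the overall budget stays identical to a uniform baseline.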

Summary (original)

Recent advances in long video understanding typically mitigate visual redundancy through visual token pruning based on attention distribution. However, while existing methods employ post-hoc low-response token pruning in decoder layers, they overlook the input-level semantic correlation between visual tokens and instructions (query). In this paper, we propose QuoTA, an ante-hoc training-free modular that extends existing large video-language models (LVLMs) for visual token assignment based on query-oriented frame-level importance assessment. The query-oriented token selection is crucial as it aligns visual processing with task-specific requirements, optimizing token budget utilization while preserving semantically relevant content. Specifically, (i) QuoTA strategically allocates frame-level importance scores based on query relevance, enabling one-time visual token assignment before cross-modal interactions in decoder layers, (ii) we decouple the query through Chain-of-Thoughts reasoning to facilitate more precise LVLM-based frame importance scoring, and (iii) QuoTA offers a plug-and-play functionality that extends to existing LVLMs. Extensive experimental results demonstrate that implementing QuoTA with LLaVA-Video-7B yields an average performance improvement of 3.2% across six benchmarks (including Video-MME and MLVU) while operating within an identical visual token budget as the baseline. Codes are open-sourced at https://github.com/MAC-AutoML/QuoTA.

arXiv information

Authors	Yongdong Luo, Wang Chen, Xiawu Zheng, Weizhong Huang, Shukang Yin, Haojia Lin, Chaoyou Fu, Jinfa Huang, Jiayi Ji, Jiebo Luo, Rongrong Ji
Published	2025-03-11 17:59:57+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
