Logic-in-Frames: Dynamic Keyframe Search via Visual Semantic-Logical Verification for Long Video Understanding

Summary

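Understanding long video content is a complex task that often relies on densely sampled frame captions or end-to-end feature selectors, but these techniques generally overlook the logical relationships between textual queries and visual elements.
In practice, computational constraints force coarse frame subsampling, a challenge akin to "finding a needle in a haystack." To address this problem, we introduce a semantics-driven search framework that reformulates keyframe selection under the paradigm of Visual Semantic-Logical Search.
Specifically, we systematically define four fundamental logical dependencies: 1) spatial co-occurrence, 2) temporal proximity, 3) attribute dependency, and 4) causal order.
These relations dynamically update the frame sampling distribution through an iterative refinement process, enabling context-aware identification of semantically important frames tailored to the requirements of a specific query.
Our method establishes new SOTA performance on keyframe selection metrics on a manually annotated benchmark.
Furthermore, when applied to downstream video question-answering tasks, the proposed approach shows the best performance gains over existing methods on LongVideoBench and Video-MME, validating its effectiveness in bridging the logical gap between textual queries and visual-temporal reasoning.
The code will be made publicly available.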
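
The iterative refinement described above can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`satisfied_frames`, `update_distribution`, `search`), the detection format (frame index mapped to `{label: set_of_attributes}`), the multiplicative `boost`, and the neighbor `spread` are all assumptions made for illustration. A real system would run an object detector on each sampled frame; here a lookup into a pre-annotated list stands in for that step.

```python
import random

def satisfied_frames(kind, a, b, seen, window=2):
    """Return inspected frames that support relation `kind` between a and b.

    `seen` maps frame index -> {label: set_of_attributes} (hypothetical format).
    """
    frames_a = [f for f in seen if a in seen[f]]
    if kind == "spatial":      # a and b are visible in the same frame
        return {f for f in frames_a if b in seen[f]}
    if kind == "attribute":    # object a carries attribute b
        return {f for f in frames_a if b in seen[f][a]}
    frames_b = [f for f in seen if b in seen[f]]
    if kind == "temporal":     # a and b appear within `window` frames
        pairs = [(i, j) for i in frames_a for j in frames_b if abs(i - j) <= window]
    elif kind == "causal":     # a is seen strictly before b
        pairs = [(i, j) for i in frames_a for j in frames_b if i < j]
    else:
        raise ValueError(f"unknown relation kind: {kind}")
    return {f for pair in pairs for f in pair}

def update_distribution(weights, hits, boost=3.0, spread=1):
    """Up-weight frames near verified hits, then renormalize to sum to 1."""
    n = len(weights)
    boosted = {g for f in hits
               for g in range(max(0, f - spread), min(n, f + spread + 1))}
    new = [w * boost if i in boosted else w for i, w in enumerate(weights)]
    total = sum(new)
    return [w / total for w in new]

def search(video, relation, n_iter=5, k=4, seed=0):
    """Alternate between sampling frames and re-weighting the distribution."""
    rng = random.Random(seed)
    n = len(video)
    weights = [1.0 / n] * n        # start from uniform sampling
    seen, hits = {}, set()
    kind, a, b = relation
    for _ in range(n_iter):
        for f in rng.choices(range(n), weights=weights, k=k):
            seen[f] = video[f]     # stand-in for running a detector on frame f
        hits = satisfied_frames(kind, a, b, seen)
        weights = update_distribution(weights, hits)
    return weights, sorted(hits)
```

Calling `search(video, ("causal", "person", "cup"))` on a list of per-frame annotations returns the final sampling weights and the frames that verified the relation. Multiplicative re-weighting is only one possible update rule; it was chosen here because it keeps the distribution valid after a single renormalization.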

Abstract (original)

Understanding long video content is a complex endeavor that often relies on densely sampled frame captions or end-to-end feature selectors, yet these techniques commonly overlook the logical relationships between textual queries and visual elements. In practice, computational constraints necessitate coarse frame subsampling, a challenge analogous to “finding a needle in a haystack.” To address this issue, we introduce a semantics-driven search framework that reformulates keyframe selection under the paradigm of Visual Semantic-Logical Search. Specifically, we systematically define four fundamental logical dependencies: 1) spatial co-occurrence, 2) temporal proximity, 3) attribute dependency, and 4) causal order. These relations dynamically update frame sampling distributions through an iterative refinement process, enabling context-aware identification of semantically critical frames tailored to specific query requirements. Our method establishes new SOTA performance on the manually annotated benchmark in key-frame selection metrics. Furthermore, when applied to downstream video question-answering tasks, the proposed approach demonstrates the best performance gains over existing methods on LongVideoBench and Video-MME, validating its effectiveness in bridging the logical gap between textual queries and visual-temporal reasoning. The code will be publicly available.

arXiv information

Authors	Weiyu Guo, Ziyang Chen, Shaoguang Wang, Jianxiang He, Yijie Xu, Jinhui Ye, Ying Sun, Hui Xiong
Published	2025-03-17 13:07:34+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
