Efficient Temporal Sentence Grounding in Videos with Multi-Teacher Knowledge Distillation

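Abstract

Temporal Sentence Grounding in Videos (TSGV) aims to detect the timestamps of the event described by a natural language query in an untrimmed video.
This paper addresses the challenge of making TSGV models computationally efficient while maintaining high performance.
Most existing approaches improve accuracy by designing elaborate architectures with extra layers and losses, which makes them inefficient and heavy.
Some works have noticed this problem, but they address only the feature fusion layers, so the rest of the bulky network gains little speed.
To tackle this problem, we propose a novel efficient multi-teacher model (EMTM) based on knowledge distillation, which transfers diverse knowledge from both heterogeneous and isomorphic networks.
Specifically, we first unify the different outputs of the heterogeneous models into a single form.
Next, a Knowledge Aggregation Unit (KAU) is built to obtain high-quality integrated soft labels from multiple teachers.
The KAU module then leverages multi-scale video and global query information to adaptively determine the weights of the different teachers.
A Shared Encoder strategy is then proposed to solve the problem that the student's shallow layers hardly benefit from the teachers: an isomorphic teacher is trained collaboratively with the student to align their hidden states.
Extensive experiments on three popular TSGV benchmarks demonstrate that our method is both effective and efficient without bells and whistles.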
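
The aggregation step described above can be illustrated with a small sketch. The abstract does not say how the KAU computes teacher weights or what the unified output form is, so everything concrete here is an assumption: each teacher is taken to emit logits over clip positions for one boundary (for example the start), the per-teacher scores are hypothetical stand-ins for whatever the KAU derives from video and query features, and the function names `aggregate_soft_labels` and `distillation_loss` are made up for illustration. It shows only the general pattern of a weighted multi-teacher soft label and a KL-divergence distillation loss, not the authors' implementation.

```python
import math


def softmax(scores, temperature=1.0):
    """Turn raw scores into a probability distribution."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_soft_labels(teacher_logits, teacher_scores, temperature=2.0):
    """Blend per-teacher distributions into one soft label.

    teacher_logits: one list of position logits per teacher (assumed format).
    teacher_scores: one raw score per teacher; a softmax turns them into
    weights. In the paper these weights come from the KAU; here they are
    supplied by hand.
    """
    weights = softmax(teacher_scores)
    teacher_probs = [softmax(logits, temperature) for logits in teacher_logits]
    length = len(teacher_probs[0])
    return [
        sum(w * probs[i] for w, probs in zip(weights, teacher_probs))
        for i in range(length)
    ]


def distillation_loss(student_logits, soft_label, temperature=2.0):
    """KL divergence from the aggregated soft label to the student output."""
    student_probs = softmax(student_logits, temperature)
    return sum(
        t * math.log(t / s)
        for t, s in zip(soft_label, student_probs)
        if t > 0
    )


# Toy example: two teachers scoring four clip positions as the start boundary.
teacher_logits = [[2.0, 0.5, 0.1, -1.0], [1.5, 1.0, 0.0, -0.5]]
teacher_scores = [0.3, -0.2]  # hypothetical weights before normalization
soft_label = aggregate_soft_labels(teacher_logits, teacher_scores)
loss = distillation_loss([1.0, 0.8, 0.2, -0.3], soft_label)
```

Because the weights pass through a softmax, the aggregated label remains a valid probability distribution, so it can be used directly as a distillation target for the student.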

Abstract (original)

Temporal Sentence Grounding in Videos (TSGV) aims to detect the event timestamps described by the natural language query from untrimmed videos. This paper discusses the challenge of achieving efficient computation in TSGV models while maintaining high performance. Most existing approaches exquisitely design complex architectures to improve accuracy with extra layers and losses, suffering from inefficiency and heaviness. Although some works have noticed that, they only make an issue of feature fusion layers, which can hardly enjoy the high-speed merit in the whole clunky network. To tackle this problem, we propose a novel efficient multi-teacher model (EMTM) based on knowledge distillation to transfer diverse knowledge from both heterogeneous and isomorphic networks. Specifically, we first unify different outputs of the heterogeneous models into one single form. Next, a Knowledge Aggregation Unit (KAU) is built to acquire high-quality integrated soft labels from multiple teachers. After that, the KAU module leverages the multi-scale video and global query information to adaptively determine the weights of different teachers. A Shared Encoder strategy is then proposed to solve the problem that the student shallow layers hardly benefit from teachers, in which an isomorphic teacher is collaboratively trained with the student to align their hidden states. Extensive experimental results on three popular TSGV benchmarks demonstrate that our method is both effective and efficient without bells and whistles.

arXiv information

Authors	Renjie Liang, Yiming Yang, Hui Lu, Li Li
Published	2023-08-07 17:07:48+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
