DemaFormer: Damped Exponential Moving Average Transformer with Energy-Based Modeling for Temporal Language Grounding

Summary

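Temporal Language Grounding aims to localize the video moments that semantically correspond to a natural language query.
Recent advances employ the attention mechanism to learn the relations between video moments and the text query.
However, naive attention might not capture such relations appropriately, resulting in ineffective distributions in which the target video moments are difficult to separate from the remaining ones.
To resolve this issue, we propose an energy-based model framework that explicitly learns moment-query distributions.
In addition, we propose DemaFormer, a novel Transformer-based architecture that uses an exponential moving average with a learnable damping factor to encode moment-query inputs effectively.
Comprehensive experiments on four public temporal language grounding datasets demonstrate the superiority of our methods over state-of-the-art baselines.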

Summary (original)

Temporal Language Grounding seeks to localize video moments that semantically correspond to a natural language query. Recent advances employ the attention mechanism to learn the relations between video moments and the text query. However, naive attention might not be able to appropriately capture such relations, resulting in ineffective distributions where target video moments are difficult to separate from the remaining ones. To resolve the issue, we propose an energy-based model framework to explicitly learn moment-query distributions. Moreover, we propose DemaFormer, a novel Transformer-based architecture that utilizes exponential moving average with a learnable damping factor to effectively encode moment-query inputs. Comprehensive experiments on four public temporal language grounding datasets showcase the superiority of our methods over the state-of-the-art baselines.

arXiv information

Authors	Thong Nguyen, Xiaobao Wu, Xinshuai Dong, Cong-Duy Nguyen, See-Kiong Ng, Luu Anh Tuan
Published	2023-12-05 07:37:21+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
