AdaBrowse: Adaptive Video Browser for Efficient Continuous Sign Language Recognition

Summary

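Raw videos are known to contain considerable redundancy: in many cases, a fraction of the frames is already enough for accurate recognition.
This paper asks whether that redundancy can be exploited to make inference in continuous sign language recognition (CSLR) more efficient.
We propose AdaBrowse, a novel adaptive model that dynamically selects the most informative subsequence of an input video by casting the problem as a sequential decision task.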
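Specifically, a lightweight network first scans the input video quickly to extract coarse features.
These features are then fed into a policy network, which selects the subsequence to process.
The selected subsequence is finally passed to a regular CSLR model for sentence prediction.
Because only a portion of the frames is processed, total computation is reduced considerably.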
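Beyond temporal redundancy, we also ask whether the inherent spatial redundancy can be integrated seamlessly for further efficiency, i.e., by dynamically selecting the lowest input resolution for each sample; the resulting model is called AdaBrowse+.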
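Extensive experiments on four large-scale CSLR datasets (PHOENIX14, PHOENIX14-T, CSL-Daily, and CSL) demonstrate the effectiveness of AdaBrowse and AdaBrowse+, which achieve accuracy comparable to state-of-the-art methods with 1.44$\times$ throughput and 2.12$\times$ fewer FLOPs.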
Comparisons with other commonly used 2D CNNs and adaptive efficient methods further verify the effectiveness of AdaBrowse.
Code is available at \url{https://github.com/hulianyuyy/AdaBrowse}.
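
The scan, select, and recognize flow described in the summary can be illustrated with a toy sketch. This is not the authors' implementation: the frame-difference feature, the threshold-based stride policy, and the names `coarse_features`, `select_stride`, and `browse` are hypothetical stand-ins for the lightweight network, the learned policy network, and the full pipeline, and the thresholds are arbitrary.

```python
from typing import Callable, List, Sequence, Tuple

Frame = List[float]  # a frame flattened to pixel intensities (assumption)


def coarse_features(video: Sequence[Frame]) -> List[float]:
    """Cheap per-frame feature: mean absolute difference to the previous frame.

    Stand-in for the lightweight scanning network; the first frame gets 0.0.
    """
    if not video:
        return []
    feats = [0.0]
    for prev, cur in zip(video, video[1:]):
        feats.append(sum(abs(a - b) for a, b in zip(prev, cur)) / len(cur))
    return feats


def select_stride(feats: Sequence[float]) -> int:
    """Toy policy: the less motion in the video, the more frames are skipped.

    Stand-in for the learned policy network; thresholds are arbitrary.
    """
    if not feats:
        return 1
    mean_motion = sum(feats) / len(feats)
    if mean_motion < 0.05:
        return 4  # nearly static video: keep every 4th frame
    if mean_motion < 0.2:
        return 2  # moderate motion: keep every 2nd frame
    return 1      # high motion: keep all frames


def browse(video: Sequence[Frame],
           heavy_model: Callable[[Sequence[Frame]], object]) -> Tuple[object, int]:
    """Scan cheaply, pick a subsequence, and run the expensive model only on it."""
    stride = select_stride(coarse_features(video))
    subsequence = list(video[::stride])
    return heavy_model(subsequence), len(subsequence)


if __name__ == "__main__":
    static_video = [[0.5, 0.5] for _ in range(8)]
    # The lambda is a placeholder for a real CSLR model.
    _, frames_used = browse(static_video, lambda frames: "prediction")
    print(frames_used)
```

In this sketch the expensive model only sees the frames the policy keeps, which is where the savings come from. A resolution choice in the spirit of AdaBrowse+ could be added as a second decision made from the same coarse features, although the summary does not specify how the authors implement it.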

Summary (original)

Raw videos have been proven to own considerable feature redundancy where in many cases only a portion of frames can already meet the requirements for accurate recognition. In this paper, we are interested in whether such redundancy can be effectively leveraged to facilitate efficient inference in continuous sign language recognition (CSLR). We propose a novel adaptive model (AdaBrowse) to dynamically select a most informative subsequence from input video sequences by modelling this problem as a sequential decision task. In specific, we first utilize a lightweight network to quickly scan input videos to extract coarse features. Then these features are fed into a policy network to intelligently select a subsequence to process. The corresponding subsequence is finally inferred by a normal CSLR model for sentence prediction. As only a portion of frames are processed in this procedure, the total computations can be considerably saved. Besides temporal redundancy, we are also interested in whether the inherent spatial redundancy can be seamlessly integrated together to achieve further efficiency, i.e., dynamically selecting a lowest input resolution for each sample, whose model is referred to as AdaBrowse+. Extensive experimental results on four large-scale CSLR datasets, i.e., PHOENIX14, PHOENIX14-T, CSL-Daily and CSL, demonstrate the effectiveness of AdaBrowse and AdaBrowse+ by achieving comparable accuracy with state-of-the-art methods with 1.44$\times$ throughput and 2.12$\times$ fewer FLOPs. Comparisons with other commonly-used 2D CNNs and adaptive efficient methods verify the effectiveness of AdaBrowse. Code is available at \url{https://github.com/hulianyuyy/AdaBrowse}.

arXiv information

Authors	Lianyu Hu, Liqing Gao, Zekang Liu, Chi-Man Pun, Wei Feng
Published	2023-08-16 12:40:47+00:00

