Class-attention Video Transformer for Engagement Intensity Prediction

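Abstract

To handle long videos of varying length, prior work extracted multimodal features and fused them to predict students' engagement intensity.
This paper presents Class Attention in Video Transformer (CavT), a new end-to-end method that uses a single vector to process the class embedding and performs end-to-end learning uniformly on variable-length long videos and fixed-length short videos.
In addition, to address the shortage of training samples, we propose a binary-order representatives sampling method (BorS), which augments the training set by adding multiple video sequences for each video.
BorS+CavT achieves state-of-the-art MSE on both the EmotiW-EP dataset (0.0495) and the DAiSEE dataset (0.0377).
The code and models are publicly available at https://github.com/mountainai/cavt.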

Abstract (original)

In order to deal with variant-length long videos, prior works extract multi-modal features and fuse them to predict students’ engagement intensity. In this paper, we present a new end-to-end method Class Attention in Video Transformer (CavT), which involves a single vector to process class embedding and to uniformly perform end-to-end learning on variant-length long videos and fixed-length short videos. Furthermore, to address the lack of sufficient samples, we propose a binary-order representatives sampling method (BorS) to add multiple video sequences of each video to augment the training set. BorS+CavT not only achieves the state-of-the-art MSE (0.0495) on the EmotiW-EP dataset, but also obtains the state-of-the-art MSE (0.0377) on the DAiSEE dataset. The code and models have been made publicly available at https://github.com/mountainai/cavt.
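Both reported results are mean squared errors (MSE) between predicted and annotated engagement intensity. As a minimal sketch of how that metric is computed, the following uses hypothetical scores in [0, 1]; the function name and the sample values are illustrative assumptions and are not taken from the paper, its code, or either dataset.

```python
def mean_squared_error(predictions, targets):
    """Return the average squared difference between paired values."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not predictions:
        raise ValueError("at least one sample is required")
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical engagement scores in [0, 1]; not taken from either dataset.
predicted = [0.5, 0.75, 0.25, 1.0]
actual = [0.5, 0.5, 0.25, 0.75]
mse = mean_squared_error(predicted, actual)
print(mse)
```

A lower value means the predicted intensities lie closer to the annotations, which is why the abstract reports the smaller figure as the better result.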

arXiv information

Authors	Xusheng Ai, Victor S. Sheng, Chunhua Li, Zhiming Cui
Published	2022-11-10 14:17:41+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
