Video Captioning with Guidance of Multimodal Latent Topics

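Summary

The topical diversity of open-domain videos leads to varied vocabulary and linguistic expression when describing video content, which makes the video captioning task even more difficult.
This paper proposes M&M TGM, a unified captioning framework that mines multimodal topics from data in an unsupervised manner and uses these topics to guide the caption decoder.
Compared with predefined topics, the mined multimodal topics are more coherent both semantically and visually, and they better reflect the topic distribution of the videos.
Topic-aware caption generation is formulated as a multi-task learning problem in which a parallel task, topic prediction, is added alongside the captioning task.
For the topic prediction task, the mined topics serve as the teacher for training a student topic prediction model, which learns to predict latent topics from the multimodal content of a video.
Topic prediction provides intermediate supervision for the learning process.
For the captioning task, the paper proposes a novel topic-aware decoder that uses guidance from the latent topics to generate more accurate and detailed video descriptions.
The entire learning procedure is end-to-end and optimizes both tasks simultaneously.
Results from extensive experiments on the MSR-VTT and Youtube2Text datasets demonstrate the effectiveness of the proposed model.
M&M TGM not only outperforms prior state-of-the-art methods on multiple evaluation metrics and on both benchmark datasets, but also achieves better generalization ability.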

Abstract (original)

The topic diversity of open-domain videos leads to various vocabularies and linguistic expressions in describing video contents, and therefore makes the video captioning task even more challenging. In this paper, we propose a unified caption framework, M&M TGM, which mines multimodal topics from data in an unsupervised fashion and guides the caption decoder with these topics. Compared to pre-defined topics, the mined multimodal topics are more semantically and visually coherent and can better reflect the topic distribution of videos. We formulate topic-aware caption generation as a multi-task learning problem, in which we add a parallel task, topic prediction, in addition to the caption task. For the topic prediction task, we use the mined topics as the teacher to train a student topic prediction model, which learns to predict the latent topics from the multimodal contents of videos. The topic prediction provides intermediate supervision to the learning process. As for the caption task, we propose a novel topic-aware decoder to generate more accurate and detailed video descriptions with guidance from the latent topics. The entire learning procedure is end-to-end and optimizes both tasks simultaneously. The results from extensive experiments conducted on the MSR-VTT and Youtube2Text datasets demonstrate the effectiveness of our proposed model. M&M TGM not only outperforms prior state-of-the-art methods on multiple evaluation metrics and on both benchmark datasets, but also achieves better generalization ability.
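
The multi-task formulation described in the abstract can be illustrated with a minimal sketch. The abstract does not give the loss functions, so everything here is an assumption made for illustration: the caption loss is taken to be the mean negative log-likelihood of the reference words, the topic loss is taken to be the cross-entropy between the teacher's mined topic distribution and the student's prediction, and the two are combined with a hypothetical weight `topic_weight`. The function names are also hypothetical and are not from the paper.

```python
import math


def softmax(logits):
    """Turn raw scores into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def caption_nll(step_word_probs, target_ids, eps=1e-12):
    """Mean negative log-likelihood of the reference caption.

    step_word_probs[t] is the decoder's word distribution at step t;
    target_ids[t] is the index of the reference word at step t.
    """
    log_likelihood = sum(
        math.log(probs[word] + eps)
        for probs, word in zip(step_word_probs, target_ids)
    )
    return -log_likelihood / len(target_ids)


def topic_cross_entropy(teacher_topics, student_topics, eps=1e-12):
    """Cross-entropy between the teacher (mined) and student (predicted) topics."""
    return -sum(
        t * math.log(s + eps) for t, s in zip(teacher_topics, student_topics)
    )


def multitask_loss(step_word_probs, target_ids, teacher_topics,
                   student_logits, topic_weight=0.1):
    """Joint objective: caption loss plus a weighted topic-prediction loss.

    topic_weight is a hypothetical hyperparameter, not a value from the paper.
    """
    student_topics = softmax(student_logits)
    l_cap = caption_nll(step_word_probs, target_ids)
    l_topic = topic_cross_entropy(teacher_topics, student_topics)
    return l_cap + topic_weight * l_topic, l_cap, l_topic


# Toy example: a 2-word caption over a 2-word vocabulary, and 2 topics.
total, l_cap, l_topic = multitask_loss(
    step_word_probs=[[0.5, 0.5], [0.25, 0.75]],
    target_ids=[0, 1],
    teacher_topics=[1.0, 0.0],
    student_logits=[0.0, 0.0],
)
print(total, l_cap, l_topic)
```

In an end-to-end setup, a single scalar like `total` would be minimized so that gradients from both tasks reach the shared video encoder, which is one way the topic prediction could act as intermediate supervision.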

arXiv information

Authors	Shizhe Chen, Jia Chen, Qin Jin, Alexander Hauptmann
Published	2023-02-14 17:11:54+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
