Event and Entity Extraction from Generated Video Captions

Abstract

Annotation of multimedia data by humans is time-consuming and costly, while reliable automatic generation of semantic metadata is a major challenge. We propose a framework to extract semantic metadata from automatically generated video captions. As metadata, we consider entities, the entities’ properties, relations between entities, and the video category. We employ two state-of-the-art dense video captioning models with masked transformer (MT) and parallel decoding (PVDC) to generate captions for videos of the ActivityNet Captions dataset. Our experiments show that it is possible to extract entities, their properties, relations between entities, and the video category from the generated captions. We observe that the quality of the extracted information is mainly influenced by the quality of the event localization in the video as well as the performance of the event caption generation.

arXiv information

Authors	Johannes Scherer, Ansgar Scherp, Deepayan Bhowmik
Published	2023-09-13 14:49:23+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
