Towards Explainable AI: Multi-Modal Transformer for Video-based Image Description Generation

Summary

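Understanding and analyzing actions in video is essential for producing insightful, contextualized descriptions, especially in video-based applications such as intelligent surveillance and autonomous systems.
The proposed work introduces a novel framework that combines textual and visual modalities to generate natural language descriptions from video datasets.
The proposed architecture uses ResNet50 to extract visual features from video frames taken from the Microsoft Research Video Description Corpus (MSVD) and Berkeley DeepDrive eXplanation (BDD-X) datasets.
The extracted visual features are converted into patch embeddings and passed through an encoder-decoder model based on Generative Pre-trained Transformer-2 (GPT-2).
To align textual and visual representations and ensure high-quality description generation, the system uses multi-head self-attention and cross-attention.
The model's effectiveness is demonstrated by performance evaluation using BLEU (1-4), CIDEr, METEOR, and ROUGE-L.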
The proposed framework outperforms conventional methods, with BLEU-4 scores of 0.755 (BDD-X) and 0.778 (MSVD), CIDEr scores of 1.235 (BDD-X) and 1.315 (MSVD), METEOR scores of 0.312 (BDD-X) and 0.329 (MSVD), and ROUGE-L scores of 0.782 (BDD-X) and 0.795 (MSVD).
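By generating human-like, contextually relevant descriptions, strengthening interpretability, and improving real-world applications, this research advances explainable AI.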
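
The cross-attention step mentioned in the summary can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single attention head, omits the learned query/key/value projections and the multi-head split, and uses made-up toy vectors. The names `cross_attention`, `text_queries`, `patch_keys`, and `patch_values` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention (illustrative only).

    queries: one vector per text token (decoder side)
    keys, values: one vector per visual patch embedding (encoder side)
    Returns (outputs, weights): one attended vector and one weight row per query.
    """
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this text token to every visual patch, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        # Weighted average of the patch value vectors.
        out = [sum(wj * v[i] for wj, v in zip(w, values)) for i in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Toy example: 2 text-token queries attending over 3 visual patches (made-up numbers).
text_queries = [[1.0, 0.0], [0.0, 1.0]]
patch_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
patch_values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
attended, attn = cross_attention(text_queries, patch_keys, patch_values)
```

Each row of `attn` is a probability distribution over the visual patches, which is what makes this kind of layer a natural place to inspect which image regions influenced a generated word.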

Summary (original)

Understanding and analyzing video actions are essential for producing insightful and contextualized descriptions, especially for video-based applications like intelligent monitoring and autonomous systems. The proposed work introduces a novel framework for generating natural language descriptions from video datasets by combining textual and visual modalities. The suggested architecture makes use of ResNet50 to extract visual features from video frames that are taken from the Microsoft Research Video Description Corpus (MSVD), and Berkeley DeepDrive eXplanation (BDD-X) datasets. The extracted visual characteristics are converted into patch embeddings and then run through an encoder-decoder model based on Generative Pre-trained Transformer-2 (GPT-2). In order to align textual and visual representations and guarantee high-quality description production, the system uses multi-head self-attention and cross-attention techniques. The model’s efficacy is demonstrated by performance evaluation using BLEU (1-4), CIDEr, METEOR, and ROUGE-L. The suggested framework outperforms traditional methods with BLEU-4 scores of 0.755 (BDD-X) and 0.778 (MSVD), CIDEr scores of 1.235 (BDD-X) and 1.315 (MSVD), METEOR scores of 0.312 (BDD-X) and 0.329 (MSVD), and ROUGE-L scores of 0.782 (BDD-X) and 0.795 (MSVD). By producing human-like, contextually relevant descriptions, strengthening interpretability, and improving real-world applications, this research advances explainable AI.
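
For readers unfamiliar with the BLEU-4 figures quoted above, the following is a simplified sketch of how a sentence-level BLEU score is computed. It assumes a single reference, whitespace tokenization, uniform n-gram weights, and no smoothing; published results are normally computed at corpus level with standard toolkits, so this is not the authors' evaluation script. The function names `ngrams` and `sentence_bleu` are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, reference, max_n=4):
    """Unsmoothed single-reference BLEU with uniform n-gram weights."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # without smoothing, any zero precision gives BLEU = 0
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the n-gram precisions, scaled by the brevity penalty.
    return bp * math.exp(sum(log_precisions) / max_n)


score = sentence_bleu("a man is riding a bike", "a man is riding a horse")
```

In this toy pair the four modified precisions are 5/6, 4/5, 3/4, and 2/3, so the score is the fourth root of their product.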

arXiv information

Authors	Lakshita Agarwal, Bindu Verma
Published	2025-04-23 15:03:37+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
