Vision Transformer Based Model for Describing a Set of Images as a Story

Abstract

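Visual storytelling is the process of forming a multi-sentence story from a set of images.
Appropriately incorporating the visual variation and contextual information captured in the input images is one of its most challenging aspects.
As a result, stories generated from a set of images often lack cohesiveness, relevance, and semantic relationships.
This paper proposes a novel Vision Transformer based model for describing a set of images as a story.
The proposed method uses a Vision Transformer (ViT) to extract distinct features from the input images.
First, each input image is divided into 16×16 patches, which are flattened and passed through a linear projection.
Converting a single image into multiple image patches captures the visual variety of the input visual patterns.
These features are fed into a bidirectional LSTM that forms part of the sequence encoder.
This captures the past and future image context of all image patches.
Next, an attention mechanism is applied to increase the discriminatory capacity of the data fed into the language model, a Mogrifier-LSTM.
The proposed model is evaluated on the Visual Story-Telling dataset (VIST), and the results show that it outperforms current state-of-the-art models.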
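
The patch step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' code: the 224×224 RGB input size, the 768-dimensional embedding, the random projection weights, and the helper names `patchify` and `embed_patches` are all assumptions chosen for illustration. Only the 16×16 patch size and the "flatten, then linear projection" order come from the abstract.

```python
import numpy as np


def patchify(image, patch=16):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    gh, gw = h // patch, w // patch
    # (gh, patch, gw, patch, C) -> (gh, gw, patch, patch, C)
    x = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    # One row per patch, each row a flattened patch of patch*patch*C values.
    return x.reshape(gh * gw, patch * patch * c)


def embed_patches(image, weight, bias, patch=16):
    """Apply a linear projection to every flattened patch."""
    return patchify(image, patch) @ weight + bias


# Assumed sizes (hypothetical, not stated in the abstract).
rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))
weight = rng.normal(scale=0.02, size=(16 * 16 * 3, 768))
bias = np.zeros(768)

tokens = embed_patches(image, weight, bias)
print(tokens.shape)  # (196, 768): 14 x 14 patches, one 768-d vector each
```

In a real ViT the projection weights are learned and a position embedding is added to each patch vector; both are omitted here to keep the sketch short.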

Abstract (original)

Visual Story-Telling is the process of forming a multi-sentence story from a set of images. Appropriately including visual variation and contextual information captured inside the input images is one of the most challenging aspects of visual storytelling. Consequently, stories developed from a set of images often lack cohesiveness, relevance, and semantic relationship. In this paper, we propose a novel Vision Transformer Based Model for describing a set of images as a story. The proposed method extracts the distinct features of the input images using a Vision Transformer (ViT). Firstly, input images are divided into 16X16 patches and bundled into a linear projection of flattened patches. The transformation from a single image to multiple image patches captures the visual variety of the input visual patterns. These features are used as input to a Bidirectional-LSTM which is part of the sequence encoder. This captures the past and future image context of all image patches. Then, an attention mechanism is implemented and used to increase the discriminatory capacity of the data fed into the language model, i.e. a Mogrifier-LSTM. The performance of our proposed model is evaluated using the Visual Story-Telling dataset (VIST), and the results show that our model outperforms the current state of the art models.
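
The abstract names a Mogrifier-LSTM as the language model but does not describe it. As a rough sketch, assuming the standard Mogrifier formulation in which the input and the previous hidden state gate each other for a few alternating rounds before the usual LSTM update, the gating step might look as follows. The dimensions, the number of rounds, the random weights, and the function name `mogrify` are hypothetical and are not taken from this paper.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def mogrify(x, h, q_mats, r_mats):
    """Alternately rescale x by a gate computed from h, and h by a gate from x.

    q_mats[k] has shape (d_x, d_h); r_mats[k] has shape (d_h, d_x).
    Odd rounds update x, even rounds update h.
    """
    assert len(q_mats) - len(r_mats) in (0, 1), "rounds must alternate Q, R, Q, ..."
    rounds = len(q_mats) + len(r_mats)
    for i in range(1, rounds + 1):
        if i % 2 == 1:
            x = 2.0 * sigmoid(q_mats[i // 2] @ h) * x
        else:
            h = 2.0 * sigmoid(r_mats[i // 2 - 1] @ x) * h
    return x, h


# Assumed sizes (hypothetical): 8-d input, 6-d hidden state, 5 rounds.
rng = np.random.default_rng(0)
d_x, d_h = 8, 6
q_mats = [rng.normal(scale=0.1, size=(d_x, d_h)) for _ in range(3)]
r_mats = [rng.normal(scale=0.1, size=(d_h, d_x)) for _ in range(2)]

x_new, h_new = mogrify(rng.normal(size=d_x), rng.normal(size=d_h), q_mats, r_mats)
print(x_new.shape, h_new.shape)
```

The factor of 2 keeps the gates centred on 1, so with all-zero weight matrices the inputs pass through unchanged. The gated `x_new` and `h_new` would then be passed to an ordinary LSTM cell.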

arXiv information

Authors	Zainy M. Malakan, Ghulam Mubashar Hassan, Ajmal Mian
Published	2023-07-14 08:42:54+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
