An Image captioning algorithm based on the Hybrid Deep Learning Technique (CNN+GRU)

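Summary

Image captioning with the encoder-decoder framework, in which a CNN typically serves as the encoder and an LSTM as the decoder, has advanced tremendously over the past decade. Although it achieves impressive accuracy on simple images, it is inefficient in terms of time and space complexity. Moreover, for complex images containing a lot of information and many objects, the performance of the CNN-LSTM pair degrades exponentially because it lacks a semantic understanding of the scenes in the images. To address these issues, we propose a CNN-GRU encoder-decoder framework with a caption-to-image reconstructor that accounts for semantic context as well as time complexity. Using the hidden states of the decoder, the input image and its similar semantic representations are reconstructed, and reconstruction scores from the semantic reconstructor are used together with the likelihood during model training to assess the quality of the generated caption. As a result, the decoder receives improved semantic information, which enhances the caption generation process. At test time, the reconstruction score and the log-likelihood can also be combined to select the most appropriate caption. The proposed model outperforms the state-of-the-art LSTM-A5 model for image captioning in terms of time complexity and accuracy.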
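
The test-time selection step described in the summary, combining a reconstruction score with the log-likelihood to pick a caption, can be illustrated with a small sketch. The abstract does not say how the two quantities are combined, so the linear mix and the `weight` hyperparameter below are assumptions, as are the `Candidate` type, the function names, and the example numbers.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Candidate:
    """One candidate caption with its two scores (hypothetical container)."""
    caption: str
    log_likelihood: float        # sum of token log-probabilities from the decoder
    reconstruction_score: float  # higher means the caption reconstructs the image better


def combined_score(c: Candidate, weight: float = 0.5) -> float:
    # Assumed linear combination; the source does not specify the formula.
    return c.log_likelihood + weight * c.reconstruction_score


def select_caption(candidates: Sequence[Candidate], weight: float = 0.5) -> Candidate:
    """Return the candidate with the highest combined score."""
    if not candidates:
        raise ValueError("no candidate captions to choose from")
    return max(candidates, key=lambda c: combined_score(c, weight))


# Made-up scores for illustration only.
candidates = [
    Candidate("a dog on grass", log_likelihood=-4.0, reconstruction_score=-3.0),
    Candidate("a brown dog chasing a ball on grass",
              log_likelihood=-4.5, reconstruction_score=-1.0),
]
best = select_caption(candidates)
print(best.caption)
```

With `weight=0.0` the selection falls back to plain likelihood ranking, which makes it easy to see how much the reconstruction term changes the chosen caption.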

Summary (original)

Image captioning by the encoder-decoder framework has shown tremendous advancement in the last decade where CNN is mainly used as encoder and LSTM is used as a decoder. Despite such an impressive achievement in terms of accuracy in simple images, it lacks in terms of time complexity and space complexity efficiency. In addition to this, in case of complex images with a lot of information and objects, the performance of this CNN-LSTM pair downgraded exponentially due to the lack of semantic understanding of the scenes presented in the images. Thus, to take these issues into consideration, we present CNN-GRU encoder decode framework for caption-to-image reconstructor to handle the semantic context into consideration as well as the time complexity. By taking the hidden states of the decoder into consideration, the input image and its similar semantic representations is reconstructed and reconstruction scores from a semantic reconstructor are used in conjunction with likelihood during model training to assess the quality of the generated caption. As a result, the decoder receives improved semantic information, enhancing the caption production process. During model testing, combining the reconstruction score and the log-likelihood is also feasible to choose the most appropriate caption. The suggested model outperforms the state-of-the-art LSTM-A5 model for picture captioning in terms of time complexity and accuracy.
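
One reason a GRU decoder can be cheaper than an LSTM decoder is that it has three gated blocks per layer instead of four. The sketch below counts parameters for a single recurrent layer in the textbook formulation (one bias vector per block). The layer sizes are assumptions chosen for illustration, and the calculation says nothing about the measurements reported in the paper.

```python
def lstm_params(input_size: int, hidden_size: int) -> int:
    # Four blocks (input, forget and output gates plus the cell candidate),
    # each with input weights, recurrent weights, and one bias vector.
    return 4 * (hidden_size * (input_size + hidden_size) + hidden_size)


def gru_params(input_size: int, hidden_size: int) -> int:
    # Three blocks (update and reset gates plus the candidate state),
    # each with input weights, recurrent weights, and one bias vector.
    return 3 * (hidden_size * (input_size + hidden_size) + hidden_size)


# Assumed sizes: 512-dimensional inputs and a 512-unit hidden state.
d, h = 512, 512
lstm = lstm_params(d, h)
gru = gru_params(d, h)
ratio = gru / lstm
print(lstm, gru, ratio)
```

Some frameworks keep separate input and recurrent bias vectors, which changes the absolute counts but not the 3:4 ratio between the two layer types.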

arXiv information

Authors	Rana Adnan Ahmad, Muhammad Azhar, Hina Sattar
Published	2023-01-06 10:00:06+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, DeepL
