Deep Learning Approaches on Image Captioning: A Review

Summary

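Image captioning is a highly important research area that aims to generate natural language descriptions of visual content in the form of still images.
The advent of deep learning and, more recently, vision-language pre-training techniques has revolutionized the field, bringing more sophisticated methods and improved performance.
In this survey paper, we provide a structured review of deep learning methods in image captioning by presenting a comprehensive taxonomy and describing each method category in detail.
We also examine the datasets commonly used in image captioning research and the evaluation metrics used to assess the performance of different captioning models.
We address the challenges faced in this field by highlighting issues such as object hallucination, missing context, illumination conditions, contextual understanding, and referring expressions.
We rank the performance of different deep learning methods according to widely used evaluation metrics, providing insight into the current state of the art.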
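Furthermore, we identify several potential future directions for research in this area, including tackling the information misalignment problem between image and text modalities, mitigating dataset bias, incorporating vision-language pre-training methods to enhance caption generation, and developing improved evaluation tools to accurately measure the quality of image captions.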

Summary (original)

Image captioning is a research area of immense importance, aiming to generate natural language descriptions for visual content in the form of still images. The advent of deep learning and more recently vision-language pre-training techniques has revolutionized the field, leading to more sophisticated methods and improved performance. In this survey paper, we provide a structured review of deep learning methods in image captioning by presenting a comprehensive taxonomy and discussing each method category in detail. Additionally, we examine the datasets commonly employed in image captioning research, as well as the evaluation metrics used to assess the performance of different captioning models. We address the challenges faced in this field by emphasizing issues such as object hallucination, missing context, illumination conditions, contextual understanding, and referring expressions. We rank different deep learning methods’ performance according to widely used evaluation metrics, giving insight into the current state of the art. Furthermore, we identify several potential future directions for research in this area, which include tackling the information misalignment problem between image and text modalities, mitigating dataset bias, incorporating vision-language pre-training methods to enhance caption generation, and developing improved evaluation tools to accurately measure the quality of image captions.

arXiv information

Authors	Taraneh Ghandi, Hamidreza Pourreza, Hamidreza Mahyar
Published	2023-08-22 17:50:41+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
