GIT: A Generative Image-to-text Transformer for Vision and Language

Abstract

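In this paper, we design and train a Generative Image-to-text Transformer (GIT) to unify vision-language tasks such as image/video captioning and question answering.
Generative models provide a consistent network architecture between pre-training and fine-tuning, but existing work typically involves complex structures (uni/multi-modal encoder/decoder) and depends on external modules such as object detectors/taggers and optical character recognition (OCR).
In GIT, we simplify the architecture to one image encoder and one text decoder under a single language modeling task.
We also scale up the pre-training data and the model size to improve model performance.
Without any extra tricks, GIT sets a new state of the art on 12 challenging benchmarks by a large margin.
For example, our model surpasses human performance on TextCaps for the first time (138.2 vs. 125.5 in CIDEr).
Furthermore, we present a new scheme for generation-based image classification and scene text recognition, achieving decent performance on standard benchmarks.
The code is available at \url{https://github.com/microsoft/GenerativeImage2Text}.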
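The abstract names "a single language modeling task" without spelling out the objective. As a rough illustration only, the sketch below computes the standard next-token cross-entropy for one caption. It is not taken from the GIT code base: the function name, the toy vocabulary, and the hand-written probabilities (standing in for a text decoder conditioned on image features) are all assumptions made for this example.

```python
import math


def language_modeling_loss(step_probs, target_ids):
    """Average next-token cross-entropy over one caption.

    step_probs[i] is a hypothetical decoder's probability distribution over
    the vocabulary at step i, assumed to be conditioned on the image and on
    the target tokens before position i. target_ids[i] is the correct token.
    """
    if len(step_probs) != len(target_ids):
        raise ValueError("need one distribution per target token")
    if not target_ids:
        raise ValueError("caption must contain at least one token")
    total = 0.0
    for probs, target in zip(step_probs, target_ids):
        # Negative log-likelihood of the correct token at this step.
        total += -math.log(probs[target])
    return total / len(target_ids)


# Toy vocabulary (assumption): index 0 is the end-of-sequence token.
VOCAB = ["[EOS]", "a", "red", "bus"]

# Made-up per-step distributions for the caption "a red bus [EOS]".
step_probs = [
    [0.10, 0.70, 0.10, 0.10],
    [0.10, 0.10, 0.60, 0.20],
    [0.05, 0.05, 0.10, 0.80],
    [0.90, 0.04, 0.03, 0.03],
]
target_ids = [1, 2, 3, 0]

loss = language_modeling_loss(step_probs, target_ids)
print(f"language modeling loss: {loss:.4f}")
```

In a real model the distributions would come from the text decoder's softmax output rather than a hand-written table, and the loss would be averaged over a batch of image-caption pairs.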

Abstract (original)

In this paper, we design and train a Generative Image-to-text Transformer, GIT, to unify vision-language tasks such as image/video captioning and question answering. While generative models provide a consistent network architecture between pre-training and fine-tuning, existing work typically contains complex structures (uni/multi-modal encoder/decoder) and depends on external modules such as object detectors/taggers and optical character recognition (OCR). In GIT, we simplify the architecture as one image encoder and one text decoder under a single language modeling task. We also scale up the pre-training data and the model size to boost the model performance. Without bells and whistles, our GIT establishes new state of the arts on 12 challenging benchmarks with a large margin. For instance, our model surpasses the human performance for the first time on TextCaps (138.2 vs. 125.5 in CIDEr). Furthermore, we present a new scheme of generation-based image classification and scene text recognition, achieving decent performance on standard benchmarks. Codes are released at \url{https://github.com/microsoft/GenerativeImage2Text}.

arXiv information

Authors	Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, Lijuan Wang
Published	2022-08-22 17:42:41+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
