I Can’t Believe There’s No Images! Learning Visual Tasks Using only Language Data

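Summary

Many of the high-level skills that computer vision tasks require, such as parsing questions, comparing and contrasting semantics, and writing descriptions, are also required in other fields such as natural language processing.
This paper asks whether those skills can therefore be learned from text data and then used to complete vision tasks without training on any visual training data.
The key to the approach is exploiting the joint embedding space of contrastively trained vision and language encoders.
In practice, the embedding spaces of different modalities in contrastive models can differ systematically; the paper analyzes how these differences affect the approach and studies a variety of strategies to mitigate the problem.
Models are built using only text training data for three tasks (image captioning, visual entailment, and visual question answering) and evaluated with images on standard benchmarks.
The authors find that this kind of transfer is possible and causes only a small drop in performance compared with models trained on images.
They also present image captioning models in a variety of styles, trained on text data from books, the web, or language models, with no image data and no human-curated language data.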

Summary (original)

Many high-level skills that are required for computer vision tasks, such as parsing questions, comparing and contrasting semantics, and writing descriptions, are also required in other domains such as natural language processing. In this paper, we ask whether this makes it possible to learn those skills from text data and then use them to complete vision tasks without ever training on visual training data. Key to our approach is exploiting the joint embedding space of contrastively trained vision and language encoders. In practice, there can be systematic differences between embedding spaces for different modalities in contrastive models, and we analyze how these differences affect our approach and study a variety of strategies to mitigate this concern. We produce models using only text training data on three tasks: image captioning, visual entailment and visual question answering, and evaluate them on standard benchmarks using images. We find that this kind of transfer is possible and results in only a small drop in performance relative to models trained on images. We also showcase a variety of stylistic image captioning models that were trained using no image data and no human-curated language data, but instead text data from books, the web, or language models.
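The transfer idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: synthetic vectors stand in for the outputs of contrastively trained text and image encoders, the constant offset `gap` is an assumed, simplified model of the systematic difference between modalities, and a nearest-centroid classifier is swapped in for the task models the paper trains. All names and parameter values here are hypothetical.

```python
import numpy as np


def normalize(x):
    """Scale vectors to unit length along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def make_embeddings(prototypes, labels, noise_std, offset, rng):
    """Synthetic unit-norm embeddings: class prototype + noise + modality offset."""
    noise = rng.normal(0.0, noise_std, size=(len(labels), prototypes.shape[1]))
    return normalize(prototypes[labels] + noise + offset)


rng = np.random.default_rng(0)
dim, num_classes = 64, 5

# Assumption: each concept has one shared direction in the joint space.
prototypes = normalize(rng.normal(size=(num_classes, dim)))
# Assumption: the "image" modality is shifted by a constant vector.
gap = 0.5 * normalize(rng.normal(size=dim))

train_labels = rng.integers(0, num_classes, size=500)
test_labels = rng.integers(0, num_classes, size=200)

# "Text" vectors are used for training, "image" vectors only for evaluation.
text_train = make_embeddings(prototypes, train_labels, 0.05, 0.0, rng)
image_test = make_embeddings(prototypes, test_labels, 0.05, gap, rng)

# Fit a nearest-centroid classifier on the text vectors alone.
centroids = normalize(
    np.stack([text_train[train_labels == k].mean(axis=0) for k in range(num_classes)])
)

# Classify the image vectors by cosine similarity to the text-derived centroids.
predictions = (image_test @ centroids.T).argmax(axis=1)
accuracy = float((predictions == test_labels).mean())
print(f"accuracy on 'image' vectors: {accuracy:.2f}")
```

The classifier never sees an "image" vector during fitting, yet it can still label them because both modalities share class directions in the same space. Increasing the norm of `gap` in this toy setup is one way to see how a larger modality difference erodes that transfer.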

arXiv information

Authors	Sophia Gu, Christopher Clark, Aniruddha Kembhavi
Published	2022-11-17 18:52:19+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
