Linear Alignment of Vision-language Models for Image Captioning

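Summary

Vision-language models such as CLIP have recently advanced the state of the art in a range of multimodal tasks, including image captioning and caption evaluation.
Many approaches adapt CLIP-style models to a downstream task by training a mapping network between CLIP and a language model.
This is costly, because it usually requires computing gradients for large models.
The authors propose a more efficient training protocol that fits a linear mapping between CLIP's image and text embeddings via a closed-form solution.
This avoids gradient computation and yields a lightweight captioning method called ReCap, which can be trained up to 1000 times faster than existing lightweight methods.
They also propose two new learning-based image-captioning metrics that build on CLIP score together with the linear mapping.
In addition, they combine ReCap with the new metrics to design an iterative datastore-augmentation loop (DAL) based on synthetic captions.
ReCap is evaluated on MS-COCO, Flickr30k, VizWiz, and MSRVTT.
It performs comparably to state-of-the-art lightweight methods on established metrics and outperforms them on the new metrics, which agree better with human ratings on Flickr8k-Expert and Flickr8k-Crowdflower.
Finally, the authors show that ReCap transfers well to other domains and that DAL leads to a performance boost.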
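
The closed-form fit described above can be illustrated with a small NumPy sketch. The abstract does not say which closed-form objective the authors use, so ridge-regularised least squares is assumed here. The function name `fit_linear_alignment`, the `ridge` parameter, and the synthetic embeddings are hypothetical stand-ins for real paired CLIP image and text embeddings.

```python
import numpy as np


def fit_linear_alignment(image_emb, text_emb, ridge=1e-3):
    """Fit W minimising ||image_emb @ W - text_emb||^2 + ridge * ||W||^2.

    Closed form: W = (X^T X + ridge * I)^{-1} X^T Y, so no gradient steps
    are needed. The ridge term is an assumption, not taken from the paper.
    """
    dim = image_emb.shape[1]
    gram = image_emb.T @ image_emb + ridge * np.eye(dim)
    return np.linalg.solve(gram, image_emb.T @ text_emb)


def l2_normalize(x):
    """Scale each row to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


# Synthetic stand-ins for paired embeddings (hypothetical data, not CLIP outputs).
rng = np.random.default_rng(0)
n_pairs, dim = 512, 16
image_emb = l2_normalize(rng.normal(size=(n_pairs, dim)))
true_map = rng.normal(size=(dim, dim))
text_emb = image_emb @ true_map + 0.01 * rng.normal(size=(n_pairs, dim))

W = fit_linear_alignment(image_emb, text_emb)

# Mean cosine similarity between image and text embeddings, before and after mapping.
unit_text = l2_normalize(text_emb)
mean_cos_before = float(np.mean(np.sum(image_emb * unit_text, axis=1)))
mean_cos_after = float(np.mean(np.sum(l2_normalize(image_emb @ W) * unit_text, axis=1)))
print(f"before: {mean_cos_before:.3f}  after: {mean_cos_after:.3f}")
```

The whole fit is a single linear solve over a `dim x dim` system, which is why a mapping of this kind can be trained without backpropagating through CLIP or a language model.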

Abstract (original)

Recently, vision-language models like CLIP have advanced the state of the art in a variety of multi-modal tasks including image captioning and caption evaluation. Many approaches adapt CLIP-style models to a downstream task by training a mapping network between CLIP and a language model. This is costly as it usually involves calculating gradients for large models. We propose a more efficient training protocol that fits a linear mapping between image and text embeddings of CLIP via a closed-form solution. This bypasses the need for gradient computation and results in a lightweight captioning method called ReCap, which can be trained up to 1000 times faster than existing lightweight methods. Moreover, we propose two new learning-based image-captioning metrics that build on CLIP score along with our linear mapping. Furthermore, we combine ReCap with our new metrics to design an iterative datastore-augmentation loop (DAL) based on synthetic captions. We evaluate ReCap on MS-COCO, Flickr30k, VizWiz, and MSRVTT. ReCap achieves performance comparable to state-of-the-art lightweight methods on established metrics while outperforming them on our new metrics, which are better aligned with human ratings on Flickr8k-Expert and Flickr8k-Crowdflower. Finally, we demonstrate that ReCap transfers well to other domains and that our DAL leads to a performance boost.
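
The abstract does not define the two new metrics, only that they build on CLIP score together with the linear mapping. As one plausible reading, the sketch below applies a mapping to the image embedding before computing a CLIPScore-style value, `w * max(cos, 0)` with `w = 2.5` as in the original CLIPScore formulation. The name `aligned_clip_score` and the toy 2-d vectors are hypothetical and only show the mechanics.

```python
import math

CLIP_SCORE_WEIGHT = 2.5  # rescaling constant from the original CLIPScore formulation


def row_times_matrix(vec, matrix):
    """Return vec @ matrix for a row vector and a list-of-rows matrix."""
    return [
        sum(v * row[j] for v, row in zip(vec, matrix))
        for j in range(len(matrix[0]))
    ]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def aligned_clip_score(image_emb, caption_emb, mapping):
    """CLIPScore-style value computed on the linearly mapped image embedding.

    Assumption: the mapping is applied to the image side only.
    """
    aligned = row_times_matrix(image_emb, mapping)
    return CLIP_SCORE_WEIGHT * max(cosine(aligned, caption_emb), 0.0)


# Toy example: the mapping swaps the two coordinates.
swap = [[0.0, 1.0], [1.0, 0.0]]
image = [1.0, 0.0]
good_caption = [0.0, 1.0]   # points the same way as the mapped image embedding
bad_caption = [0.0, -1.0]   # points the opposite way

good_score = aligned_clip_score(image, good_caption, swap)
bad_score = aligned_clip_score(image, bad_caption, swap)
print(good_score, bad_score)
```

Clipping at zero means a caption whose embedding points away from the mapped image embedding scores 0 rather than a negative value.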

arXiv information

Authors	Fabian Paischer, Markus Hofmarcher, Sepp Hochreiter, Thomas Adler
Published	2024-02-06 09:33:48+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
