Vision-Language Dataset Distillation

Summary

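Dataset distillation methods reduce a large-scale dataset to a smaller set of synthetic data that preserves enough information to quickly train a new model from scratch.
However, prior work on dataset distillation has focused exclusively on image classification datasets, whereas modern large-scale datasets are primarily in the vision-language space.
In this work, we design the first vision-language dataset distillation method, building on the idea of trajectory matching.
A key challenge is that vision-language datasets do not have a set of discrete classes.
To overcome this, our proposed method jointly distills image-text pairs in a contrastive formulation.
Further, we leverage Low-Rank Adaptation (LoRA) matching to enable more efficient and effective trajectory matching in complex modern vision-language models.
Since there are no existing baselines, we compare our distillation approach to three adapted vision-language coreset selection methods.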
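We demonstrate significant improvements on the challenging Flickr30K and COCO retrieval benchmarks. For example, on Flickr30K, the best coreset selection method, selecting 1000 image-text pairs for training, achieves only 5.6% image-to-text retrieval accuracy (i.e., recall@1); in contrast, our dataset distillation approach almost doubles that to 9.9% with just 100 (an order of magnitude fewer) training pairs.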
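
To make the reported metric concrete, the following is a minimal sketch of how image-to-text recall@k could be computed from a matrix of similarity scores. It is not the authors' evaluation code. The function name `recall_at_k`, the toy scores, and the assumption of exactly one correct caption per image (stored at the same index) are all illustrative simplifications.

```python
def recall_at_k(similarity, k=1):
    """Fraction of queries whose matching candidate ranks in the top k.

    similarity[i][j] is the score between query i (an image) and
    candidate j (a caption). Assumption: the correct caption for
    image i sits at index i, with one caption per image.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Candidate indices sorted by descending score.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Hypothetical scores for 4 images against 4 captions.
scores = [
    [0.9, 0.1, 0.2, 0.0],
    [0.3, 0.2, 0.8, 0.1],  # image 1 scores highest against the wrong caption
    [0.1, 0.2, 0.7, 0.3],
    [0.0, 0.4, 0.1, 0.6],
]
r1 = recall_at_k(scores, k=1)
print(r1)
```

Under this definition, a recall@1 of 5.6% means that the top-ranked caption is the correct one for 5.6% of the query images.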

Summary (original)

Dataset distillation methods reduce large-scale datasets to smaller sets of synthetic data, which preserve sufficient information for quickly training a new model from scratch. However, prior work on dataset distillation has focused exclusively on image classification datasets, whereas modern large-scale datasets are primarily in the vision-language space. In this work, we design the first vision-language dataset distillation method, building on the idea of trajectory matching. A key challenge is that vision-language datasets do not have a set of discrete classes. To overcome this, our proposed method jointly distills the image-text pairs in a contrastive formulation. Further, we leverage Low-Rank Adaptation (LoRA) matching to enable more efficient and effective trajectory matching in complex modern vision-language models. Since there are no existing baselines, we compare our distillation approach to three adapted vision-language coreset selection methods. We demonstrate significant improvements on the challenging Flickr30K and COCO retrieval benchmarks: for example, on Flickr30K, the best coreset selection method selecting 1000 image-text pairs for training achieves only 5.6% image-to-text retrieval accuracy (i.e., recall@1); in contrast, our dataset distillation approach almost doubles that to 9.9% with just 100 (an order of magnitude fewer) training pairs.
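
The abstract does not spell out the contrastive formulation, so the sketch below shows one common choice, assumed here for illustration: a symmetric InfoNCE-style loss over a batch of paired image and text embeddings. The function names, the temperature of 0.07, and the two-dimensional toy embeddings are hypothetical, and the loss used in the paper may differ.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; pair i is (image_embs[i], text_embs[i])."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    # logits[i][j]: temperature-scaled cosine similarity of image i and text j
    logits = [[dot(im, tx) / temperature for tx in txts] for im in imgs]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


# Matched pairs give a low loss; swapping the captions gives a high one.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

Because the loss depends only on which image goes with which text, it needs no class labels, which is why a contrastive objective suits datasets without a set of discrete classes.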

arXiv information

Authors	Xindi Wu, Byron Zhang, Zhiwei Deng, Olga Russakovsky
Published	2024-02-07 18:57:27+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
