Efficient Large-Scale Vision Representation Learning


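We detail and contrast techniques for efficiently fine-tuning large-scale vision representation learning models in low-resource settings, covering several pretrained backbone architectures from both the convolutional neural network and vision transformer families.
We evaluate the offline performance of the derived visual representations on downstream tasks.
Finally, we include online results from machine learning systems deployed in production at Etsy.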


In this article, we present our approach to single-modality vision representation learning. Understanding vision representations of product content is vital for recommendations, search, and advertising applications in e-commerce. We detail and contrast techniques used to fine-tune large-scale vision representation learning models efficiently under low-resource settings, including several pretrained backbone architectures from both the convolutional neural network and vision transformer families. We highlight the challenges for e-commerce applications at scale and our efforts to train, evaluate, and serve visual representations more efficiently. We present ablation studies for several downstream tasks, including our visually similar ad recommendations. We evaluate the offline performance of the derived visual representations in downstream tasks. To this end, we present a novel text-to-image generative offline evaluation method for visually similar recommendation systems. Finally, we include online results from deployed machine learning systems in production at Etsy.


Authors: Eden Dolev, Alaa Awad, Denisa Roberts, Zahra Ebrahimzadeh, Marcin Mejran, Vaibhav Malpani, Mahir Yavuz
Published: 2023-05-24 12:55:21+00:00

Categories: cs.CV, cs.LG