VL-LTR: Learning Class-wise Visual-Linguistic Representation for Long-Tailed Visual Recognition

Summary

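Deep learning-based models struggle when they process long-tailed data in the real world.
Existing solutions typically work on the image modality alone and address the class imbalance problem with balancing strategies or transfer learning.
This work presents VL-LTR, a visual-linguistic long-tailed recognition framework, and conducts empirical studies on the benefits of introducing the text modality into long-tailed recognition (LTR).
Compared with existing approaches, the proposed VL-LTR has the following merits.
(1) The method learns visual representations from images and also learns the corresponding linguistic representations from noisy class-level text descriptions collected from the Internet.
(2) The method effectively uses the learned visual-linguistic representations to improve visual recognition performance, especially for classes with few image samples.
The authors also conduct extensive experiments and set a new state of the art on widely used LTR benchmarks.
Notably, the method achieves 77.2% overall accuracy on ImageNet-LT, which outperforms the previous best method by more than 17 points and is close to the prevailing performance of models trained on the full ImageNet.
Code is available at https://github.com/ChangyaoTian/VL-LTR.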
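The abstract does not spell out how the visual and linguistic representations are combined at prediction time. As a rough illustration only, the sketch below assumes that each class has one text embedding and that an image is assigned to the class whose text embedding is most similar to the image embedding under cosine similarity. The function names, the toy three-dimensional embeddings, and the class names are all hypothetical and are not taken from the VL-LTR code.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (hypothetical helper, not from VL-LTR)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def classify_by_text_anchor(image_embedding, class_text_embeddings):
    """Pick the class whose text embedding is closest to the image embedding.

    Assumption: one text embedding per class. Real systems may aggregate
    several noisy sentences per class; that step is omitted here.
    """
    scores = {
        name: cosine_similarity(image_embedding, text_embedding)
        for name, text_embedding in class_text_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy embeddings, invented for illustration only.
class_text_embeddings = {
    "head_class": [1.0, 0.0, 0.0],
    "tail_class": [0.0, 1.0, 0.0],
}
image_embedding = [0.2, 0.9, 0.1]

predicted, scores = classify_by_text_anchor(image_embedding, class_text_embeddings)
print(predicted)  # tail_class
```

One reason such a text-based anchor could help tail classes is that a class description exists even when very few images do, so the classifier for that class does not depend only on scarce image samples.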

Abstract (original)

Deep learning-based models encounter challenges when processing long-tailed data in the real world. Existing solutions usually employ some balancing strategies or transfer learning to deal with the class imbalance problem, based on the image modality. In this work, we present a visual-linguistic long-tailed recognition framework, termed VL-LTR, and conduct empirical studies on the benefits of introducing text modality for long-tailed recognition (LTR). Compared to existing approaches, the proposed VL-LTR has the following merits. (1) Our method can not only learn visual representation from images but also learn corresponding linguistic representation from noisy class-level text descriptions collected from the Internet; (2) Our method can effectively use the learned visual-linguistic representation to improve the visual recognition performance, especially for classes with fewer image samples. We also conduct extensive experiments and set the new state-of-the-art performance on widely-used LTR benchmarks. Notably, our method achieves 77.2% overall accuracy on ImageNet-LT, which significantly outperforms the previous best method by over 17 points, and is close to the prevailing performance training on the full ImageNet. Code is available at https://github.com/ChangyaoTian/VL-LTR.

arXiv information

Authors	Changyao Tian, Wenhai Wang, Xizhou Zhu, Jifeng Dai, Yu Qiao
Published	2022-07-19 16:24:56+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google

