An Empirical Study of CLIP for Text-based Person Search

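Abstract

Text-based Person Search (TBPS) aims to retrieve images of a person using natural language descriptions.
Recently, Contrastive Language Image Pretraining (CLIP), a universal large cross-modal vision-language pre-training model, has performed remarkably well on various cross-modal downstream tasks thanks to its powerful cross-modal semantic learning capacity.
TBPS, as a fine-grained cross-modal retrieval task, is likewise seeing a rise in CLIP-based research.
To explore the potential of vision-language pre-training models for downstream TBPS tasks, this paper makes the first attempt at a comprehensive empirical study of CLIP for TBPS, and thereby contributes a straightforward, incremental, yet strong TBPS-CLIP baseline to the TBPS community.
We revisit critical design considerations under CLIP, including data augmentation and the loss function.
With these designs and practical training tricks, the model attains satisfactory performance without any sophisticated modules.
We also conduct probing experiments on TBPS-CLIP in model generalization and model compression, demonstrating its effectiveness from various aspects.
This work is expected to provide empirical insights and to highlight future CLIP-based TBPS research.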

Abstract (original)

Text-based Person Search (TBPS) aims to retrieve the person images using natural language descriptions. Recently, Contrastive Language Image Pretraining (CLIP), a universal large cross-modal vision-language pre-training model, has remarkably performed over various cross-modal downstream tasks due to its powerful cross-modal semantic learning capacity. TBPS, as a fine-grained cross-modal retrieval task, is also facing the rise of research on the CLIP-based TBPS. In order to explore the potential of the visual-language pre-training model for downstream TBPS tasks, this paper makes the first attempt to conduct a comprehensive empirical study of CLIP for TBPS and thus contribute a straightforward, incremental, yet strong TBPS-CLIP baseline to the TBPS community. We revisit critical design considerations under CLIP, including data augmentation and loss function. The model, with the aforementioned designs and practical training tricks, can attain satisfactory performance without any sophisticated modules. Also, we conduct the probing experiments of TBPS-CLIP in model generalization and model compression, demonstrating the effectiveness of TBPS-CLIP from various aspects. This work is expected to provide empirical insights and highlight future CLIP-based TBPS research.

arXiv information

Authors	Min Cao, Yang Bai, Ziyin Zeng, Mang Ye, Min Zhang
Published	2023-12-21 04:01:11+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
