Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks

Summary

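We introduce Florence-2, a new vision foundation model with a unified, prompt-based representation for a wide range of computer vision and vision-language tasks.
Existing large vision models excel at transfer learning but struggle to perform diverse tasks from simple instructions, a capability that requires handling varying spatial hierarchies and levels of semantic granularity.
Florence-2 is designed to take a text prompt as the task instruction and to generate the desired result in text form, whether the task is captioning, object detection, grounding, or segmentation.
This multi-task learning setup requires large-scale, high-quality annotated data.
To this end, the authors co-developed FLD-5B, which consists of 5.4 billion comprehensive visual annotations on 126 million images, built with an iterative strategy of automated image annotation and model refinement.
They adopted a sequence-to-sequence structure to train Florence-2 to perform versatile and comprehensive vision tasks.
Extensive evaluations on numerous tasks showed Florence-2 to be a strong vision foundation model contender with unprecedented zero-shot and fine-tuning capabilities.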
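
To give a sense of the scale of FLD-5B, the short sketch below works out the average annotation density implied by the two figures quoted in the abstract (5.4 billion annotations, 126 million images). The helper name `annotations_per_image` is hypothetical and used only for illustration; the result is a simple mean and says nothing about how annotations are actually distributed across images or annotation types.

```python
# Figures quoted in the abstract for FLD-5B.
TOTAL_ANNOTATIONS = 5_400_000_000  # 5.4 billion visual annotations
TOTAL_IMAGES = 126_000_000         # 126 million images


def annotations_per_image(annotations: int, images: int) -> float:
    """Return the mean number of annotations per image (illustrative helper)."""
    if images <= 0:
        raise ValueError("images must be positive")
    return annotations / images


density = annotations_per_image(TOTAL_ANNOTATIONS, TOTAL_IMAGES)
print(f"about {density:.1f} annotations per image")
```

On these figures the mean comes out at roughly 43 annotations per image, which fits the abstract's description of the annotations as comprehensive.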

Abstract (original)

We introduce Florence-2, a novel vision foundation model with a unified, prompt-based representation for a variety of computer vision and vision-language tasks. While existing large vision models excel in transfer learning, they struggle to perform a diversity of tasks with simple instructions, a capability that implies handling the complexity of various spatial hierarchy and semantic granularity. Florence-2 was designed to take text-prompt as task instructions and generate desirable results in text forms, whether it be captioning, object detection, grounding or segmentation. This multi-task learning setup demands large-scale, high-quality annotated data. To this end, we co-developed FLD-5B that consists of 5.4 billion comprehensive visual annotations on 126 million images, using an iterative strategy of automated image annotation and model refinement. We adopted a sequence-to-sequence structure to train Florence-2 to perform versatile and comprehensive vision tasks. Extensive evaluations on numerous tasks demonstrated Florence-2 to be a strong vision foundation model contender with unprecedented zero-shot and fine-tuning capabilities.

arXiv information

Authors	Bin Xiao, Haiping Wu, Weijian Xu, Xiyang Dai, Houdong Hu, Yumao Lu, Michael Zeng, Ce Liu, Lu Yuan
Published	2023-11-10 18:59:08+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
