Towards Open-Ended Visual Recognition with Large Language Model

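Summary

Localizing and recognizing objects in the open-ended physical world is a long-standing challenge in machine perception.
Recent methods address it by pairing a class-agnostic mask (or box) proposal model with an open-vocabulary classifier (e.g., CLIP) that uses pre-extracted text embeddings.
However, these open-vocabulary recognition models still have limitations in practical applications.
First, they require class names to be supplied at test time, so recognition performance depends heavily on the set of semantic classes the user predefines.
Second, training on multiple datasets requires human intervention to reconcile conflicting label definitions across those datasets.
This paper introduces the OmniScient Model (OSM), a novel Large Language Model (LLM) based mask classifier, as a straightforward and effective solution to both challenges.
Specifically, OSM predicts class labels generatively, which removes the need to supply class names during both training and testing.
It also enables cross-dataset training without human intervention, and it generalizes robustly thanks to the world knowledge acquired from the LLM.
Combining OSM with an off-the-shelf mask proposal model, the authors report promising results on various benchmarks and demonstrate its effectiveness in handling novel concepts.
Code and models are available at https://github.com/bytedance/OmniScient-Model.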

Summary (original)

Localizing and recognizing objects in the open-ended physical world poses a long-standing challenge within the domain of machine perception. Recent methods have endeavored to address the issue by employing a class-agnostic mask (or box) proposal model, complemented by an open-vocabulary classifier (e.g., CLIP) using pre-extracted text embeddings. However, it is worth noting that these open-vocabulary recognition models still exhibit limitations in practical applications. On one hand, they rely on the provision of class names during testing, where the recognition performance heavily depends on this predefined set of semantic classes by users. On the other hand, when training with multiple datasets, human intervention is required to alleviate the label definition conflict between them. In this paper, we introduce the OmniScient Model (OSM), a novel Large Language Model (LLM) based mask classifier, as a straightforward and effective solution to the aforementioned challenges. Specifically, OSM predicts class labels in a generative manner, thus removing the supply of class names during both training and testing. It also enables cross-dataset training without any human interference, exhibiting robust generalization capabilities due to the world knowledge acquired from the LLM. By combining OSM with an off-the-shelf mask proposal model, we present promising results on various benchmarks, and demonstrate its effectiveness in handling novel concepts. Code/model are available at https://github.com/bytedance/OmniScient-Model.
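
The dependence on user-supplied class names that the abstract describes can be illustrated with a toy sketch. This is not the authors' code and not CLIP itself: the function names, the 3-dimensional vectors, and the class list are all hypothetical, chosen only to show that a classifier which matches a mask embedding against pre-extracted text embeddings can only ever answer with a name from the supplied vocabulary.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def classify_mask(mask_embedding, text_embeddings):
    """Return the supplied class name whose text embedding is closest to the mask embedding."""
    return max(text_embeddings, key=lambda name: cosine(mask_embedding, text_embeddings[name]))

# Hypothetical, hand-made embeddings; a real system would obtain these from a text encoder.
TEXT_EMBEDDINGS = {
    "cat": [1.0, 0.1, 0.0],
    "dog": [0.9, 0.4, 0.1],
    "zebra": [0.1, 0.2, 1.0],
}

# Hypothetical embedding of a mask proposal that actually covers a zebra.
mask_embedding = [0.2, 0.1, 0.9]

# Vocabulary that includes the true class, and one where the user omitted it.
full_vocab = TEXT_EMBEDDINGS
limited_vocab = {name: vec for name, vec in TEXT_EMBEDDINGS.items() if name != "zebra"}

print(classify_mask(mask_embedding, full_vocab))     # zebra
print(classify_mask(mask_embedding, limited_vocab))  # dog
```

With the full vocabulary the toy classifier returns the matching name, but once "zebra" is left out it is forced onto the nearest remaining name. A generative classifier, as the abstract describes for OSM, produces the label text directly and so does not need this list as input.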

arXiv information

Authors	Qihang Yu, Xiaohui Shen, Liang-Chieh Chen
Published	2023-11-14 18:59:01+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
