Leveraging VLM-Based Pipelines to Annotate 3D Objects

Abstract

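Pretrained vision language models (VLMs) offer a way to caption unlabeled 3D objects at scale.
The leading approach to summarizing VLM descriptions from multiple views of an object (Luo et al., 2023) relies on a language model (GPT4) to generate the final output.
Because this text-based aggregation merges potentially contradictory descriptions, it is prone to hallucination.
We propose an alternative algorithm that marginalizes over factors, such as the viewpoint, that affect the VLM's response.
Instead of merging text-only responses, it uses the VLM's joint image-text likelihoods.
We show that this probabilistic aggregation is more reliable and efficient, and that it sets the SoTA for inferring object types with respect to human-verified labels.
The aggregated annotations are also useful for conditional inference.
They improve downstream predictions (for example, of object material) when the object's type is supplied as an auxiliary text-based input.
Such auxiliary inputs make it possible, in an unsupervised setting, to ablate the contribution of visual reasoning relative to visionless reasoning.
Using these supervised and unsupervised evaluations, we show how a VLM-based pipeline can produce reliable annotations for 764K objects from the Objaverse dataset.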
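
As a rough illustration of what marginalizing over viewpoints can look like, the sketch below averages per-view label probabilities in log space. It is not the authors' algorithm, whose details the abstract does not give. It assumes a uniform prior over views, toy log-likelihoods standing in for real VLM scores, and hypothetical names (`log_mean_exp`, `aggregate_views`, `MISSING_LOG_PROB`).

```python
import math

# Hypothetical floor for labels a view did not score (an assumption, not from the paper).
MISSING_LOG_PROB = math.log(1e-6)


def log_mean_exp(values):
    """Numerically stable log of the mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def aggregate_views(view_scores):
    """Marginalize label log-likelihoods over views, assuming a uniform view prior.

    view_scores: list of dicts, one per rendered view, each mapping a candidate
    label to log p(label | view). Returns a dict mapping each label to
    log of the mean over views of p(label | view).
    """
    labels = set().union(*view_scores)
    return {
        label: log_mean_exp(
            [scores.get(label, MISSING_LOG_PROB) for scores in view_scores]
        )
        for label in labels
    }


# Toy scores for three views of one object (made-up numbers for illustration).
views = [
    {"chair": math.log(0.7), "stool": math.log(0.3)},
    {"chair": math.log(0.6), "table": math.log(0.4)},
    {"chair": math.log(0.5), "stool": math.log(0.5)},
]
aggregated = aggregate_views(views)
best_label = max(aggregated, key=aggregated.get)
print(best_label, round(math.exp(aggregated[best_label]), 3))
```

Unlike merging free-text captions, this kind of aggregation never has to reconcile contradictory sentences: a label that only one view supports is simply down-weighted by the views that assign it low probability.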

Abstract (original)

Pretrained vision language models (VLMs) present an opportunity to caption unlabeled 3D objects at scale. The leading approach to summarize VLM descriptions from different views of an object (Luo et al., 2023) relies on a language model (GPT4) to produce the final output. This text-based aggregation is susceptible to hallucinations as it merges potentially contradictory descriptions. We propose an alternative algorithm to marginalize over factors such as the viewpoint that affect the VLM’s response. Instead of merging text-only responses, we utilize the VLM’s joint image-text likelihoods. We show our probabilistic aggregation is not only more reliable and efficient, but sets the SoTA on inferring object types with respect to human-verified labels. The aggregated annotations are also useful for conditional inference; they improve downstream predictions (e.g., of object material) when the object’s type is specified as an auxiliary text-based input. Such auxiliary inputs allow ablating the contribution of visual reasoning over visionless reasoning in an unsupervised setting. With these supervised and unsupervised evaluations, we show how a VLM-based pipeline can be leveraged to produce reliable annotations for 764K objects from the Objaverse dataset.

arXiv information

Authors	Rishabh Kabra, Loic Matthey, Alexander Lerchner, Niloy J. Mitra
Published	2024-06-17 17:27:19+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
