CL3DOR: Contrastive Learning for 3D Large Multimodal Models via Odds Ratio on High-Resolution Point Clouds

Summary

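Recent studies have shown that large language models (LLMs) are not limited to text-only tasks and can also serve as multimodal models across modalities such as audio, images, and video.
In particular, research on 3D large multimodal models (3D LMMs) has advanced notably, driven by their potential to process higher-dimensional data such as point clouds.
On closer inspection, however, the visual and textual content in each sample of existing training datasets lacks informational granularity and clarity, which becomes a bottleneck for accurate cross-modal understanding.
To address these problems, we propose CL3DOR (Contrastive Learning for 3D large multimodal models via Odds ratio on high-Resolution point clouds), which is designed to ensure greater specificity and clarity in both visual and textual content.
Specifically, we increase the density of the point cloud for each object and construct informative hard negative responses in the training dataset to penalize unwanted responses.
To make use of the hard negative responses, we add the odds ratio to the conventional language modeling loss as an auxiliary term for contrastive learning.
CL3DOR achieves state-of-the-art performance on 3D scene understanding and reasoning benchmarks.
In addition, extensive experiments demonstrate the effectiveness of CL3DOR's key components.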
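
The abstract does not spell out the loss, so the following is only a minimal sketch of how an odds-ratio term might be attached to a language modeling loss. It assumes an ORPO-style formulation: the language modeling loss on the preferred response plus a weighted penalty of -log sigmoid(log odds ratio) between the preferred response and the hard negative. The function names, the `weight` parameter, its default value, and the use of length-normalised log-probabilities are all hypothetical choices for illustration, not the paper's confirmed method.

```python
import math


def sequence_log_prob(token_log_probs):
    """Length-normalised log-probability: the mean of the token log-probs."""
    return sum(token_log_probs) / len(token_log_probs)


def log_odds(log_p):
    """Return log(p / (1 - p)) given log p. Requires log_p < 0 (p < 1)."""
    return log_p - math.log1p(-math.exp(log_p))


def neg_log_sigmoid(x):
    """Numerically stable -log(sigmoid(x)), i.e. softplus(-x)."""
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


def odds_ratio_loss(pos_token_log_probs, neg_token_log_probs, weight=0.1):
    """Assumed ORPO-style objective (hypothetical, not taken from the paper).

    pos_token_log_probs: token log-probs of the preferred response.
    neg_token_log_probs: token log-probs of the hard negative response.
    Returns (total_loss, lm_loss, odds_ratio_term).
    """
    log_p_pos = sequence_log_prob(pos_token_log_probs)
    log_p_neg = sequence_log_prob(neg_token_log_probs)

    # Conventional language modeling loss on the preferred response.
    lm_loss = -log_p_pos

    # Log odds ratio: large and positive when the model favours the
    # preferred response over the hard negative.
    log_odds_ratio = log_odds(log_p_pos) - log_odds(log_p_neg)

    # Auxiliary contrastive term: small when the preferred response wins.
    or_term = neg_log_sigmoid(log_odds_ratio)

    return lm_loss + weight * or_term, lm_loss, or_term


if __name__ == "__main__":
    total, lm, aux = odds_ratio_loss([-0.1, -0.2], [-2.0, -3.0])
    print(f"total={total:.4f} lm={lm:.4f} odds_ratio_term={aux:.4f}")
```

In this sketch the auxiliary term shrinks as the model assigns higher odds to the preferred response than to the hard negative, and it equals log 2 when the two are scored identically, so it penalizes the unwanted response without replacing the ordinary language modeling loss.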

Abstract (original)

Recent research has demonstrated that Large Language Models (LLMs) are not limited to text-only tasks but can also function as multimodal models across various modalities, including audio, images, and videos. In particular, research on 3D Large Multimodal Models (3D LMMs) is making notable strides, driven by the potential of processing higher-dimensional data like point clouds. However, upon closer examination, we find that the visual and textual content within each sample of existing training datasets lacks both high informational granularity and clarity, which serve as a bottleneck for precise cross-modal understanding. To address these issues, we propose CL3DOR, Contrastive Learning for 3D large multimodal models via Odds ratio on high-Resolution point clouds, designed to ensure greater specificity and clarity in both visual and textual content. Specifically, we increase the density of point clouds per object and construct informative hard negative responses in the training dataset to penalize unwanted responses. To leverage hard negative responses, we incorporate the odds ratio as an auxiliary term for contrastive learning into the conventional language modeling loss. CL3DOR achieves state-of-the-art performance in 3D scene understanding and reasoning benchmarks. Additionally, we demonstrate the effectiveness of CL3DOR’s key components through extensive experiments.

arXiv information

Authors	Keonwoo Kim, Yeongjae Cho, Taebaek Hwang, Minsoo Jo, Sangdo Han
Published	2025-01-07 15:42:32+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
