See It All: Contextualized Late Aggregation for 3D Dense Captioning

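Summary

3D dense captioning is the task of localizing objects in a 3D scene and generating a descriptive sentence for each one.
Recent approaches to 3D dense captioning adopt transformer encoder-decoder frameworks from object detection to build end-to-end pipelines without hand-crafted components.
However, these approaches struggle with conflicting objectives: the attention of a single query must simultaneously cover both the tightly localized object region and the surrounding contextual environment.
To overcome this challenge, we introduce SIA (See-It-All), a transformer pipeline that performs 3D dense captioning with a new paradigm called late aggregation.
SIA simultaneously decodes two sets of queries: context queries and instance queries.
Instance queries focus on localization and on describing object attributes, while context queries flexibly capture regions of interest for relationships among multiple objects or between an object and the global scene; the two sets are aggregated afterwards (hence late aggregation) using simple distance-based measures.
To further improve the quality of contextualized caption generation, we design a new aggregator that generates a fully informed caption from the surrounding context, the global environment, and the object instances.
Extensive experiments on two of the most widely used 3D dense captioning datasets show that the proposed method achieves a significant improvement over prior methods.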
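
The abstract says only that context queries are aggregated with instance queries "via simple distance-based measures" and does not specify the measure. The following is a minimal sketch of one possible reading, not the authors' implementation: each instance query is given a 3D center, and the features of its k nearest context queries (by Euclidean distance) are averaged. The function name `late_aggregate`, the parameter `k`, the use of mean pooling, and the toy data are all assumptions made for illustration.

```python
import math


def late_aggregate(instance_centers, context_centers, context_feats, k=2):
    """Hypothetical late aggregation: for each instance center, average the
    features of the k context queries whose centers are closest to it.

    Assumes at least one context query and equal-length feature vectors.
    """
    dim = len(context_feats[0])
    aggregated = []
    for center in instance_centers:
        # Rank context queries by Euclidean distance to this instance center.
        order = sorted(
            range(len(context_centers)),
            key=lambda j: math.dist(center, context_centers[j]),
        )
        nearest = order[:k]
        # Mean-pool the selected context features (an assumed pooling choice).
        pooled = [
            sum(context_feats[j][d] for j in nearest) / len(nearest)
            for d in range(dim)
        ]
        aggregated.append(pooled)
    return aggregated


# Toy data: two instances, four context queries with 2-dimensional features.
instance_centers = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
context_centers = [(0.5, 0.0, 0.0), (1.0, 0.0, 0.0), (4.5, 0.0, 0.0), (6.0, 0.0, 0.0)]
context_feats = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]

result = late_aggregate(instance_centers, context_centers, context_feats, k=2)
print(result)
```

In this toy setup the first instance pools the two context queries near the origin and the second pools the two near x = 5. In a real pipeline the pooled context feature would presumably be passed, together with the instance feature, to the caption-generating aggregator described above; that step is not sketched here.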

Abstract (original)

3D dense captioning is a task to localize objects in a 3D scene and generate descriptive sentences for each object. Recent approaches in 3D dense captioning have adopted transformer encoder-decoder frameworks from object detection to build an end-to-end pipeline without hand-crafted components. However, these approaches struggle with contradicting objectives where a single query attention has to simultaneously view both the tightly localized object regions and contextual environment. To overcome this challenge, we introduce SIA (See-It-All), a transformer pipeline that engages in 3D dense captioning with a novel paradigm called late aggregation. SIA simultaneously decodes two sets of queries: context queries and instance queries. The instance query focuses on localization and object attribute descriptions, while the context query versatilely captures the region-of-interest of relationships between multiple objects or with the global scene; the two are then aggregated afterwards (i.e., late aggregation) via simple distance-based measures. To further enhance the quality of contextualized caption generation, we design a novel aggregator to generate a fully informed caption based on the surrounding context, the global environment, and object instances. Extensive experiments on two of the most widely used 3D dense captioning datasets demonstrate that our proposed method achieves a significant improvement over prior methods.

arXiv information

Authors	Minjung Kim, Hyung Suk Lim, Seung Hwan Kim, Soonyoung Lee, Bumsoo Kim, Gunhee Kim
Published	2024-08-14 16:19:18+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
