Extending Large Vision-Language Model for Diverse Interactive Tasks in Autonomous Driving

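Abstract

Large vision-language models (LVLMs) have significantly advanced image understanding.
Their comprehension and reasoning capabilities enable promising applications in autonomous driving scenarios.
However, existing work typically focuses on the front-view perspective and on only some of the objects in a scene, and so struggles to achieve comprehensive scene understanding.
Existing LVLMs also lack a mapping between 2D and 3D, and integrate 3D object localization with instruction understanding only weakly.
To address these limitations, we first introduce NuInteract, a large-scale dataset with over 1.5M multi-view image-language pairs spanning dense scene captions and diverse interactive tasks.
We further propose DriveMonkey, a simple yet effective framework that integrates LVLMs with a spatial processor through a series of learnable queries.
The spatial processor is designed as a plug-and-play component and can be initialized with pre-trained 3D detectors to improve 3D perception.
Our experiments show that DriveMonkey outperforms general LVLMs, with a notable 9.86% improvement on the 3D visual grounding task.
The dataset and code will be released at https://github.com/zc-zhao/DriveMonkey.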

Abstract (original)

The Large Visual-Language Models (LVLMs) have significantly advanced image understanding. Their comprehension and reasoning capabilities enable promising applications in autonomous driving scenarios. However, existing research typically focuses on front-view perspectives and partial objects within scenes, struggling to achieve comprehensive scene understanding. Meanwhile, existing LVLMs suffer from the lack of mapping relationship between 2D and 3D and insufficient integration of 3D object localization and instruction understanding. To tackle these limitations, we first introduce NuInteract, a large-scale dataset with over 1.5M multi-view image language pairs spanning dense scene captions and diverse interactive tasks. Furthermore, we propose DriveMonkey, a simple yet effective framework that seamlessly integrates LVLMs with a spatial processor using a series of learnable queries. The spatial processor, designed as a plug-and-play component, can be initialized with pre-trained 3D detectors to improve 3D perception. Our experiments show that DriveMonkey outperforms general LVLMs, especially achieving a 9.86% notable improvement on the 3D visual grounding task. The dataset and code will be released at https://github.com/zc-zhao/DriveMonkey.
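
The abstract says only that DriveMonkey connects the LVLM to a spatial processor "using a series of learnable queries" and gives no further detail. As a rough illustration of the general idea, the sketch below shows a small set of query vectors cross-attending over a sequence of hidden states to produce one feature vector per query. Everything in it is an assumption made for illustration: the names `cross_attend`, `queries`, and `tokens`, the dimensions, the single-head attention without projections, and the random values standing in for trained parameters and LVLM outputs. It is not the authors' architecture.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, tokens):
    """Single-head scaled dot-product attention with no learned projections.

    Each query attends over all tokens and returns a weighted average of them.
    """
    dim = len(tokens[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ti for qi, ti in zip(q, t)) / math.sqrt(dim)
            for t in tokens
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * t[d] for w, t in zip(weights, tokens)) for d in range(dim)]
        )
    return outputs


rng = random.Random(0)
DIM, NUM_QUERIES, NUM_TOKENS = 8, 4, 16

# Hypothetical learnable queries: in a real model these would be parameters
# updated by gradient descent, not fixed random vectors.
queries = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_QUERIES)]

# Random stand-in for the hidden states an LVLM would produce.
tokens = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_TOKENS)]

# One feature vector per query; a downstream module (here, hypothetically,
# the spatial processor) would consume these.
query_features = cross_attend(queries, tokens)
```

The number of queries fixes the size of the interface between the two components regardless of how many tokens the language model emits, which is one common reason for this kind of design.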

arXiv information

Authors	Zongchuang Zhao, Haoyu Fu, Dingkang Liang, Xin Zhou, Dingyuan Zhang, Hongwei Xie, Bing Wang, Xiang Bai
Published	2025-05-13 16:36:51+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
