Leveraging Vision-Language Models for Visual Grounding and Analysis of Automotive UI

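Summary

Modern automotive infotainment systems need intelligent, adaptive solutions to cope with frequent user interface (UI) updates and diverse design variations.
We introduce a vision-language framework for understanding and interacting with automotive infotainment systems that adapts seamlessly across different UI designs.
To support further research in this field, we release AutomotiveUI-Bench-4K, an open-source dataset of 998 images with 4,208 annotations.
We also present a synthetic data pipeline for generating training data.
We fine-tune a Molmo-7B-based model using Low-Rank Adaptation (LoRA), incorporating reasoning generated by the pipeline along with visual grounding and evaluation capabilities.
The fine-tuned Evaluative Large Action Model (ELAM) achieves strong performance on AutomotiveUI-Bench-4K (the model and dataset are available on Hugging Face) and demonstrates strong cross-domain generalization, including a +5.2% improvement on ScreenSpot over the baseline model.
Notably, our approach achieves 80.4% average accuracy on ScreenSpot, closely matching or even surpassing specialized models for desktop, mobile, and web, such as ShowUI, despite being trained for the infotainment domain.
This research investigates how data collection and subsequent fine-tuning can drive AI-based progress in automotive UI understanding and interaction.
The applied method is cost-efficient, and the fine-tuned models can be deployed on consumer-grade GPUs.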
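
For readers unfamiliar with Low-Rank Adaptation, the general LoRA formulation can be written as below. This is the generic textbook form, not a description of the authors' configuration: the rank r, the scaling factor α, and which weight matrices of the Molmo-7B-based model are adapted are not stated in the abstract, so every symbol here is a placeholder.

```latex
% Generic LoRA update for one frozen weight matrix W_0 (symbols are illustrative).
\[
  W' = W_0 + \frac{\alpha}{r}\, B A,
  \qquad
  W_0 \in \mathbb{R}^{d \times k},\quad
  B \in \mathbb{R}^{d \times r},\quad
  A \in \mathbb{R}^{r \times k},\quad
  r \ll \min(d, k)
\]
% Only A and B are trained, so the trainable parameter count per matrix is
\[
  r\,(d + k) \quad \text{instead of} \quad d\,k .
\]
```

Because only the two small factors are updated while W_0 stays frozen, the number of trainable parameters per adapted matrix drops from dk to r(d + k), which is the usual reason LoRA fine-tuning is considered inexpensive.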

Abstract (original)

Modern automotive infotainment systems require intelligent and adaptive solutions to handle frequent User Interface (UI) updates and diverse design variations. We introduce a vision-language framework for understanding and interacting with automotive infotainment systems, enabling seamless adaptation across different UI designs. To further support research in this field, we release AutomotiveUI-Bench-4K, an open-source dataset of 998 images with 4,208 annotations. Additionally, we present a synthetic data pipeline to generate training data. We fine-tune a Molmo-7B-based model using Low-Rank Adaptation (LoRa) and incorporating reasoning generated by our pipeline, along with visual grounding and evaluation capabilities. The fine-tuned Evaluative Large Action Model (ELAM) achieves strong performance on AutomotiveUI-Bench-4K (model and dataset are available on Hugging Face) and demonstrating strong cross-domain generalization, including a +5.2% improvement on ScreenSpot over the baseline model. Notably, our approach achieves 80.4% average accuracy on ScreenSpot, closely matching or even surpassing specialized models for desktop, mobile, and web, such as ShowUI, despite being trained for the infotainment domain. This research investigates how data collection and subsequent fine-tuning can lead to AI-driven progress within automotive UI understanding and interaction. The applied method is cost-efficient and fine-tuned models can be deployed on consumer-grade GPUs.
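
The abstract reports grounding accuracy without defining the metric. A common convention for GUI grounding benchmarks is to count a prediction as correct when the predicted point falls inside the ground-truth bounding box of the target element. The sketch below assumes that convention and normalized coordinates; the `Box` class, the `grounding_accuracy` function, and the toy data are hypothetical and are not taken from the paper or its dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned ground-truth box in normalized [0, 1] image coordinates."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def contains(self, x: float, y: float) -> bool:
        # Points on the box border are counted as inside (an assumption).
        return self.x_min <= x <= self.x_max and self.y_min <= y <= self.y_max


def grounding_accuracy(predictions, boxes):
    """Fraction of predicted (x, y) points that land inside their target box."""
    if len(predictions) != len(boxes):
        raise ValueError("predictions and boxes must have the same length")
    if not boxes:
        return 0.0
    hits = sum(box.contains(x, y) for (x, y), box in zip(predictions, boxes))
    return hits / len(boxes)


# Hypothetical toy data: the first point is inside its box, the second is not.
boxes = [Box(0.10, 0.10, 0.30, 0.20), Box(0.50, 0.50, 0.90, 0.80)]
predictions = [(0.20, 0.15), (0.40, 0.60)]
accuracy = grounding_accuracy(predictions, boxes)
print(f"accuracy = {accuracy:.2f}")
```

Under this convention, a benchmark-wide average such as the one quoted above would be this fraction computed over all annotated elements, possibly averaged per platform or element type; the abstract does not say which averaging is used.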

arXiv information

Authors	Benjamin Raphael Ernhofer,Daniil Prokhorov,Jannica Langner,Dominik Bollmann
Published	2025-05-09 09:01:52+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
