jarxiv | Japanese arxiv | Page 951

Multi-head Ensemble of Smoothed Classifiers for Certified Robustness

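Posted on: April 14, 2025 | Author: jarxiv

Summary

Randomized smoothing (RS) is a promising technique for certified robustness, and ensembles of multiple deep neural networks (DNNs) have recently achieved state-of-the-art performance in RS thanks to their variance-reduction effect over Gaussian noise.
However, such ensembles impose a heavy computational burden in both training and certification, and they under-exploit the individual DNNs and their mutual effects because communication between the classifiers is usually ignored during optimization.
This work considers a novel ensemble-based training method for a single DNN with multiple augmented heads, named SmOothed Multi-head Ensemble (SOME).
In SOME, in the same spirit as pursuing variance reduction through ensembling, an ensemble of multiple heads under a cosine constraint is used inside a single DNN, at a much lower training and certification cost in RS.
For this network structure, the associated training strategy introduces a circular communication flow among the augmented heads.
That is, each head teaches its neighbor through a self-paced learning strategy, using smoothed losses designed specifically with certified robustness in mind.
Together, the multi-head structure and the circular-teaching scheme in SOME promote diversity among the heads and benefit their ensemble, yielding a certifiably robust RS-based defense that is competitive with or stronger than ensembling multiple DNNs (effectiveness) at a much lower computational cost (efficiency), as verified by extensive experiments and discussion.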

Summary (original)

Randomized Smoothing (RS) is a promising technique for certified robustness, and recently in RS the ensemble of multiple Deep Neural Networks (DNNs) has shown state-of-the-art performances due to its variance reduction effect over Gaussian noises. However, such an ensemble brings heavy computation burdens in both training and certification, and yet under-exploits individual DNNs and their mutual effects, as the communication between these classifiers is commonly ignored in optimization. In this work, we consider a novel ensemble-based training way for a single DNN with multiple augmented heads, named as SmOothed Multi-head Ensemble (SOME). In SOME, similar to the pursuit of variance reduction via ensemble, an ensemble of multiple heads imposed with a cosine constraint inside a single DNN is employed with much cheaper training and certification computation overloads in RS. In such network structure, an associated training strategy is designed by introducing a circular communication flow among those augmented heads. That is, each head teaches its neighbor with the self-paced learning strategy using smoothed losses, which are specifically designed in relation to certified robustness. The deployed multi-head structure and the circular-teaching scheme in SOME jointly contribute to the diversities among multiple heads and benefit their ensemble, leading to a competitively stronger certifiably-robust RS-based defense than ensembling multiple DNNs (effectiveness) at the cost of much less computational expenses (efficiency), verified by extensive experiments and discussions.
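As a rough illustration of the ideas in this abstract, the following Python sketch shows a smoothed prediction by majority vote under Gaussian noise, taken over the averaged scores of several heads. It is not the SOME implementation: the linear heads, the function names, and the use of cosine similarity between head weights as a diversity measure are assumptions made for illustration, and the radius formula is the standard certificate from the randomized smoothing literature rather than anything stated in the abstract.

```python
import math
import random
from collections import Counter
from statistics import NormalDist


def cosine_similarity(u, v):
    """Cosine similarity between two weight vectors (hypothetical diversity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ensemble_scores(heads, x):
    """Average per-class scores over all heads.

    Each head is a list of weight rows, one row per class (toy linear heads).
    """
    n_classes = len(heads[0])
    scores = [0.0] * n_classes
    for head in heads:
        for c, w in enumerate(head):
            scores[c] += sum(wi * xi for wi, xi in zip(w, x)) / len(heads)
    return scores


def smoothed_predict(heads, x, sigma=0.25, n_samples=200, seed=0):
    """Majority vote of the head ensemble over Gaussian-perturbed copies of x."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_samples):
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        scores = ensemble_scores(heads, noisy)
        votes[max(range(len(scores)), key=scores.__getitem__)] += 1
    return votes.most_common(1)[0][0], votes


def certified_radius(p_lower, sigma):
    """Standard RS L2 radius sigma * Phi^{-1}(p_lower); requires p_lower < 1."""
    if p_lower <= 0.5:
        return 0.0  # no certificate when the top class is not a clear majority
    return sigma * NormalDist().inv_cdf(p_lower)


# Two toy heads sharing one input; a real model would share a backbone instead.
heads = [
    [[1.0, 0.0], [-1.0, 0.0]],
    [[0.8, 0.2], [-0.8, -0.2]],
]
label, votes = smoothed_predict(heads, [2.0, 0.0])
print(label, certified_radius(0.9, 0.5))
```

In practice `p_lower` would be a statistical lower confidence bound on the top-class vote share, not the raw vote fraction used for the prediction.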

arXiv information

Authors	Kun Fang, Qinghua Tao, Yingwen Wu, Tao Li, Xiaolin Huang, Jie Yang
Published	2025-04-11 12:47:12+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google

Categories: cs.CV, cs.LG

Embodied Image Captioning: Self-supervised Learning Agents for Spatially Coherent Image Descriptions

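Posted on: April 14, 2025 | Author: jarxiv

Summary

We present a self-supervised method that improves an agent's ability to describe arbitrary objects while it actively explores a generic environment.
This is a challenging problem because current models struggle to produce coherent image captions under varying camera viewpoints and clutter.
We propose a three-phase framework for fine-tuning existing captioning models that improves caption accuracy and consistency across views through a consensus mechanism.
First, an agent explores the environment and collects noisy image-caption pairs.
Next, a consistent pseudo-caption for each object instance is distilled by consensus using a large language model.
Finally, these pseudo-captions are used to fine-tune an off-the-shelf captioning model, with contrastive learning added.
We analyze the performance of combinations of captioning models, exploration policies, pseudo-labeling methods, and fine-tuning strategies on our manually labeled test set.
The results show that a policy can be trained to mine samples with higher disagreement than classical baselines.
Combined with any of the policies, our pseudo-captioning method achieves higher semantic similarity than other existing methods, and fine-tuning improves caption accuracy and consistency by a significant margin.
Code and test set annotations are available at https://hsp-iit.github.io/embodied-captioning/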

Summary (original)

We present a self-supervised method to improve an agent’s abilities in describing arbitrary objects while actively exploring a generic environment. This is a challenging problem, as current models struggle to obtain coherent image captions due to different camera viewpoints and clutter. We propose a three-phase framework to fine-tune existing captioning models that enhances caption accuracy and consistency across views via a consensus mechanism. First, an agent explores the environment, collecting noisy image-caption pairs. Then, a consistent pseudo-caption for each object instance is distilled via consensus using a large language model. Finally, these pseudo-captions are used to fine-tune an off-the-shelf captioning model, with the addition of contrastive learning. We analyse the performance of the combination of captioning models, exploration policies, pseudo-labeling methods, and fine-tuning strategies, on our manually labeled test set. Results show that a policy can be trained to mine samples with higher disagreement compared to classical baselines. Our pseudo-captioning method, in combination with all policies, has a higher semantic similarity compared to other existing methods, and fine-tuning improves caption accuracy and consistency by a significant margin. Code and test set annotations available at https://hsp-iit.github.io/embodied-captioning/

arXiv information

Authors	Tommaso Galliena, Tommaso Apicella, Stefano Rosa, Pietro Morerio, Alessio Del Bue, Lorenzo Natale
Published	2025-04-11 13:41:17+00:00
arXiv site	arxiv_id(pdf)

Categories: cs.CV, cs.RO

Datasets for Lane Detection in Autonomous Driving: A Comprehensive Review

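Posted on: April 14, 2025 | Author: jarxiv

Summary

Accurate lane detection is essential for automated driving, enabling safe and reliable vehicle navigation across a variety of road scenarios.
Numerous datasets have been introduced to support the development and evaluation of lane detection algorithms, each differing in data volume, sensor types, annotation granularity, environmental conditions, and scenario diversity.
This paper provides a comprehensive review of more than 30 publicly available lane detection datasets, systematically analyzing their characteristics, advantages, and limitations.
The datasets are classified by key factors such as sensor resolution, annotation type, and the diversity of road and weather conditions.
By identifying existing challenges and research gaps, the review highlights opportunities for future dataset improvements that can further drive innovation in robust lane detection.
The survey serves as a resource for researchers seeking suitable lane detection datasets and contributes to the broader goal of advancing autonomous driving.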

Summary (original)

Accurate lane detection is essential for automated driving, enabling safe and reliable vehicle navigation in a variety of road scenarios. Numerous datasets have been introduced to support the development and evaluation of lane detection algorithms, each differing in terms of the amount of data, sensor types, annotation granularity, environmental conditions, and scenario diversity. This paper provides a comprehensive review of over 30 publicly available lane detection datasets, systematically analysing their characteristics, advantages and limitations. We classify these datasets based on key factors such as sensor resolution, annotation types and diversity of road and weather conditions. By identifying existing challenges and research gaps, we highlight opportunities for future dataset improvements that can further drive innovation in robust lane detection. This survey serves as a resource for researchers seeking appropriate datasets for lane detection, and contributes to the broader goal of advancing autonomous driving.

arXiv information

Authors	Jörg Gamerdinger, Sven Teufel, Oliver Bringmann
Published	2025-04-11 13:54:04+00:00
arXiv site	arxiv_id(pdf)

Categories: cs.CV

Digital Twin Catalog: A Large-Scale Photorealistic 3D Object Digital Twin Dataset

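Posted on: April 14, 2025 | Author: jarxiv

Summary

We introduce Digital Twin Catalog (DTC), a new large-scale photorealistic 3D object digital twin dataset.
A digital twin of a 3D object is a highly detailed, virtually indistinguishable representation of a physical object that accurately captures its shape, appearance, physical properties, and other attributes.
Recent advances in neural 3D reconstruction and inverse rendering have significantly improved the quality of 3D object reconstruction.
Despite these advances, there is still no large-scale, digital-twin-quality real-world dataset and benchmark that can quantitatively assess and compare different reconstruction methods and improve reconstruction quality through training or fine-tuning.
Moreover, democratizing 3D digital twin creation requires integrating creation techniques with next-generation egocentric computing platforms such as AR glasses.
Currently, no dataset exists for evaluating 3D object reconstruction from egocentrically captured images.
To address these gaps, the DTC dataset features 2,000 scanned digital-twin-quality 3D objects, along with image sequences captured under different lighting conditions using DSLR cameras and egocentric AR glasses.
The dataset establishes the first comprehensive real-world evaluation benchmark for 3D digital twin creation tasks and offers a robust foundation for comparing and improving existing reconstruction methods.
The DTC dataset has already been released at https://www.projectaria.com/datasets/dtc/, and the baseline evaluations will also be open-sourced.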

Summary (original)

We introduce Digital Twin Catalog (DTC), a new large-scale photorealistic 3D object digital twin dataset. A digital twin of a 3D object is a highly detailed, virtually indistinguishable representation of a physical object, accurately capturing its shape, appearance, physical properties, and other attributes. Recent advances in neural-based 3D reconstruction and inverse rendering have significantly improved the quality of 3D object reconstruction. Despite these advancements, there remains a lack of a large-scale, digital twin quality real-world dataset and benchmark that can quantitatively assess and compare the performance of different reconstruction methods, as well as improve reconstruction quality through training or fine-tuning. Moreover, to democratize 3D digital twin creation, it is essential to integrate creation techniques with next-generation egocentric computing platforms, such as AR glasses. Currently, there is no dataset available to evaluate 3D object reconstruction using egocentric captured images. To address these gaps, the DTC dataset features 2,000 scanned digital twin-quality 3D objects, along with image sequences captured under different lighting conditions using DSLR cameras and egocentric AR glasses. This dataset establishes the first comprehensive real-world evaluation benchmark for 3D digital twin creation tasks, offering a robust foundation for comparing and improving existing reconstruction methods. The DTC dataset is already released at https://www.projectaria.com/datasets/dtc/ and we will also make the baseline evaluations open-source.

arXiv information

Authors	Zhao Dong, Ka Chen, Zhaoyang Lv, Hong-Xing Yu, Yunzhi Zhang, Cheng Zhang, Yufeng Zhu, Stephen Tian, Zhengqin Li, Geordie Moffatt, Sean Christofferson, James Fort, Xiaqing Pan, Mingfei Yan, Jiajun Wu, Carl Yuheng Ren, Richard Newcombe
Published	2025-04-11 13:54:19+00:00
arXiv site	arxiv_id(pdf)

Categories: cs.AI, cs.CV, cs.GR, cs.RO

Discriminator-Free Direct Preference Optimization for Video Diffusion

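Posted on: April 14, 2025 | Author: jarxiv

Summary

Direct Preference Optimization (DPO), which aligns models with human preferences through win/lose data pairs, has achieved remarkable success in language and image generation.
However, applying DPO to video diffusion models faces two critical challenges: (1) Data inefficiency.
Generating thousands of videos per DPO iteration incurs prohibitive costs.
(2) Evaluation uncertainty.
Human annotations suffer from subjective bias, and automated discriminators fail to detect subtle temporal artifacts such as flickering or motion incoherence.
To address these, we propose a discriminator-free video DPO framework that (1) uses original real videos as win cases and their edited versions (e.g., reversed, shuffled, or noise-corrupted clips) as lose cases, and
(2) trains video diffusion models to distinguish and avoid the artifacts introduced by editing.
This approach eliminates the need for costly synthetic video comparisons, provides unambiguous quality signals, and allows unlimited expansion of the training data through simple editing operations.
We theoretically prove the framework's effectiveness even when real videos and model-generated videos follow different distributions.
Experiments on CogVideoX demonstrate the efficiency of the proposed method.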

Summary (original)

Direct Preference Optimization (DPO), which aligns models with human preferences through win/lose data pairs, has achieved remarkable success in language and image generation. However, applying DPO to video diffusion models faces critical challenges: (1) Data inefficiency. Generating thousands of videos per DPO iteration incurs prohibitive costs; (2) Evaluation uncertainty. Human annotations suffer from subjective bias, and automated discriminators fail to detect subtle temporal artifacts like flickering or motion incoherence. To address these, we propose a discriminator-free video DPO framework that: (1) Uses original real videos as win cases and their edited versions (e.g., reversed, shuffled, or noise-corrupted clips) as lose cases; (2) Trains video diffusion models to distinguish and avoid artifacts introduced by editing. This approach eliminates the need for costly synthetic video comparisons, provides unambiguous quality signals, and enables unlimited training data expansion through simple editing operations. We theoretically prove the framework’s effectiveness even when real videos and model-generated videos follow different distributions. Experiments on CogVideoX demonstrate the efficiency of the proposed method.
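The win/lose construction described in this abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a clip is modeled as a list of frames (each a list of pixel values), the helper names are hypothetical, and `dpo_loss` is the generic DPO objective on log-likelihoods, not the diffusion-specific loss the paper may use.

```python
import math
import random


def make_lose_cases(frames, noise_std=0.1, seed=0):
    """Build edited (lose) versions of a real clip; the original clip is the win case."""
    rng = random.Random(seed)
    reversed_clip = frames[::-1]       # temporal reversal
    shuffled_clip = frames[:]          # copy, then permute the frame order
    rng.shuffle(shuffled_clip)
    noisy_clip = [[p + rng.gauss(0.0, noise_std) for p in frame] for frame in frames]
    return {"reversed": reversed_clip, "shuffled": shuffled_clip, "noisy": noisy_clip}


def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Generic DPO loss: -log sigmoid(beta * (win log-ratio - lose log-ratio))."""
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    return math.log1p(math.exp(-margin))


# A toy four-frame clip with one pixel per frame.
clip = [[0.0], [1.0], [2.0], [3.0]]
lose_cases = make_lose_cases(clip)
print(lose_cases["reversed"])

# The loss is lower when the trained model favors the real clip over the edit.
print(dpo_loss(-1.0, -5.0, -2.0, -2.0), dpo_loss(-5.0, -1.0, -2.0, -2.0))
```

Because the lose cases come from cheap edits of real footage, no video has to be generated or scored by a discriminator to form a training pair, which is the efficiency argument the abstract makes.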

arXiv information

Authors	Haoran Cheng, Qide Dong, Liang Peng, Zhizhou Sha, Weiguo Feng, Jinghui Xie, Zhao Song, Shilei Wen, Xiaofei He, Boxi Wu
Published	2025-04-11 13:55:48+00:00
arXiv site	arxiv_id(pdf)

Categories: cs.CV

Standing on the Shoulders of Giants: Reprogramming Visual-Language Model for General Deepfake Detection

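Posted on: April 14, 2025 | Author: jarxiv

Summary

The proliferation of deepfake faces poses a large potential negative impact on our daily lives.
Despite substantial progress in deepfake detection in recent years, existing methods still generalize poorly to forgeries from unseen datasets or created by emerging generative models.
Inspired by the zero-shot advantages of vision-language models (VLMs), this paper proposes a novel approach that repurposes a well-trained VLM for general deepfake detection.
Motivated by the model reprogramming paradigm, which manipulates model predictions through input perturbations, the method reprograms a pre-trained VLM (e.g., CLIP) solely by manipulating its input, without tuning its internal parameters.
First, learnable visual perturbations are used to refine feature extraction for deepfake detection.
Next, information from face embeddings is exploited to create sample-level adaptive text prompts, improving performance.
Extensive experiments on several popular benchmark datasets show that (1) the cross-dataset and cross-manipulation performance of deepfake detection can be significantly and consistently improved (e.g., over 88% AUC in the cross-dataset setting from FF++ to WildDeepfake);
(2) this superior performance is achieved with fewer trainable parameters, making it a promising approach for real-world applications.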

Summary (original)

The proliferation of deepfake faces poses huge potential negative impacts on our daily lives. Despite substantial advancements in deepfake detection over these years, the generalizability of existing methods against forgeries from unseen datasets or created by emerging generative models remains constrained. In this paper, inspired by the zero-shot advantages of Vision-Language Models (VLMs), we propose a novel approach that repurposes a well-trained VLM for general deepfake detection. Motivated by the model reprogramming paradigm that manipulates the model prediction via input perturbations, our method can reprogram a pre-trained VLM model (e.g., CLIP) solely based on manipulating its input without tuning the inner parameters. First, learnable visual perturbations are used to refine feature extraction for deepfake detection. Then, we exploit information of face embedding to create sample-level adaptative text prompts, improving the performance. Extensive experiments on several popular benchmark datasets demonstrate that (1) the cross-dataset and cross-manipulation performances of deepfake detection can be significantly and consistently improved (e.g., over 88% AUC in cross-dataset setting from FF++ to WildDeepfake); (2) the superior performances are achieved with fewer trainable parameters, making it a promising approach for real-world applications.

arXiv information

Authors	Kaiqing Lin, Yuzhen Lin, Weixiang Li, Taiping Yao, Bin Li
Published	2025-04-11 13:57:48+00:00
arXiv site	arxiv_id(pdf)

Categories: cs.CV

COP-GEN-Beta: Unified Generative Modelling of COPernicus Imagery Thumbnails

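Posted on: April 14, 2025 | Author: jarxiv

Summary

In remote sensing, multi-modal data from various sensors capturing the same scene offers rich opportunities, but learning a unified representation across these modalities remains a significant challenge.
Traditional methods have often been limited to single- or dual-modality approaches.
This paper introduces COP-GEN-Beta, a generative diffusion model trained on optical, radar, and elevation data from the Major TOM dataset.
What sets COP-GEN-Beta apart is its ability to map any subset of modalities to any other, enabling zero-shot modality translation after training.
This is achieved with a sequence-based diffusion transformer in which each modality is controlled by its own timestep embedding.
COP-GEN-Beta is evaluated extensively on thumbnail images from the Major TOM dataset, demonstrating its effectiveness in generating high-quality samples.
Qualitative and quantitative evaluations validate the model's performance and highlight its potential as a powerful pre-trained model for future remote sensing tasks.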

Summary (original)

In remote sensing, multi-modal data from various sensors capturing the same scene offers rich opportunities, but learning a unified representation across these modalities remains a significant challenge. Traditional methods have often been limited to single or dual-modality approaches. In this paper, we introduce COP-GEN-Beta, a generative diffusion model trained on optical, radar, and elevation data from the Major TOM dataset. What sets COP-GEN-Beta apart is its ability to map any subset of modalities to any other, enabling zero-shot modality translation after training. This is achieved through a sequence-based diffusion transformer, where each modality is controlled by its own timestep embedding. We extensively evaluate COP-GEN-Beta on thumbnail images from the Major TOM dataset, demonstrating its effectiveness in generating high-quality samples. Qualitative and quantitative evaluations validate the model’s performance, highlighting its potential as a powerful pre-trained model for future remote sensing tasks.

arXiv information

Authors	Miguel Espinosa, Valerio Marsocci, Yuru Jia, Elliot J. Crowley, Mikolaj Czerkawski
Published	2025-04-11 14:00:46+00:00
arXiv site	arxiv_id(pdf)

Categories: cs.CV, cs.GR

Proxy-Anchor and EVT-Driven Continual Learning Method for Generalized Category Discovery

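Posted on: April 14, 2025 | Author: jarxiv

Summary

Continual generalized category discovery has been introduced and studied in the literature as a method that aims to continuously discover and learn novel categories in incoming data batches while avoiding catastrophic forgetting of previously learned categories.
A key component in addressing this challenge is the model's ability to separate novel samples, for which Extreme Value Theory (EVT) has been employed effectively.
This work proposes a novel method that integrates EVT with proxy anchors to define boundaries around proxies using a probability-of-inclusion function, enabling the rejection of unknown samples.
It also introduces a novel EVT-based loss function that enhances the learned representation and achieves superior performance compared with other deep metric learning methods in similar settings.
Using the derived probability functions, novel samples are effectively separated from previously known categories.
However, category discovery within these novel samples can sometimes overestimate the number of new categories.
To mitigate this issue, a novel EVT-based approach is proposed to reduce the model size and discard redundant proxies.
Experience replay and knowledge distillation mechanisms are also incorporated during the continual learning stage to prevent catastrophic forgetting.
Experimental results show that the proposed approach outperforms state-of-the-art methods in continual generalized category discovery scenarios.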

Summary (original)

Continual generalized category discovery has been introduced and studied in the literature as a method that aims to continuously discover and learn novel categories in incoming data batches while avoiding catastrophic forgetting of previously learned categories. A key component in addressing this challenge is the model’s ability to separate novel samples, where Extreme Value Theory (EVT) has been effectively employed. In this work, we propose a novel method that integrates EVT with proxy anchors to define boundaries around proxies using a probability of inclusion function, enabling the rejection of unknown samples. Additionally, we introduce a novel EVT-based loss function to enhance the learned representation, achieving superior performance compared to other deep-metric learning methods in similar settings. Using the derived probability functions, novel samples are effectively separated from previously known categories. However, category discovery within these novel samples can sometimes overestimate the number of new categories. To mitigate this issue, we propose a novel EVT-based approach to reduce the model size and discard redundant proxies. We also incorporate experience replay and knowledge distillation mechanisms during the continual learning stage to prevent catastrophic forgetting. Experimental results demonstrate that our proposed approach outperforms state-of-the-art methods in continual generalized category discovery scenarios.
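To make the idea of a probability-of-inclusion boundary around each proxy more concrete, here is a minimal Python sketch. The abstract does not give the functional form, so the Weibull-style decay `exp(-(d / scale) ** shape)` used here is an assumption borrowed from common EVT-based open-set methods, and all names, parameters, and the threshold are hypothetical.

```python
import math


def inclusion_probability(distance, scale, shape):
    """Assumed Weibull-style probability that a point at this distance belongs to a proxy."""
    return math.exp(-((distance / scale) ** shape))


def classify_or_reject(sample, proxies, threshold=0.5):
    """Assign the sample to its most probable proxy, or flag it as novel.

    proxies maps a label to (center, scale, shape); in a real system the scale
    and shape would be fitted per proxy from extreme distances in the training data.
    """
    best_label, best_prob = None, 0.0
    for label, (center, scale, shape) in proxies.items():
        prob = inclusion_probability(math.dist(sample, center), scale, shape)
        if prob > best_prob:
            best_label, best_prob = label, prob
    if best_prob < threshold:
        return "novel", best_prob  # rejected by every known proxy
    return best_label, best_prob


# Two hypothetical known categories in a 2-D embedding space.
proxies = {
    "cat": ((0.0, 0.0), 1.0, 2.0),
    "dog": ((5.0, 0.0), 1.0, 2.0),
}
print(classify_or_reject((0.1, 0.0), proxies))
print(classify_or_reject((2.5, 0.0), proxies))
```

Samples rejected by every proxy would then be passed to the category discovery step, which is where the abstract notes that the number of new categories can be overestimated.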

arXiv information

Authors	Alireza Fathalizadeh, Roozbeh Razavi-Far
Published	2025-04-11 14:01:49+00:00
arXiv site	arxiv_id(pdf)

Categories: cs.AI, cs.CV

Shadow Erosion and Nighttime Adaptability for Camera-Based Automated Driving Applications

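Posted on: April 14, 2025 | Author: jarxiv

Summary

Enhancement of images from RGB cameras is of particular interest because of its wide and growing range of applications, such as medical imaging, satellite imaging, and automated driving. In autonomous driving, various techniques are used to enhance image quality under challenging lighting conditions.
These include artificial augmentation to improve visibility in poor nighttime conditions, illumination-invariant imaging to reduce the impact of lighting variations, and shadow mitigation to ensure consistent image clarity in bright daylight.
This paper proposes a pipeline for Shadow Erosion and Nighttime Adaptability in images for automated driving applications that preserves color and texture details.
The Shadow Erosion and Nighttime Adaptability pipeline is compared with the widely used CLAHE technique and evaluated on illumination uniformity and visual perception quality metrics.
The results also show a significant improvement over CLAHE, enhancing a YOLO-based drivable area segmentation algorithm.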

Summary (original)

Enhancement of images from RGB cameras is of particular interest due to its wide range of ever-increasing applications such as medical imaging, satellite imaging, automated driving, etc. In autonomous driving, various techniques are used to enhance image quality under challenging lighting conditions. These include artificial augmentation to improve visibility in poor nighttime conditions, illumination-invariant imaging to reduce the impact of lighting variations, and shadow mitigation to ensure consistent image clarity in bright daylight. This paper proposes a pipeline for Shadow Erosion and Nighttime Adaptability in images for automated driving applications while preserving color and texture details. The Shadow Erosion and Nighttime Adaptability pipeline is compared to the widely used CLAHE technique and evaluated based on illumination uniformity and visual perception quality metrics. The results also demonstrate a significant improvement over CLAHE, enhancing a YOLO-based drivable area segmentation algorithm.

arXiv information

Authors	Mohamed Sabry, Gregory Schroeder, Joshua Varughese, Cristina Olaverri-Monreal
Published	2025-04-11 14:02:11+00:00
arXiv site	arxiv_id(pdf)

Categories: cs.CV

F-LMM: Grounding Frozen Large Multimodal Models

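Posted on: April 14, 2025 | Author: jarxiv

Summary

Endowing large multimodal models (LMMs) with visual grounding capability can significantly enhance AIs' understanding of the visual world and their interaction with humans.
However, existing methods typically fine-tune the parameters of LMMs to learn additional segmentation tokens and overfit to grounding and segmentation datasets.
Such a design inevitably causes a catastrophic loss of the conversational capability that is indispensable to general AI assistants.
This paper comprehensively evaluates state-of-the-art grounding LMMs on a suite of multimodal question-answering benchmarks and observes drastic performance drops that indicate vanishing general knowledge comprehension and weakened instruction-following ability.
To address this issue, we present F-LMM (grounding frozen off-the-shelf LMMs in human-AI conversations), a straightforward yet effective design based on the fact that the word-pixel correspondences conducive to visual grounding inherently exist in the attention mechanism of well-trained LMMs.
Using only a few trainable CNN layers, word-pixel attention weights can be translated into mask logits, which a SAM-based mask refiner can further optimize.
F-LMM neither learns special segmentation tokens nor uses high-quality grounded instruction-tuning data, yet it achieves competitive performance on referring expression segmentation and panoptic narrative grounding benchmarks while completely preserving the LMMs' original conversational ability.
In addition, with instruction-following ability preserved and grounding ability gained, F-LMM can be applied directly to complex tasks such as reasoning segmentation, grounded conversation generation, and visual chain-of-thought reasoning.
The code is available at https://github.com/wusize/F-LMM.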

Summary (original)

Endowing Large Multimodal Models (LMMs) with visual grounding capability can significantly enhance AIs’ understanding of the visual world and their interaction with humans. However, existing methods typically fine-tune the parameters of LMMs to learn additional segmentation tokens and overfit grounding and segmentation datasets. Such a design would inevitably cause a catastrophic diminution in the indispensable conversational capability of general AI assistants. In this paper, we comprehensively evaluate state-of-the-art grounding LMMs across a suite of multimodal question-answering benchmarks, observing drastic performance drops that indicate vanishing general knowledge comprehension and weakened instruction following ability. To address this issue, we present F-LMM — grounding frozen off-the-shelf LMMs in human-AI conversations — a straightforward yet effective design based on the fact that word-pixel correspondences conducive to visual grounding inherently exist in the attention mechanism of well-trained LMMs. Using only a few trainable CNN layers, we can translate word-pixel attention weights to mask logits, which a SAM-based mask refiner can further optimise. Our F-LMM neither learns special segmentation tokens nor utilises high-quality grounded instruction-tuning data, but achieves competitive performance on referring expression segmentation and panoptic narrative grounding benchmarks while completely preserving LMMs’ original conversational ability. Additionally, with instruction-following ability preserved and grounding ability obtained, F-LMM can be directly applied to complex tasks like reasoning segmentation, grounded conversation generation and visual chain-of-thought reasoning. Our code can be found at https://github.com/wusize/F-LMM.

arXiv information

Authors	Size Wu, Sheng Jin, Wenwei Zhang, Lumin Xu, Wentao Liu, Wei Li, Chen Change Loy
Published	2025-04-11 14:21:24+00:00
arXiv site	arxiv_id(pdf)

Categories: cs.CV
