jarxiv | Japanese arxiv | Page 1314

MetaScale: Test-Time Scaling with Evolving Meta-Thoughts

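Posted: March 18, 2025 | Author: jarxiv

Summary

One critical challenge for large language models (LLMs) in complex reasoning is that they rely on matching reasoning patterns from training data rather than proactively selecting the most appropriate cognitive strategy for a given task.
Existing approaches impose fixed cognitive structures that improve performance on specific tasks but lack adaptability across diverse scenarios.
To address this limitation, we introduce MetaScale, a test-time scaling framework based on meta-thoughts: adaptive thinking strategies tailored to each task.
MetaScale initializes a pool of candidate meta-thoughts, then iteratively selects and evaluates them using a multi-armed bandit algorithm with upper confidence bound selection, guided by a reward model.
To further enhance adaptability, a genetic algorithm evolves high-reward meta-thoughts, refining and extending the strategy pool over time.
By dynamically proposing and optimizing meta-thoughts at inference time, MetaScale improves both accuracy and generalization across a wide range of tasks.
Experimental results show that MetaScale consistently outperforms standard inference approaches, achieving an 11% gain in win rate on Arena-Hard for GPT-4o and surpassing o1-mini by 0.9% under style control.
Notably, MetaScale scales more effectively with increasing sampling budgets and produces more structured, expert-level responses.

Summary (original)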

One critical challenge for large language models (LLMs) for making complex reasoning is their reliance on matching reasoning patterns from training data, instead of proactively selecting the most appropriate cognitive strategy to solve a given task. Existing approaches impose fixed cognitive structures that enhance performance in specific tasks but lack adaptability across diverse scenarios. To address this limitation, we introduce METASCALE, a test-time scaling framework based on meta-thoughts — adaptive thinking strategies tailored to each task. METASCALE initializes a pool of candidate meta-thoughts, then iteratively selects and evaluates them using a multi-armed bandit algorithm with upper confidence bound selection, guided by a reward model. To further enhance adaptability, a genetic algorithm evolves high-reward meta-thoughts, refining and extending the strategy pool over time. By dynamically proposing and optimizing meta-thoughts at inference time, METASCALE improves both accuracy and generalization across a wide range of tasks. Experimental results demonstrate that MetaScale consistently outperforms standard inference approaches, achieving an 11% performance gain in win rate on Arena-Hard for GPT-4o, surpassing o1-mini by 0.9% under style control. Notably, METASCALE scales more effectively with increasing sampling budgets and produces more structured, expert-level responses.
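
The selection step described in the abstract (a multi-armed bandit with upper confidence bound selection, guided by a reward model) can be illustrated with a minimal UCB1 sketch. This is not the authors' implementation: the strategy names, the stand-in reward function, and the exploration constant `c` are all assumptions made for illustration, and the genetic-algorithm step that evolves the pool is left out.

```python
import math
import random


def ucb_select(counts, totals, step, c=1.4):
    """Return the index of the arm with the highest UCB1 score."""
    for i, n in enumerate(counts):
        if n == 0:  # try every strategy once before trusting the bound
            return i
    scores = [
        totals[i] / counts[i] + c * math.sqrt(math.log(step) / counts[i])
        for i in range(len(counts))
    ]
    return max(range(len(scores)), key=scores.__getitem__)


def run_bandit(strategies, reward_fn, budget, c=1.4):
    """Spend `budget` samples across strategies; return pull counts and mean rewards."""
    counts = [0] * len(strategies)
    totals = [0.0] * len(strategies)
    for step in range(1, budget + 1):
        i = ucb_select(counts, totals, step, c)
        totals[i] += reward_fn(strategies[i])
        counts[i] += 1
    means = [t / n if n else 0.0 for t, n in zip(totals, counts)]
    return counts, means


# Hypothetical meta-thoughts and a stand-in reward model (fixed quality plus noise).
rng = random.Random(0)
quality = {"decompose": 0.3, "verify-then-answer": 0.8, "analogize": 0.5}
strategies = list(quality)
counts, means = run_bandit(
    strategies, lambda s: quality[s] + rng.uniform(-0.1, 0.1), budget=200
)
best = strategies[max(range(len(means)), key=means.__getitem__)]
```

In this toy setting most of the sampling budget flows to the strategy with the highest average reward, while the confidence term keeps the weaker strategies from being abandoned after a single unlucky sample.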

arXiv information

Authors: Qin Liu, Wenxuan Zhou, Nan Xu, James Y. Huang, Fei Wang, Sheng Zhang, Hoifung Poon, Muhao Chen
Published: 2025-03-17 17:59:54+00:00
arXiv site: arxiv_id(pdf)

Sources and services used

arxiv.jp, Google

Categories: cs.AI, cs.CL, cs.LG

Gradient Extrapolation for Debiased Representation Learning

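Posted: March 18, 2025 | Author: jarxiv

Summary

Machine learning classification models trained with empirical risk minimization (ERM) often inadvertently rely on spurious correlations.
When absent from the test data, these unintended associations between non-target attributes and target labels lead to poor generalization.
This paper addresses the problem from a model optimization perspective and proposes Gradient Extrapolation for Debiased Representation Learning (GERNE), a method designed to learn debiased representations in both known and unknown attribute training cases.
GERNE uses two distinct batches with different amounts of spurious correlation and defines the target gradient as the linear extrapolation of the two gradients computed from each batch's loss.
It is demonstrated that the extrapolated gradient, if directed toward the gradient of the batch with the smaller amount of spurious correlation, can guide training toward a debiased model.
GERNE can serve as a general framework for debiasing, with methods such as ERM, reweighting, and resampling shown to be special cases.
Theoretical upper and lower bounds on the extrapolation factor are derived to ensure convergence.
By adjusting this factor, GERNE can be adapted to maximize either Group-Balanced Accuracy (GBA) or Worst-Group Accuracy.
The proposed approach is validated on five vision benchmarks and one NLP benchmark, demonstrating competitive and often superior performance compared to state-of-the-art baseline methods.

Summary (original)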

Machine learning classification models trained with empirical risk minimization (ERM) often inadvertently rely on spurious correlations. When absent in the test data, these unintended associations between non-target attributes and target labels lead to poor generalization. This paper addresses this problem from a model optimization perspective and proposes a novel method, Gradient Extrapolation for Debiased Representation Learning (GERNE), designed to learn debiased representations in both known and unknown attribute training cases. GERNE uses two distinct batches with different amounts of spurious correlations to define the target gradient as the linear extrapolation of two gradients computed from each batch’s loss. It is demonstrated that the extrapolated gradient, if directed toward the gradient of the batch with fewer amount of spurious correlation, can guide the training process toward learning a debiased model. GERNE can serve as a general framework for debiasing with methods, such as ERM, reweighting, and resampling, being shown as special cases. The theoretical upper and lower bounds of the extrapolation factor are derived to ensure convergence. By adjusting this factor, GERNE can be adapted to maximize the Group-Balanced Accuracy (GBA) or the Worst-Group Accuracy. The proposed approach is validated on five vision and one NLP benchmarks, demonstrating competitive and often superior performance compared to state-of-the-art baseline methods.
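
To make the idea of a linearly extrapolated target gradient concrete, here is a small sketch. The abstract does not give the exact parameterization, so the form `g_biased + c * (g_less_biased - g_biased)`, the factor name `c`, and the example gradient values are assumptions for illustration only.

```python
def extrapolate_gradient(g_biased, g_less_biased, c):
    """Start at g_biased and move c steps along the direction toward g_less_biased."""
    return [gb + c * (gl - gb) for gb, gl in zip(g_biased, g_less_biased)]


def sgd_step(params, grad, lr):
    """Plain gradient-descent update using the target gradient."""
    return [p - lr * g for p, g in zip(params, grad)]


# Hypothetical gradients from two batches of the same model.
g_biased = [1.0, 2.0]        # batch with more spurious correlation
g_less_biased = [0.5, 3.0]   # batch with less spurious correlation

at_biased = extrapolate_gradient(g_biased, g_less_biased, c=0.0)       # equals g_biased
at_less_biased = extrapolate_gradient(g_biased, g_less_biased, c=1.0)  # equals g_less_biased
beyond = extrapolate_gradient(g_biased, g_less_biased, c=2.0)          # past the less biased gradient
params = sgd_step([0.0, 0.0], beyond, lr=0.1)
```

Under this assumed form, a factor between 0 and 1 interpolates between the two batch gradients, and a factor above 1 extrapolates past the less biased one. The abstract states that the paper derives bounds on the factor to ensure convergence; those bounds are not reproduced here.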

arXiv information

Authors: Ihab Asaad, Maha Shadaydeh, Joachim Denzler
Published: 2025-03-17 14:48:57+00:00
arXiv site: arxiv_id(pdf)

Categories: cs.CV, cs.LG

Sampling Innovation-Based Adaptive Compressive Sensing

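Posted: March 18, 2025 | Author: jarxiv

Summary

Scene-aware adaptive compressive sensing (ACS) has attracted significant interest because of its promising capability for efficient, high-fidelity acquisition of scene images.
ACS typically prescribes adaptive sampling allocation (ASA) based on previous samples, in the absence of ground truth.
However, when confronting unknown scenes, existing ACS methods often lack accurate judgment and robust feedback mechanisms for ASA, which limits high-fidelity sensing of the scene.
In this paper, we introduce a Sampling Innovation-Based ACS (SIB-ACS) method that effectively identifies challenging image reconstruction areas and allocates sampling to them, culminating in high-fidelity image reconstruction.
An innovation criterion is proposed to judge ASA by predicting the decrease in image reconstruction error attributable to sampling increments, thereby directing more samples toward regions where the reconstruction error diminishes significantly.
A sampling innovation-guided multi-stage adaptive sampling (AS) framework is proposed, which iteratively refines the ASA through a multi-stage feedback process.
For image reconstruction, we propose a Principal Component Compressed Domain Network (PCCD-Net), which efficiently and faithfully reconstructs images under AS scenarios.
Extensive experiments demonstrate that the proposed SIB-ACS method significantly outperforms state-of-the-art methods in image reconstruction fidelity and visual effects.
Code is available at https://github.com/giant-pandada/SIB-ACS_CVPR2025.

Summary (original)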

Scene-aware Adaptive Compressive Sensing (ACS) has attracted significant interest due to its promising capability for efficient and high-fidelity acquisition of scene images. ACS typically prescribes adaptive sampling allocation (ASA) based on previous samples in the absence of ground truth. However, when confronting unknown scenes, existing ACS methods often lack accurate judgment and robust feedback mechanisms for ASA, thus limiting the high-fidelity sensing of the scene. In this paper, we introduce a Sampling Innovation-Based ACS (SIB-ACS) method that can effectively identify and allocate sampling to challenging image reconstruction areas, culminating in high-fidelity image reconstruction. An innovation criterion is proposed to judge ASA by predicting the decrease in image reconstruction error attributable to sampling increments, thereby directing more samples towards regions where the reconstruction error diminishes significantly. A sampling innovation-guided multi-stage adaptive sampling (AS) framework is proposed, which iteratively refines the ASA through a multi-stage feedback process. For image reconstruction, we propose a Principal Component Compressed Domain Network (PCCD-Net), which efficiently and faithfully reconstructs images under AS scenarios. Extensive experiments demonstrate that the proposed SIB-ACS method significantly outperforms the state-of-the-art methods in terms of image reconstruction fidelity and visual effects. Codes are available at https://github.com/giant-pandada/SIB-ACS_CVPR2025.
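
The allocation idea in the abstract (send more samples to regions where extra sampling is predicted to reduce reconstruction error the most, and refine this over several stages) can be sketched as follows. This is a toy illustration, not the SIB-ACS method: the proportional split, the function names, and the stand-in gain predictor `error / (1 + samples)` are all assumptions.

```python
def allocate(predicted_gain, budget):
    """Split an integer sample budget across regions in proportion to predicted gain."""
    n = len(predicted_gain)
    total = sum(predicted_gain)
    if total <= 0:
        # No region is predicted to benefit: fall back to an even split.
        base, extra = divmod(budget, n)
        return [base + (1 if i < extra else 0) for i in range(n)]
    raw = [budget * g / total for g in predicted_gain]
    alloc = [int(r) for r in raw]
    # Hand the remaining samples to the regions with the largest fractional parts.
    leftover = max(0, budget - sum(alloc))
    order = sorted(range(n), key=lambda i: raw[i] - alloc[i], reverse=True)
    for i in order[:leftover]:
        alloc[i] += 1
    return alloc


def multi_stage(errors, stage_budgets):
    """Refine the allocation stage by stage using a stand-in gain predictor."""
    samples = [0] * len(errors)
    for budget in stage_budgets:
        # Assumed predictor: high-error, lightly sampled regions gain the most.
        gain = [e / (1 + s) for e, s in zip(errors, samples)]
        for i, extra in enumerate(allocate(gain, budget)):
            samples[i] += extra
    return samples


# Hypothetical per-region reconstruction errors and two sampling stages.
single_stage = allocate([2.0, 5.0, 1.0], budget=16)
samples = multi_stage([4.0, 1.0, 1.0], stage_budgets=[6, 6])
```

The hardest region receives the largest share in every stage, but its predicted gain shrinks as it accumulates samples, which is the feedback effect the multi-stage framework relies on.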

arXiv information

Authors: Zhifu Tian, Tao Hu, Chaoyang Niu, Di Wu, Shu Wang
Published: 2025-03-17 14:54:13+00:00
arXiv site: arxiv_id(pdf)

Categories: cs.CV, eess.IV

Don’t Judge Before You CLIP: A Unified Approach for Perceptual Tasks

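Posted: March 18, 2025 | Author: jarxiv

Summary

Visual perceptual tasks aim to predict human judgments of images (e.g., the emotions an image evokes, image quality assessment).
Unlike objective tasks such as object/scene recognition, perceptual tasks rely on subjective human assessments, which makes data labeling difficult.
The scarcity of such human-annotated data results in small datasets and poor generalization.
Typically, specialized models have been designed for each perceptual task, tailored to its unique characteristics and its own training dataset.
We propose a unified architectural framework for solving multiple different perceptual tasks that leverages CLIP as a prior.
Our approach is based on recent cognitive findings indicating that CLIP correlates well with human judgment.
While CLIP was explicitly trained to align images and text, it also implicitly learned human inclinations.
We attribute this to the inclusion of human-written image captions in CLIP's training data, which contain not only factual image descriptions but inevitably also human sentiments and emotions.
This makes CLIP a particularly strong prior for perceptual tasks.
Accordingly, we suggest that minimal adaptation of CLIP suffices for solving a variety of perceptual tasks.
Our simple unified framework employs a lightweight adaptation to fine-tune CLIP to each task, without requiring any task-specific architectural changes.
We evaluate our approach on three tasks: (i) image memorability prediction, (ii) no-reference image quality assessment, and (iii) visual emotion analysis.
Our model achieves state-of-the-art results on all three tasks while demonstrating improved generalization across different datasets.

Summary (original)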

Visual perceptual tasks aim to predict human judgment of images (e.g., emotions invoked by images, image quality assessment). Unlike objective tasks such as object/scene recognition, perceptual tasks rely on subjective human assessments, making its data-labeling difficult. The scarcity of such human-annotated data results in small datasets leading to poor generalization. Typically, specialized models were designed for each perceptual task, tailored to its unique characteristics and its own training dataset. We propose a unified architectural framework for solving multiple different perceptual tasks leveraging CLIP as a prior. Our approach is based on recent cognitive findings which indicate that CLIP correlates well with human judgment. While CLIP was explicitly trained to align images and text, it implicitly also learned human inclinations. We attribute this to the inclusion of human-written image captions in CLIP’s training data, which contain not only factual image descriptions, but inevitably also human sentiments and emotions. This makes CLIP a particularly strong prior for perceptual tasks. Accordingly, we suggest that minimal adaptation of CLIP suffices for solving a variety of perceptual tasks. Our simple unified framework employs a lightweight adaptation to fine-tune CLIP to each task, without requiring any task-specific architectural changes. We evaluate our approach on three tasks: (i) Image Memorability Prediction, (ii) No-reference Image Quality Assessment, and (iii) Visual Emotion Analysis. Our model achieves state-of-the-art results on all three tasks, while demonstrating improved generalization across different datasets.

arXiv information

Authors: Amit Zalcher, Navve Wasserman, Roman Beliy, Oliver Heinimann, Michal Irani
Published: 2025-03-17 15:15:31+00:00
arXiv site: arxiv_id(pdf)

Categories: cs.CV

FlexWorld: Progressively Expanding 3D Scenes for Flexiable-View Synthesis

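Posted: March 18, 2025 | Author: jarxiv

Summary

Generating flexible-view 3D scenes, including 360° rotation and zooming, from single images is challenging because of a lack of 3D data.
To this end, we introduce FlexWorld, a novel framework consisting of two key components: (1) a strong video-to-video (V2V) diffusion model that generates high-quality novel view images from incomplete input rendered from a coarse scene, and (2) a progressive expansion process that constructs a complete 3D scene.
In particular, by leveraging an advanced pre-trained video model and accurate depth-estimated training pairs, our V2V model can generate novel views under large camera pose variations.
Building on it, FlexWorld progressively generates new 3D content and integrates it into the global scene through geometry-aware scene fusion.
Extensive experiments demonstrate the effectiveness of FlexWorld in generating high-quality novel view videos and flexible-view 3D scenes from single images, achieving superior visual quality under multiple popular metrics and datasets compared to existing state-of-the-art methods.
Qualitatively, we highlight that FlexWorld can generate high-fidelity scenes with flexible views such as 360° rotations and zooming.
Project page: https://ml-gsai.github.io/FlexWorld.

Summary (original)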

Generating flexible-view 3D scenes, including 360° rotation and zooming, from single images is challenging due to a lack of 3D data. To this end, we introduce FlexWorld, a novel framework consisting of two key components: (1) a strong video-to-video (V2V) diffusion model to generate high-quality novel view images from incomplete input rendered from a coarse scene, and (2) a progressive expansion process to construct a complete 3D scene. In particular, leveraging an advanced pre-trained video model and accurate depth-estimated training pairs, our V2V model can generate novel views under large camera pose variations. Building upon it, FlexWorld progressively generates new 3D content and integrates it into the global scene through geometry-aware scene fusion. Extensive experiments demonstrate the effectiveness of FlexWorld in generating high-quality novel view videos and flexible-view 3D scenes from single images, achieving superior visual quality under multiple popular metrics and datasets compared to existing state-of-the-art methods. Qualitatively, we highlight that FlexWorld can generate high-fidelity scenes with flexible views like 360° rotations and zooming. Project page: https://ml-gsai.github.io/FlexWorld.

arXiv information

Authors: Luxi Chen, Zihan Zhou, Min Zhao, Yikai Wang, Ge Zhang, Wenhao Huang, Hao Sun, Ji-Rong Wen, Chongxuan Li
Published: 2025-03-17 15:18:38+00:00
arXiv site: arxiv_id(pdf)

Categories: cs.CV

Boundary Constraint-free Biomechanical Model-Based Surface Matching for Intraoperative Liver Deformation Correction

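Posted: March 18, 2025 | Author: jarxiv

Summary

In image-guided liver surgery, 3D-3D non-rigid registration methods play a crucial role in estimating the mapping between the preoperative model and the intraoperative surface, represented as point clouds, and address the challenge of tissue deformation.
Typically, these methods incorporate a biomechanical model, represented as a finite element model (FEM), into a strain energy term that regularizes a surface matching term.
We propose a 3D-3D non-rigid registration method that incorporates a modified FEM into the surface matching term.
The modified FEM alleviates the need to specify boundary conditions, which is achieved by modifying the stiffness matrix of the FEM and using diagonal loading for stabilization.
As a result, the modified surface matching term requires neither the specification of boundary conditions nor an additional strain energy term for regularization.
Optimization is carried out with an accelerated gradient algorithm, further enhanced by our proposed method for determining the optimal step size.
We evaluated our method and compared it with several state-of-the-art methods across various datasets.
Our straightforward and effective approach consistently outperformed the state-of-the-art methods or achieved comparable performance.
Our code and datasets are available at https://github.com/zixinyang9109/BCF-FEM.

Summary (original)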

In image-guided liver surgery, 3D-3D non-rigid registration methods play a crucial role in estimating the mapping between the preoperative model and the intraoperative surface represented as point clouds, addressing the challenge of tissue deformation. Typically, these methods incorporate a biomechanical model, represented as a finite element model (FEM), into the strain energy term to regularize a surface matching term. We propose a 3D-3D non-rigid registration method that incorporates a modified FEM into the surface matching term. The modified FEM alleviates the need to specify boundary conditions, which is achieved by modifying the stiffness matrix of a FEM and using diagonal loading for stabilization. As a result, the modified surface matching term does not require the specification of boundary conditions or an additional strain energy term to regularize the surface matching term. Optimization is achieved through an accelerated gradient algorithm, further enhanced by our proposed method for determining the optimal step size. We evaluated our method and compared it to several state-of-the-art methods across various datasets. Our straightforward and effective approach consistently outperformed or achieved comparable performance to the state-of-the-art methods. Our code and datasets are available at https://github.com/zixinyang9109/BCF-FEM.
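
Why diagonal loading helps when no boundary conditions are given can be seen on a toy system. The sketch below uses a single free-floating spring with two nodes, whose stiffness matrix is singular; adding a small multiple of the identity makes it solvable. This only illustrates diagonal loading in general: the spring constant, the load, and the value of `lam` are made up, and the paper's actual modification of the FEM stiffness matrix is not described in the abstract.

```python
def det2(m):
    """Determinant of a 2x2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def solve2(m, f):
    """Solve a 2x2 linear system with Cramer's rule."""
    d = det2(m)
    if d == 0:
        raise ValueError("singular matrix: no unique solution")
    return [
        (f[0] * m[1][1] - m[0][1] * f[1]) / d,
        (m[0][0] * f[1] - f[0] * m[1][0]) / d,
    ]


def diagonal_load(m, lam):
    """Return m + lam * I."""
    n = len(m)
    return [[m[i][j] + (lam if i == j else 0.0) for j in range(n)] for i in range(n)]


k = 2.0
K = [[k, -k], [-k, k]]   # unconstrained spring: rigid translation costs no energy
f = [1.0, -1.0]          # equal and opposite nodal forces

K_loaded = diagonal_load(K, lam=0.5)
u = solve2(K_loaded, f)  # displacements of the two nodes
residual = [
    K_loaded[i][0] * u[0] + K_loaded[i][1] * u[1] - f[i] for i in range(2)
]
```

Calling `solve2(K, f)` on the unloaded matrix raises `ValueError`, because the system has no unique solution without fixing a node. The loaded system returns equal and opposite displacements, at the cost of a slight stiffening that grows with `lam`.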

arXiv information

Authors: Zixin Yang, Richard Simon, Kelly Merrell, Cristian A. Linte
Published: 2025-03-17 15:19:09+00:00
arXiv site: arxiv_id(pdf)

Categories: cs.CV, eess.IV

Generative Gaussian Splatting: Generating 3D Scenes with Video Diffusion Priors

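Posted: March 18, 2025 | Author: jarxiv

Summary

Synthesizing consistent and photorealistic 3D scenes is an open problem in computer vision.
Video diffusion models generate impressive videos but cannot directly synthesize 3D representations, i.e., the generated sequences lack 3D consistency.
In addition, directly training generative 3D models is challenging because large-scale 3D training data is scarce.
In this work, we present Generative Gaussian Splatting (GGS), a novel approach that integrates a 3D representation with a pre-trained latent video diffusion model.
Specifically, our model synthesizes a feature field parameterized via 3D Gaussian primitives.
The feature field is then either rendered to feature maps and decoded into multi-view images, or directly upsampled into a 3D radiance field.
We evaluate our approach on two common benchmark datasets for scene synthesis, RealEstate10K and ScanNet+, and find that the proposed GGS model significantly improves both the 3D consistency of the generated multi-view images and the quality of the generated 3D scenes over all relevant baselines.
Compared to a similar model without a 3D representation, GGS improves FID on the generated 3D scenes by ~20% on both RealEstate10K and ScanNet+.
Project page: https://katjaschwarz.github.io/ggs/

Summary (original)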

Synthesizing consistent and photorealistic 3D scenes is an open problem in computer vision. Video diffusion models generate impressive videos but cannot directly synthesize 3D representations, i.e., lack 3D consistency in the generated sequences. In addition, directly training generative 3D models is challenging due to a lack of 3D training data at scale. In this work, we present Generative Gaussian Splatting (GGS) — a novel approach that integrates a 3D representation with a pre-trained latent video diffusion model. Specifically, our model synthesizes a feature field parameterized via 3D Gaussian primitives. The feature field is then either rendered to feature maps and decoded into multi-view images, or directly upsampled into a 3D radiance field. We evaluate our approach on two common benchmark datasets for scene synthesis, RealEstate10K and ScanNet+, and find that our proposed GGS model significantly improves both the 3D consistency of the generated multi-view images, and the quality of the generated 3D scenes over all relevant baselines. Compared to a similar model without 3D representation, GGS improves FID on the generated 3D scenes by ~20% on both RealEstate10K and ScanNet+. Project page: https://katjaschwarz.github.io/ggs/

arXiv information

Authors: Katja Schwarz, Norman Mueller, Peter Kontschieder
Published: 2025-03-17 15:24:04+00:00
arXiv site: arxiv_id(pdf)

Categories: cs.CV

Artificial Intelligence-Driven Prognostic Classification of COVID-19 Using Chest X-rays: A Deep Learning Approach

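Posted: March 18, 2025 | Author: jarxiv

Summary

Background: The COVID-19 pandemic has overwhelmed healthcare systems, underscoring the need for AI-driven tools that assist in rapid and accurate patient prognosis.
Chest X-ray imaging is a widely available diagnostic tool, but existing methods for prognosis classification lack scalability and efficiency.
Objective: This study presents a high-accuracy deep learning model, developed on Microsoft Azure Custom Vision, for classifying COVID-19 severity (Mild, Moderate, and Severe) from chest X-ray images.
Methods: Using a dataset of 1,103 confirmed COVID-19 X-ray images from AIforCOVID, we trained and validated a deep learning model that leverages convolutional neural networks (CNNs).
The model was evaluated on an unseen dataset to measure accuracy, precision, and recall.
Results: Our model achieved an average accuracy of 97%, with a specificity of 99%, a sensitivity of 87%, and an F1-score of 93.11%.
When classifying COVID-19 severity, the model achieved accuracies of 89.03% (Mild), 95.77% (Moderate), and 81.16% (Severe).
These results demonstrate the model's potential for real-world clinical applications, aiding faster decision-making and improved resource allocation.
Conclusion: AI-driven prognosis classification using deep learning can significantly enhance COVID-19 patient management, enabling early intervention and efficient triaging.
Our study provides a scalable, high-accuracy AI framework for integrating deep learning into routine clinical workflows.
Future work should focus on expanding datasets, external validation, and regulatory compliance to facilitate clinical adoption.

Summary (original)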

Background: The COVID-19 pandemic has overwhelmed healthcare systems, emphasizing the need for AI-driven tools to assist in rapid and accurate patient prognosis. Chest X-ray imaging is a widely available diagnostic tool, but existing methods for prognosis classification lack scalability and efficiency. Objective: This study presents a high-accuracy deep learning model for classifying COVID-19 severity (Mild, Moderate, and Severe) using Chest X-ray images, developed on Microsoft Azure Custom Vision. Methods: Using a dataset of 1,103 confirmed COVID-19 X-ray images from AIforCOVID, we trained and validated a deep learning model leveraging Convolutional Neural Networks (CNNs). The model was evaluated on an unseen dataset to measure accuracy, precision, and recall. Results: Our model achieved an average accuracy of 97%, with specificity of 99%, sensitivity of 87%, and an F1-score of 93.11%. When classifying COVID-19 severity, the model achieved accuracies of 89.03% (Mild), 95.77% (Moderate), and 81.16% (Severe). These results demonstrate the model’s potential for real-world clinical applications, aiding in faster decision-making and improved resource allocation. Conclusion: AI-driven prognosis classification using deep learning can significantly enhance COVID-19 patient management, enabling early intervention and efficient triaging. Our study provides a scalable, high-accuracy AI framework for integrating deep learning into routine clinical workflows. Future work should focus on expanding datasets, external validation, and regulatory compliance to facilitate clinical adoption.
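
For readers less familiar with the reported metrics, the sketch below shows how accuracy, sensitivity, specificity, precision, and F1-score follow from confusion-matrix counts for one class treated as one-vs-rest. The counts are hypothetical and chosen for easy arithmetic; they are not taken from the paper, and how the paper averages across the three severity classes is not stated in the abstract.

```python
def binary_metrics(tp, fp, fn, tn):
    """Standard one-vs-rest metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # also called recall
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Hypothetical counts for a single class (e.g. "Severe" vs. everything else).
metrics = binary_metrics(tp=80, fp=10, fn=20, tn=90)
```

With these made-up counts, sensitivity is 0.80 and specificity is 0.90. Because the two measure different error types, a model can score high on one while missing a meaningful share of cases on the other, which is why the abstract reports them separately.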

arXiv information

Authors: Alfred Simbun, Suresh Kumar
Published: 2025-03-17 15:27:21+00:00
arXiv site: arxiv_id(pdf)

Categories: cs.AI, cs.CV, eess.IV

Dual-Domain Homogeneous Fusion with Cross-Modal Mamba and Progressive Decoder for 3D Object Detection

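Posted: March 18, 2025 | Author: jarxiv

Summary

Fusing LiDAR and image features in a homogeneous BEV domain has become popular for 3D object detection in autonomous driving.
However, this paradigm is constrained by excessive feature compression.
While some works explore dense voxel fusion to enable better feature interaction, they face high computational costs and challenges in query generation.
Additionally, feature misalignment in both domains results in suboptimal detection accuracy.
To address these limitations, we propose a Dual-Domain Homogeneous Fusion network (DDHFusion), which leverages the complementarity of the BEV and voxel domains while mitigating their drawbacks.
Specifically, we first transform image features into BEV and sparse voxel representations using lift-splat-shot and our proposed Semantic-Aware Feature Sampling (SAFS) module.
The latter significantly reduces computational overhead by discarding unimportant voxels.
Next, we introduce Homogeneous Voxel and BEV Fusion (HVF and HBF) networks for multi-modal fusion within the respective domains.
They are equipped with novel cross-modal Mamba blocks to resolve feature misalignment and enable comprehensive scene perception.
The output voxel features are injected into the BEV space to compensate for the information loss caused by direct height compression.
During query selection, a Progressive Query Generation (PQG) mechanism is implemented in the BEV domain to reduce false negatives caused by feature compression.
Furthermore, we propose a Progressive Decoder (QD) that sequentially aggregates not only context-rich BEV features but also geometry-aware voxel features, using deformable attention and the Multi-Modal Voxel Feature Mixing (MMVFM) block, for precise classification and box regression.

Summary (original)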

Fusing LiDAR and image features in a homogeneous BEV domain has become popular for 3D object detection in autonomous driving. However, this paradigm is constrained by excessive feature compression. While some works explore dense voxel fusion to enable better feature interaction, they face high computational costs and challenges in query generation. Additionally, feature misalignment in both domains results in suboptimal detection accuracy. To address these limitations, we propose a Dual-Domain Homogeneous Fusion network (DDHFusion), which leverages the complementarity of both BEV and voxel domains while mitigating their drawbacks. Specifically, we first transform image features into BEV and sparse voxel representations using lift-splat-shot and our proposed Semantic-Aware Feature Sampling (SAFS) module. The latter significantly reduces computational overhead by discarding unimportant voxels. Next, we introduce Homogeneous Voxel and BEV Fusion (HVF and HBF) networks for multi-modal fusion within respective domains. They are equipped with novel cross-modal Mamba blocks to resolve feature misalignment and enable comprehensive scene perception. The output voxel features are injected into the BEV space to compensate for the information loss brought by direct height compression. During query selection, the Progressive Query Generation (PQG) mechanism is implemented in the BEV domain to reduce false negatives caused by feature compression. Furthermore, we propose a Progressive Decoder (QD) that sequentially aggregates not only context-rich BEV features but also geometry-aware voxel features with deformable attention and the Multi-Modal Voxel Feature Mixing (MMVFM) block for precise classification and box regression.

arXiv information

Authors: Xuzhong Hu, Zaipeng Duan, Pei An, Jun Zhang, Jie Ma
Published: 2025-03-17 15:33:08+00:00
arXiv site: arxiv_id(pdf)

Categories: cs.CV

TraSCE: Trajectory Steering for Concept Erasure

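Posted: March 18, 2025 | Author: jarxiv

Summary

Recent advancements in text-to-image diffusion models have brought them into the public spotlight, making them widely accessible and embraced by everyday users.
However, these models have been shown to generate harmful content such as not-safe-for-work (NSFW) images.
While approaches have been proposed to erase such abstract concepts from the models, jailbreaking techniques have succeeded in bypassing these safety measures.
In this paper, we propose TraSCE, an approach that guides the diffusion trajectory away from generating harmful content.
Our approach is based on negative prompting, but as we show in this paper, a widely used negative prompting strategy is not a complete solution and can easily be bypassed in some corner cases.
To address this issue, we first propose using a specific formulation of negative prompting instead of the widely used one.
Furthermore, we introduce a localized loss-based guidance that enhances the modified negative prompting technique by steering the diffusion trajectory.
We demonstrate that the proposed method achieves state-of-the-art results on various benchmarks for removing harmful content, including ones proposed by red teams, and for erasing artistic styles and objects.
The proposed approach does not require any training, weight modifications, or training data (either image or prompt), making it easier for model owners to erase new concepts.

Summary (original)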

Recent advancements in text-to-image diffusion models have brought them to the public spotlight, becoming widely accessible and embraced by everyday users. However, these models have been shown to generate harmful content such as not-safe-for-work (NSFW) images. While approaches have been proposed to erase such abstract concepts from the models, jail-breaking techniques have succeeded in bypassing such safety measures. In this paper, we propose TraSCE, an approach to guide the diffusion trajectory away from generating harmful content. Our approach is based on negative prompting, but as we show in this paper, a widely used negative prompting strategy is not a complete solution and can easily be bypassed in some corner cases. To address this issue, we first propose using a specific formulation of negative prompting instead of the widely used one. Furthermore, we introduce a localized loss-based guidance that enhances the modified negative prompting technique by steering the diffusion trajectory. We demonstrate that our proposed method achieves state-of-the-art results on various benchmarks in removing harmful content, including ones proposed by red teams, and erasing artistic styles and objects. Our proposed approach does not require any training, weight modifications, or training data (either image or prompt), making it easier for model owners to erase new concepts.
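
The abstract does not spell out the formulas involved, so the sketch below only illustrates the widely used form of negative prompting in classifier-free guidance, where the negative-prompt prediction takes the place of the unconditional one. The noise predictions are hypothetical two-element vectors standing in for a model's output, and treating "the user prompt equals the negative prompt" as the kind of corner case the authors mention is an assumption.

```python
def negative_prompt_guidance(eps_prompt, eps_negative, scale):
    """Widely used form: guide away from the negative prompt, toward the user prompt."""
    return [n + scale * (p - n) for p, n in zip(eps_prompt, eps_negative)]


# Hypothetical noise predictions (stand-ins for a diffusion model's outputs).
eps_prompt = [0.2, -0.4]    # conditioned on an ordinary user prompt
eps_concept = [0.9, 0.1]    # conditioned on the concept to be erased

# Ordinary case: the result is pushed away from the concept's prediction.
steered = negative_prompt_guidance(eps_prompt, eps_concept, scale=7.5)

# Corner case: the user prompt is the erased concept itself.
# The difference term is zero, so the guidance has no effect at any scale.
unchanged = negative_prompt_guidance(eps_concept, eps_concept, scale=7.5)
```

In the second call the output equals the concept-conditioned prediction exactly, so this form of negative prompting does nothing to steer the trajectory away from the concept when the prompt and the negative prompt coincide.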

arXiv information

Authors: Anubhav Jain, Yuya Kobayashi, Takashi Shibuya, Yuhta Takida, Nasir Memon, Julian Togelius, Yuki Mitsufuji
Published: 2025-03-17 15:37:35+00:00
arXiv site: arxiv_id(pdf)

Categories: cs.AI, cs.CV, cs.LG
