jarxiv | Japanese arxiv

Improving Surgical Risk Prediction Through Integrating Automated Body Composition Analysis: a Retrospective Trial on Colectomy Surgery

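Posted: June 16, 2025 | Author: jarxiv

Summary

Objective: To evaluate whether preoperative body composition metrics automatically extracted from CT scans can predict postoperative outcomes after colectomy, either alone or combined with clinical variables or existing risk predictors.
Main outcomes and measures: The primary outcome was the predictive performance for 1-year all-cause mortality following colectomy.
A Cox proportional hazards model with 1-year follow-up was used, and performance was evaluated using the concordance index (C-index) and the Integrated Brier Score (IBS).
Secondary outcomes included postoperative complications, unplanned readmission, blood transfusion, and severe infection, assessed using AUC and Brier score from logistic regression.
Odds ratios (OR) described associations between individual CT-derived body composition metrics and outcomes.
Over 300 features were extracted from preoperative CT scans across multiple vertebral levels, including skeletal muscle area, density, fat areas, and inter-tissue metrics.
NSQIP scores were available for all surgeries after 2012.

Summary (original)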

Objective: To evaluate whether preoperative body composition metrics automatically extracted from CT scans can predict postoperative outcomes after colectomy, either alone or combined with clinical variables or existing risk predictors. Main outcomes and measures: The primary outcome was the predictive performance for 1-year all-cause mortality following colectomy. A Cox proportional hazards model with 1-year follow-up was used, and performance was evaluated using the concordance index (C-index) and Integrated Brier Score (IBS). Secondary outcomes included postoperative complications, unplanned readmission, blood transfusion, and severe infection, assessed using AUC and Brier Score from logistic regression. Odds ratios (OR) described associations between individual CT-derived body composition metrics and outcomes. Over 300 features were extracted from preoperative CTs across multiple vertebral levels, including skeletal muscle area, density, fat areas, and inter-tissue metrics. NSQIP scores were available for all surgeries after 2012.
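
As a rough illustration of the two kinds of metrics named above, the following sketch computes a simplified Harrell-style concordance index and a plain Brier score in pure Python. It is not the authors' evaluation code: the function names, the toy inputs, and the rule of skipping pairs with equal observed times are assumptions made for this example. The Integrated Brier Score, which additionally integrates over time and handles censoring, is not implemented here.

```python
from itertools import combinations


def concordance_index(times, events, risks):
    """Simplified Harrell C-index.

    times  : observed follow-up time per subject
    events : 1 if the event (e.g. death) was observed, 0 if censored
    risks  : predicted risk score (higher = expected to fail earlier)
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # assumption: pairs with tied times are skipped
        # Let `a` be the subject with the shorter observed time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if not events[a]:
            continue  # earlier subject was censored: pair is not comparable
        comparable += 1
        if risks[a] > risks[b]:
            concordant += 1.0
        elif risks[a] == risks[b]:
            concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def brier_score(y_true, y_prob):
    """Mean squared difference between binary outcomes and predicted probabilities."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Toy data (hypothetical, not from the study).
c_index = concordance_index([2, 4, 6, 8], [1, 1, 0, 1], [0.9, 0.6, 0.7, 0.1])
brier = brier_score([1, 0, 1, 0], [0.9, 0.2, 0.6, 0.4])
```

A C-index of 0.5 corresponds to random ranking and 1.0 to perfect ranking, while a lower Brier score indicates better calibrated probabilities.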

arXiv information

Authors	Hanxue Gu,Yaqian Chen,isoo Lee,Diego Schaps,Regina Woody,Roy Colglazier,Maciej A. Mazurowski,Christopher Mantyh
Published	2025-06-13 17:51:14+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google

Category: cs.CV

crossMoDA Challenge: Evolution of Cross-Modality Domain Adaptation Techniques for Vestibular Schwannoma and Cochlea Segmentation from 2021 to 2023

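Posted: June 16, 2025 | Author: jarxiv

Summary

The cross-Modality Domain Adaptation (crossMoDA) challenge series, initiated in 2021 in conjunction with the International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI), focuses on unsupervised cross-modality segmentation, learning from contrast-enhanced T1 (ceT1) and transferring to T2 MRI.
The task is an extreme example of domain shift, chosen to serve as a meaningful and illustrative benchmark.
From a clinical application perspective, it aims to automate Vestibular Schwannoma (VS) and cochlea segmentation on T2 scans for more cost-effective VS management.
Over time, the challenge objectives have evolved to enhance its clinical relevance.
The challenge evolved from using single-institutional data and basic segmentation in 2021 to incorporating multi-institutional data and Koos grading in 2022; by 2023, it included heterogeneous routine data and sub-segmentation of intra- and extra-meatal tumour components.
In this work, we report the findings of the 2022 and 2023 editions and perform a retrospective analysis of the challenge's progression over the years.
Observations from the successive challenge contributions indicate that the number of outliers decreases as the dataset expands.
This is notable because the diversity of scanning protocols in the datasets increased at the same time.
The winning approach of the 2023 edition reduced the number of outliers on the 2021 and 2022 testing data, demonstrating how increased data heterogeneity can enhance segmentation performance even on homogeneous data.
However, the cochlea Dice score declined in 2023, likely due to the added complexity from tumour sub-annotations affecting overall segmentation performance.
While progress is still needed for clinically acceptable VS segmentation, the plateauing performance suggests that a more challenging cross-modal task may better serve future benchmarking.

Summary (original)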

The cross-Modality Domain Adaptation (crossMoDA) challenge series, initiated in 2021 in conjunction with the International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI), focuses on unsupervised cross-modality segmentation, learning from contrast-enhanced T1 (ceT1) and transferring to T2 MRI. The task is an extreme example of domain shift chosen to serve as a meaningful and illustrative benchmark. From a clinical application perspective, it aims to automate Vestibular Schwannoma (VS) and cochlea segmentation on T2 scans for more cost-effective VS management. Over time, the challenge objectives have evolved to enhance its clinical relevance. The challenge evolved from using single-institutional data and basic segmentation in 2021 to incorporating multi-institutional data and Koos grading in 2022, and by 2023, it included heterogeneous routine data and sub-segmentation of intra- and extra-meatal tumour components. In this work, we report the findings of the 2022 and 2023 editions and perform a retrospective analysis of the challenge progression over the years. The observations from the successive challenge contributions indicate that the number of outliers decreases with an expanding dataset. This is notable since the diversity of scanning protocols of the datasets concurrently increased. The winning approach of the 2023 edition reduced the number of outliers on the 2021 and 2022 testing data, demonstrating how increased data heterogeneity can enhance segmentation performance even on homogeneous data. However, the cochlea Dice score declined in 2023, likely due to the added complexity from tumour sub-annotations affecting overall segmentation performance. While progress is still needed for clinically acceptable VS segmentation, the plateauing performance suggests that a more challenging cross-modal task may better serve future benchmarking.
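
For readers unfamiliar with the Dice score mentioned above, here is a minimal sketch of how it is commonly computed for two binary segmentation masks. The function name, the flat-list mask representation, and the convention of returning 1.0 when both masks are empty are assumptions for this example; the challenge's own evaluation code may differ.

```python
def dice_score(pred, target):
    """Dice = 2 * |A ∩ B| / (|A| + |B|) for two binary masks.

    Both masks are flat sequences of 0/1 values of equal length.
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        # Assumed convention: two empty masks agree perfectly.
        return 1.0
    return 2.0 * intersection / total


# Hypothetical 5-voxel masks: 2 voxels overlap, each mask has 3 foreground voxels.
score = dice_score([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

Because the denominator is the total foreground size, small structures such as the cochlea are sensitive to a few mislabelled voxels, which is one reason Dice is usually reported per structure.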

arXiv information

Authors	Navodini Wijethilake,Reuben Dorent,Marina Ivory,Aaron Kujawa,Stefan Cornelissen,Patrick Langenhuizen,Mohamed Okasha,Anna Oviedova,Hexin Dong,Bogyeong Kang,Guillaume Sallé,Luyi Han,Ziyuan Zhao,Han Liu,Tao Yang,Shahad Hardan,Hussain Alasmawi,Santosh Sanjeev,Yuzhou Zhuang,Satoshi Kondo,Maria Baldeon Calisto,Shaikh Muhammad Uzair Noman,Cancan Chen,Ipek Oguz,Rongguo Zhang,Mina Rezaei,Susana K. Lai-Yuen,Satoshi Kasai,Chih-Cheng Hung,Mohammad Yaqub,Lisheng Wang,Benoit M. Dawant,Cuntai Guan,Ritse Mann,Vincent Jaouen,Ji-Wung Han,Li Zhang,Jonathan Shapey,Tom Vercauteren
Published	2025-06-13 17:56:39+00:00
arXiv site	arxiv_id(pdf)

Category: cs.CV, eess.IV

SIMSHIFT: A Benchmark for Adapting Neural Surrogates to Distribution Shifts

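Posted: June 16, 2025 | Author: jarxiv

Summary

Neural surrogates for Partial Differential Equations (PDEs) often suffer significant performance degradation when evaluated on unseen problem configurations, such as novel material types or structural dimensions.
Meanwhile, Domain Adaptation (DA) techniques have been widely used in vision and language processing to generalize from limited information about unseen configurations.
In this work, we address this gap through two focused contributions.
First, we introduce SIMSHIFT, a novel benchmark dataset and evaluation suite composed of four industrial simulation tasks: hot rolling, sheet metal forming, electric motor design and heatsink design.
Second, we extend established domain adaptation methods to state-of-the-art neural surrogates and systematically evaluate them.
These approaches use parametric descriptions and ground-truth simulations from multiple source configurations, together with only parametric descriptions from target configurations.
The goal is to accurately predict target simulations without access to ground-truth simulation data.
Extensive experiments on SIMSHIFT highlight the challenges of out-of-distribution neural surrogate modeling, demonstrate the potential of DA in simulation, and reveal open problems in achieving robust neural surrogates under distribution shifts in industrially relevant scenarios.
The codebase is available at https://github.com/psetinek/simshift

Summary (original)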

Neural surrogates for Partial Differential Equations (PDEs) often suffer significant performance degradation when evaluated on unseen problem configurations, such as novel material types or structural dimensions. Meanwhile, Domain Adaptation (DA) techniques have been widely used in vision and language processing to generalize from limited information about unseen configurations. In this work, we address this gap through two focused contributions. First, we introduce SIMSHIFT, a novel benchmark dataset and evaluation suite composed of four industrial simulation tasks: hot rolling, sheet metal forming, electric motor design and heatsink design. Second, we extend established domain adaptation methods to state of the art neural surrogates and systematically evaluate them. These approaches use parametric descriptions and ground truth simulations from multiple source configurations, together with only parametric descriptions from target configurations. The goal is to accurately predict target simulations without access to ground truth simulation data. Extensive experiments on SIMSHIFT highlight the challenges of out of distribution neural surrogate modeling, demonstrate the potential of DA in simulation, and reveal open problems in achieving robust neural surrogates under distribution shifts in industrially relevant scenarios. Our codebase is available at https://github.com/psetinek/simshift
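
The abstract does not name the specific domain adaptation methods it extends, so the following is only a generic sketch of the unsupervised setting it describes: labelled source data plus unlabelled target inputs. As a stand-in, it uses a linear-kernel maximum mean discrepancy (the squared distance between the mean feature vectors of the two domains) as an alignment penalty added to the supervised source loss. All names and the weighting value are hypothetical and are not taken from the SIMSHIFT codebase.

```python
def feature_mean(features):
    """Column-wise mean of a list of equal-length feature vectors."""
    dim = len(features[0])
    return [sum(row[d] for row in features) / len(features) for d in range(dim)]


def linear_mmd(source_feats, target_feats):
    """Linear-kernel MMD: squared Euclidean distance between domain means."""
    mean_s = feature_mean(source_feats)
    mean_t = feature_mean(target_feats)
    return sum((s - t) ** 2 for s, t in zip(mean_s, mean_t))


def adaptation_objective(source_task_loss, source_feats, target_feats, weight=0.1):
    """Supervised loss on the source domain plus a feature-alignment penalty.

    Only source samples contribute to `source_task_loss`; the target domain
    enters solely through its (unlabelled) features.
    """
    return source_task_loss + weight * linear_mmd(source_feats, target_feats)


# Hypothetical latent features produced by a surrogate for each domain.
src = [[0.0, 0.0], [2.0, 2.0]]
tgt = [[1.0, 3.0], [3.0, 5.0]]
total = adaptation_objective(0.5, src, tgt, weight=0.1)
```

The design point this illustrates is that the penalty needs no target simulations, matching the constraint that ground-truth data is unavailable for target configurations.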

arXiv information

Authors	Paul Setinek,Gianluca Galletti,Thomas Gross,Dominik Schnürer,Johannes Brandstetter,Werner Zellinger
Published	2025-06-13 17:56:49+00:00
arXiv site	arxiv_id(pdf)

Category: cs.CV, cs.LG, physics.comp-ph

Affogato: Learning Open-Vocabulary Affordance Grounding with Automated Data Generation at Scale

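Posted: June 16, 2025 | Author: jarxiv

Summary

Affordance grounding, that is, localizing object regions based on natural language descriptions of interactions, is a critical challenge for enabling intelligent agents to understand and interact with their environments.
However, this task remains challenging due to the need for fine-grained part-level localization, the ambiguity arising from multiple valid interaction regions, and the scarcity of large-scale datasets.
In this work, we introduce Affogato, a large-scale benchmark comprising 150K instances, annotated with open-vocabulary text descriptions and corresponding 3D affordance heatmaps across a diverse set of objects and interactions.
Building on this benchmark, we develop simple yet effective vision-language models that leverage pretrained part-aware vision backbones and a text-conditional heatmap decoder.
Our models trained with the Affogato dataset achieve promising performance on existing 2D and 3D benchmarks and, notably, exhibit effectiveness in open-vocabulary cross-domain generalization.
The Affogato dataset is publicly shared at https://huggingface.co/datasets/project-affogato/affogato

Summary (original)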

Affordance grounding, that is, localizing object regions based on natural language descriptions of interactions, is a critical challenge for enabling intelligent agents to understand and interact with their environments. However, this task remains challenging due to the need for fine-grained part-level localization, the ambiguity arising from multiple valid interaction regions, and the scarcity of large-scale datasets. In this work, we introduce Affogato, a large-scale benchmark comprising 150K instances, annotated with open-vocabulary text descriptions and corresponding 3D affordance heatmaps across a diverse set of objects and interactions. Building on this benchmark, we develop simple yet effective vision-language models that leverage pretrained part-aware vision backbones and a text-conditional heatmap decoder. Our models trained with the Affogato dataset achieve promising performance on the existing 2D and 3D benchmarks, and notably, exhibit effectiveness in open-vocabulary cross-domain generalization. The Affogato dataset is shared in public: https://huggingface.co/datasets/project-affogato/affogato

arXiv information

Authors	Junha Lee,Eunha Park,Chunghyun Park,Dahyun Kang,Minsu Cho
Published	2025-06-13 17:57:18+00:00
arXiv site	arxiv_id(pdf)

Category: cs.CV

EMLoC: Emulator-based Memory-efficient Fine-tuning with LoRA Correction

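Posted: June 16, 2025 | Author: jarxiv

Summary

Open-source foundation models have seen rapid adoption and development, enabling powerful general-purpose capabilities across diverse domains.
However, fine-tuning large foundation models for domain-specific or personalized tasks remains prohibitively expensive for most users due to the significant memory overhead beyond that of inference.
We introduce EMLoC, an Emulator-based Memory-efficient fine-tuning framework with LoRA Correction, which enables model fine-tuning within the same memory budget required for inference.
EMLoC constructs a task-specific lightweight emulator using activation-aware singular value decomposition (SVD) on a small downstream calibration set.
Fine-tuning is then performed on this lightweight emulator via LoRA.
To tackle the misalignment between the original model and the compressed emulator, we propose a novel compensation algorithm to correct the fine-tuned LoRA module, which can then be merged into the original model for inference.
EMLoC supports flexible compression ratios and standard training pipelines, making it adaptable to a wide range of applications.
Extensive experiments demonstrate that EMLoC outperforms other baselines across multiple datasets and modalities.
Moreover, without quantization, EMLoC enables fine-tuning of a 38B model on a single 24GB consumer GPU, bringing efficient and practical model adaptation to individual users.

Summary (original)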

Open-source foundation models have seen rapid adoption and development, enabling powerful general-purpose capabilities across diverse domains. However, fine-tuning large foundation models for domain-specific or personalized tasks remains prohibitively expensive for most users due to the significant memory overhead beyond that of inference. We introduce EMLoC, an Emulator-based Memory-efficient fine-tuning framework with LoRA Correction, which enables model fine-tuning within the same memory budget required for inference. EMLoC constructs a task-specific lightweight emulator using activation-aware singular value decomposition (SVD) on a small downstream calibration set. Fine-tuning then is performed on this lightweight emulator via LoRA. To tackle the misalignment between the original model and the compressed emulator, we propose a novel compensation algorithm to correct the fine-tuned LoRA module, which thus can be merged into the original model for inference. EMLoC supports flexible compression ratios and standard training pipelines, making it adaptable to a wide range of applications. Extensive experiments demonstrate that EMLoC outperforms other baselines across multiple datasets and modalities. Moreover, without quantization, EMLoC enables fine-tuning of a 38B model on a single 24GB consumer GPU, bringing efficient and practical model adaptation to individual users.
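
The final step mentioned above, merging a LoRA module into the original weights for inference, can be sketched with the standard LoRA update W' = W + (alpha / r) * B A. The code below shows only that generic merge on small nested lists. It does not implement EMLoC's emulator construction or its compensation algorithm, and the function names and example values are assumptions for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists (rows of a by columns of b)."""
    inner = len(b)
    cols = len(b[0])
    return [
        [sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for row in a
    ]


def merge_lora(weight, lora_b, lora_a, alpha, rank):
    """Return weight + (alpha / rank) * (lora_b @ lora_a).

    weight : d_out x d_in base matrix
    lora_b : d_out x rank matrix
    lora_a : rank x d_in matrix
    """
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Hypothetical 2x2 base weight with a rank-1 adapter.
base = [[1.0, 0.0], [0.0, 1.0]]
b = [[1.0], [2.0]]
a = [[3.0, 4.0]]
merged = merge_lora(base, b, a, alpha=2.0, rank=1)
```

After merging, inference uses a single dense matrix, so the adapter adds no extra latency. The adapter itself stores only rank * (d_out + d_in) values instead of d_out * d_in.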

arXiv information

Authors	Hsi-Che Lin,Yu-Chu Yu,Kai-Po Chang,Yu-Chiang Frank Wang
Published	2025-06-13 17:59:58+00:00
arXiv site	arxiv_id(pdf)

Category: cs.AI, cs.CV, cs.LG

RationalVLA: A Rational Vision-Language-Action Model with Dual System

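Posted: June 16, 2025 | Author: jarxiv

Summary

A fundamental requirement for real-world robotic deployment is the ability to understand and respond to natural language instructions.
Existing language-conditioned manipulation tasks typically assume that instructions are perfectly aligned with the environment.
This assumption limits robustness and generalization in realistic scenarios where instructions may be ambiguous, irrelevant, or infeasible.
To address this problem, we introduce RAtional MAnipulation (RAMA), a new benchmark that challenges models with both unseen executable instructions and defective ones that should be rejected.
In RAMA, we construct a dataset with over 14,000 samples, including diverse defective instructions spanning six dimensions: visual, physical, semantic, motion, safety, and out-of-context.
We further propose the Rational Vision-Language-Action model (RationalVLA).
It is a dual system for robotic arms that integrates a high-level vision-language model with a low-level manipulation policy by introducing learnable latent space embeddings.
This design enables RationalVLA to reason over instructions, reject infeasible commands, and execute manipulation effectively.
Experiments demonstrate that RationalVLA outperforms state-of-the-art baselines on RAMA by a 14.5% higher success rate and 0.94 average task length, while maintaining competitive performance on standard manipulation tasks.
Real-world trials further validate its effectiveness and robustness in practical applications.
The project page is https://irpn-eai.github.io/RationalVLA.

Summary (original)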

A fundamental requirement for real-world robotic deployment is the ability to understand and respond to natural language instructions. Existing language-conditioned manipulation tasks typically assume that instructions are perfectly aligned with the environment. This assumption limits robustness and generalization in realistic scenarios where instructions may be ambiguous, irrelevant, or infeasible. To address this problem, we introduce RAtional MAnipulation (RAMA), a new benchmark that challenges models with both unseen executable instructions and defective ones that should be rejected. In RAMA, we construct a dataset with over 14,000 samples, including diverse defective instructions spanning six dimensions: visual, physical, semantic, motion, safety, and out-of-context. We further propose the Rational Vision-Language-Action model (RationalVLA). It is a dual system for robotic arms that integrates the high-level vision-language model with the low-level manipulation policy by introducing learnable latent space embeddings. This design enables RationalVLA to reason over instructions, reject infeasible commands, and execute manipulation effectively. Experiments demonstrate that RationalVLA outperforms state-of-the-art baselines on RAMA by a 14.5% higher success rate and 0.94 average task length, while maintaining competitive performance on standard manipulation tasks. Real-world trials further validate its effectiveness and robustness in practical applications. Our project page is https://irpn-eai.github.io/RationalVLA.

arXiv information

Authors	Wenxuan Song,Jiayi Chen,Wenxue Li,Xu He,Han Zhao,Can Cui,Pengxiang Ding,Shiyan Su,Feilong Tang,Xuelian Cheng,Donglin Wang,Zongyuan Ge,Xinhu Zheng,Zhe Liu,Hesheng Wang,Haoang Li
Published	2025-06-13 12:14:44+00:00
arXiv site	arxiv_id(pdf)

Category: cs.RO

Persistent Topological Features in Large Language Models

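Posted: June 16, 2025 | Author: jarxiv

Summary

Understanding the decision-making processes of large language models is critical given their widespread applications.
To achieve this, we aim to connect a formal mathematical framework, zigzag persistence from topological data analysis, with practical and easily applicable algorithms.
Zigzag persistence is particularly effective for characterizing data as it dynamically transforms across model layers.
Within this framework, we introduce topological descriptors that measure how topological features, $p$-dimensional holes, persist and evolve throughout the layers.
Unlike methods that assess each layer individually and then aggregate the results, our approach directly tracks the full evolutionary path of these features.
This offers a statistical perspective on how prompts are rearranged and how their relative positions change in the representation space, providing insights into the system's operation as an integrated whole.
To demonstrate the expressivity and applicability of our framework, we highlight how sensitive these descriptors are to different models and a variety of datasets.
As a showcase application to a downstream task, we use zigzag persistence to establish a criterion for layer pruning, achieving results comparable to state-of-the-art methods while preserving the system-level perspective.

Summary (original)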

Understanding the decision-making processes of large language models is critical given their widespread applications. To achieve this, we aim to connect a formal mathematical framework – zigzag persistence from topological data analysis – with practical and easily applicable algorithms. Zigzag persistence is particularly effective for characterizing data as it dynamically transforms across model layers. Within this framework, we introduce topological descriptors that measure how topological features, $p$-dimensional holes, persist and evolve throughout the layers. Unlike methods that assess each layer individually and then aggregate the results, our approach directly tracks the full evolutionary path of these features. This offers a statistical perspective on how prompts are rearranged and their relative positions changed in the representation space, providing insights into the system’s operation as an integrated whole. To demonstrate the expressivity and applicability of our framework, we highlight how sensitive these descriptors are to different models and a variety of datasets. As a showcase application to a downstream task, we use zigzag persistence to establish a criterion for layer pruning, achieving results comparable to state-of-the-art methods while preserving the system-level perspective.

arXiv information

Authors	Yuri Gardinazzi,Karthik Viswanathan,Giada Panerai,Alessio Ansuini,Alberto Cazzaniga,Matteo Biagetti
Published	2025-06-13 12:27:46+00:00
arXiv site	arxiv_id(pdf)

Category: cs.CG, cs.CL, cs.LG

PiPViT: Patch-based Visual Interpretable Prototypes for Retinal Image Analysis

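Posted: June 16, 2025 | Author: jarxiv

Summary

Background and Objective: Prototype-based methods improve interpretability by learning fine-grained part-prototypes.
However, their visualization in the input pixel space is not always consistent with human-understandable biomarkers.
In addition, well-known prototype-based approaches typically learn extremely granular prototypes that are less interpretable in medical imaging, where both the presence and extent of biomarkers and lesions are critical.
Methods: To address these challenges, we propose PiPViT (Patch-based Visual Interpretable Prototypes), an inherently interpretable prototypical model for image recognition.
Leveraging a vision transformer (ViT), PiPViT captures long-range dependencies among patches to learn robust, human-interpretable prototypes that approximate lesion extent using only image-level labels.
Additionally, PiPViT benefits from contrastive learning and multi-resolution input processing, which enables effective localization of biomarkers across scales.
Results: We evaluated PiPViT on retinal OCT image classification across four datasets, where it achieved competitive quantitative performance compared to state-of-the-art methods while delivering more meaningful explanations.
Moreover, quantitative evaluation on a hold-out test set confirms that the learned prototypes are semantically and clinically relevant.
We believe PiPViT can transparently explain its decisions and assist clinicians in understanding diagnostic outcomes.
GitHub page: https://github.com/marziehoghbaie/PiPViT

Summary (original)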

Background and Objective: Prototype-based methods improve interpretability by learning fine-grained part-prototypes; however, their visualization in the input pixel space is not always consistent with human-understandable biomarkers. In addition, well-known prototype-based approaches typically learn extremely granular prototypes that are less interpretable in medical imaging, where both the presence and extent of biomarkers and lesions are critical. Methods: To address these challenges, we propose PiPViT (Patch-based Visual Interpretable Prototypes), an inherently interpretable prototypical model for image recognition. Leveraging a vision transformer (ViT), PiPViT captures long-range dependencies among patches to learn robust, human-interpretable prototypes that approximate lesion extent only using image-level labels. Additionally, PiPViT benefits from contrastive learning and multi-resolution input processing, which enables effective localization of biomarkers across scales. Results: We evaluated PiPViT on retinal OCT image classification across four datasets, where it achieved competitive quantitative performance compared to state-of-the-art methods while delivering more meaningful explanations. Moreover, quantitative evaluation on a hold-out test set confirms that the learned prototypes are semantically and clinically relevant. We believe PiPViT can transparently explain its decisions and assist clinicians in understanding diagnostic outcomes. Github page: https://github.com/marziehoghbaie/PiPViT

arXiv information

Authors	Marzieh Oghbaie,Teresa Araújo,Hrvoje Bogunović
Published	2025-06-13 08:57:35+00:00
arXiv site	arxiv_id(pdf)

Category: cs.AI, cs.CV

PhysNav-DG: A Novel Adaptive Framework for Robust VLM-Sensor Fusion in Navigation Applications

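Posted: June 16, 2025 | Author: jarxiv

Summary

Robust navigation in diverse environments and domains requires both accurate state estimation and transparent decision making.
PhysNav-DG is a novel framework that integrates classical sensor fusion with the semantic power of vision-language models.
Its dual-branch architecture predicts navigation actions from multi-sensor inputs while simultaneously generating detailed chain-of-thought explanations.
A modified Adaptive Kalman Filter dynamically adjusts its noise parameters based on environmental context.
It leverages several streams of raw sensor data along with semantic insights from models such as LLaMA 3.2 11B and BLIP-2.
To evaluate the approach, we introduce the MD-NEX Benchmark, a novel multi-domain dataset that unifies indoor navigation, autonomous driving, and social navigation tasks with ground-truth actions and human-validated explanations.
Extensive experiments and ablations show that PhysNav-DG improves navigation success rates by over 20% and achieves high efficiency, with explanations that are both highly grounded and clear.
This work connects high-level semantic reasoning and geometric planning for safer and more trustworthy autonomous systems.

Summary (original)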

Robust navigation in diverse environments and domains requires both accurate state estimation and transparent decision making. We present PhysNav-DG, a novel framework that integrates classical sensor fusion with the semantic power of vision-language models. Our dual-branch architecture predicts navigation actions from multi-sensor inputs while simultaneously generating detailed chain-of-thought explanations. A modified Adaptive Kalman Filter dynamically adjusts its noise parameters based on environmental context. It leverages several streams of raw sensor data along with semantic insights from models such as LLaMA 3.2 11B and BLIP-2. To evaluate our approach, we introduce the MD-NEX Benchmark, a novel multi-domain dataset that unifies indoor navigation, autonomous driving, and social navigation tasks with ground-truth actions and human-validated explanations. Extensive experiments and ablations show that PhysNav-DG improves navigation success rates by over 20% and achieves high efficiency, with explanations that are both highly grounded and clear. This work connects high-level semantic reasoning and geometric planning for safer and more trustworthy autonomous systems.
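
To make the idea of context-dependent noise parameters concrete, here is a minimal scalar Kalman filter whose measurement noise is scaled by a context label. This is a sketch under stated assumptions, not the paper's modified Adaptive Kalman Filter: the constant-state motion model, the context labels, and the multiplier table are all hypothetical.

```python
# Hypothetical mapping from an environment label to a measurement-noise multiplier.
CONTEXT_NOISE_SCALE = {"clear": 1.0, "degraded": 3.0}


def kalman_step(x, p, z, q, r):
    """One predict/update cycle of a scalar Kalman filter.

    x, p : prior state estimate and its variance
    z    : new measurement
    q, r : process and measurement noise variances
    """
    p_pred = p + q                 # predict: state assumed constant, uncertainty grows
    gain = p_pred / (p_pred + r)   # Kalman gain: how much to trust the measurement
    x_new = x + gain * (z - x)     # update the estimate with the innovation
    p_new = (1.0 - gain) * p_pred  # posterior variance
    return x_new, p_new


def adaptive_filter(measurements, contexts, x0=0.0, p0=1.0, base_r=1.0, q=0.01):
    """Filter a sequence, scaling measurement noise by the context at each step."""
    x, p = x0, p0
    estimates = []
    for z, context in zip(measurements, contexts):
        r = base_r * CONTEXT_NOISE_SCALE.get(context, 1.0)  # unknown labels: no scaling
        x, p = kalman_step(x, p, z, q, r)
        estimates.append(x)
    return estimates


# The same measurement moves the estimate less when the context is "degraded".
trusted = adaptive_filter([2.0], ["clear"], q=0.0)
cautious = adaptive_filter([2.0], ["degraded"], q=0.0)
```

Raising the measurement noise lowers the Kalman gain, so the filter leans more on its prediction when the context suggests the sensor is unreliable.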

arXiv information

Authors	Trisanth Srinivasan,Santosh Patapati
Published	2025-06-13 03:36:19+00:00
arXiv site	arxiv_id(pdf)

Category: cs.AI, cs.CV, cs.LG, cs.MM, cs.RO

Accelerating Diffusion Large Language Models with SlowFast Sampling: The Three Golden Principles

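Posted: June 16, 2025 | Author: jarxiv

Summary

Diffusion-based language models (dLLMs) have emerged as a promising alternative to traditional autoregressive LLMs by enabling parallel token generation and significantly reducing inference latency.
However, existing sampling strategies for dLLMs, such as confidence-based or semi-autoregressive decoding, often suffer from static behavior, leading to suboptimal efficiency and limited flexibility.
In this paper, we propose SlowFast Sampling, a novel dynamic sampling strategy that adaptively alternates between exploratory and accelerated decoding stages.
Our method is guided by three golden principles: the certainty principle, the convergence principle, and the positional principle, which govern when and where tokens can be confidently and efficiently decoded.
We further integrate our strategy with dLLM-Cache to reduce redundant computation.
Extensive experiments across benchmarks and models show that SlowFast Sampling achieves up to 15.63$\times$ speedup on LLaDA with minimal accuracy drop, and up to 34.22$\times$ when combined with caching.
Notably, our approach outperforms strong autoregressive baselines like LLaMA3 8B in throughput, demonstrating that well-designed sampling can unlock the full potential of dLLMs for fast and high-quality generation.

Summary (original)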

Diffusion-based language models (dLLMs) have emerged as a promising alternative to traditional autoregressive LLMs by enabling parallel token generation and significantly reducing inference latency. However, existing sampling strategies for dLLMs, such as confidence-based or semi-autoregressive decoding, often suffer from static behavior, leading to suboptimal efficiency and limited flexibility. In this paper, we propose SlowFast Sampling, a novel dynamic sampling strategy that adaptively alternates between exploratory and accelerated decoding stages. Our method is guided by three golden principles: certainty principle, convergence principle, and positional principle, which govern when and where tokens can be confidently and efficiently decoded. We further integrate our strategy with dLLM-Cache to reduce redundant computation. Extensive experiments across benchmarks and models show that SlowFast Sampling achieves up to 15.63$\times$ speedup on LLaDA with minimal accuracy drop, and up to 34.22$\times$ when combined with caching. Notably, our approach outperforms strong autoregressive baselines like LLaMA3 8B in throughput, demonstrating that well-designed sampling can unlock the full potential of dLLMs for fast and high-quality generation.
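
The confidence-based decoding that the abstract cites as an existing baseline can be sketched as a single unmasking step: every masked position whose predicted token clears a fixed confidence threshold is committed in parallel, and if none does, the single most confident position is committed so that decoding still progresses. This illustrates the static baseline only, not SlowFast Sampling itself. The function name, the use of `None` as the mask marker, and the threshold value are assumptions.

```python
MASK = None  # assumed marker for a position that has not been decoded yet


def confidence_decode_step(sequence, predictions, threshold=0.9):
    """Commit masked positions whose prediction confidence meets the threshold.

    sequence    : list of tokens, with MASK at undecoded positions
    predictions : dict mapping each masked position to (token, confidence)
    """
    masked = [i for i, tok in enumerate(sequence) if tok is MASK]
    if not masked:
        return list(sequence)  # nothing left to decode
    chosen = [i for i in masked if predictions[i][1] >= threshold]
    if not chosen:
        # Fallback: always commit the most confident position.
        chosen = [max(masked, key=lambda i: predictions[i][1])]
    decoded = list(sequence)
    for i in chosen:
        decoded[i] = predictions[i][0]
    return decoded


# Hypothetical model outputs for the two masked positions.
step = confidence_decode_step(
    ["The", MASK, MASK, "mat"],
    {1: ("cat", 0.95), 2: ("sat", 0.60)},
)
```

Because the threshold is fixed for the whole run, the number of tokens committed per step does not adapt to how settled the sequence is, which is the static behavior the abstract identifies as a limitation.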

arXiv information

Authors	Qingyan Wei,Yaojie Zhang,Zhiyuan Liu,Dongrui Liu,Linfeng Zhang
Published	2025-06-13 02:28:47+00:00
arXiv site	arxiv_id(pdf)

Category: cs.AI, cs.CL, cs.LG
