jarxiv | Japanese arxiv | Page 953

FindAnything: Open-Vocabulary and Object-Centric Mapping for Robot Exploration in Any Environment

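Posted: April 14, 2025 | Author: jarxiv

Summary

Geometrically accurate, semantically expressive map representations have proven invaluable for robust and safe mobile robot navigation and task planning.
Even so, real-time open-vocabulary semantic understanding of large, unknown environments remains an open problem.
This paper introduces FindAnything, an open-world mapping and exploration framework that incorporates vision-language information into dense volumetric submaps.
By using vision-language features, FindAnything bridges the gap between purely geometric and open-vocabulary semantic information for higher-level understanding, and it can explore any environment without an external source of ground-truth pose information.
The environment is represented as a series of volumetric occupancy submaps, which gives a robust and accurate map that deforms on pose updates when the underlying SLAM system corrects its drift and stays locally consistent between submaps.
Pixel-wise vision-language features are aggregated from segments generated by efficient SAM (eSAM) and integrated into object-centric volumetric submaps, providing a mapping from open-vocabulary queries to 3D geometry that also scales in memory usage.
FindAnything's open-vocabulary map representation achieves state-of-the-art semantic accuracy in closed-set evaluations on the Replica dataset.
This level of scene understanding lets a robot explore an environment based on objects or areas of interest selected through natural language queries.
The system is the first of its kind to be deployed on resource-constrained devices such as MAVs, leveraging vision-language information for real-world robotic tasks.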
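
The mapping from an open-vocabulary query to mapped objects can be illustrated with a small sketch. It assumes each object in the map stores one aggregated feature vector and that a text query has already been embedded into the same space; the function names, object IDs, and 3-D toy vectors are hypothetical, and the abstract does not state which similarity measure FindAnything uses (cosine similarity is a common choice and is assumed here).

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def query_objects(text_embedding, object_features, top_k=1):
    """Rank mapped objects by similarity to the query embedding."""
    scored = [(cosine(text_embedding, feat), obj_id)
              for obj_id, feat in object_features.items()]
    scored.sort(reverse=True)
    return [obj_id for _, obj_id in scored[:top_k]]

# Toy per-object features (hypothetical values, not from the paper).
object_features = {
    "object_0": [0.9, 0.1, 0.0],
    "object_1": [0.0, 1.0, 0.2],
    "object_2": [0.1, 0.0, 1.0],
}
query = [0.0, 0.9, 0.1]  # stand-in for an embedded text query
best = query_objects(query, object_features)
```

In a real system the selected object ID would then index into the volumetric submap that holds its geometry, which is what would let an exploration planner steer toward it.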

Summary (original)

Geometrically accurate and semantically expressive map representations have proven invaluable to facilitate robust and safe mobile robot navigation and task planning. Nevertheless, real-time, open-vocabulary semantic understanding of large-scale unknown environments is still an open problem. In this paper we present FindAnything, an open-world mapping and exploration framework that incorporates vision-language information into dense volumetric submaps. Thanks to the use of vision-language features, FindAnything bridges the gap between pure geometric and open-vocabulary semantic information for a higher level of understanding while allowing to explore any environment without the help of any external source of ground-truth pose information. We represent the environment as a series of volumetric occupancy submaps, resulting in a robust and accurate map representation that deforms upon pose updates when the underlying SLAM system corrects its drift, allowing for a locally consistent representation between submaps. Pixel-wise vision-language features are aggregated from efficient SAM (eSAM)-generated segments, which are in turn integrated into object-centric volumetric submaps, providing a mapping from open-vocabulary queries to 3D geometry that is scalable also in terms of memory usage. The open-vocabulary map representation of FindAnything achieves state-of-the-art semantic accuracy in closed-set evaluations on the Replica dataset. This level of scene understanding allows a robot to explore environments based on objects or areas of interest selected via natural language queries. Our system is the first of its kind to be deployed on resource-constrained devices, such as MAVs, leveraging vision-language information for real-world robotic tasks.

arXiv information

Authors: Sebastián Barbas Laina, Simon Boche, Sotiris Papatheodorou, Simon Schaefer, Jaehyung Jung, Stefan Leutenegger
Published: 2025-04-11 15:12:05+00:00
arXiv site: arxiv_id(pdf)

Sources and services used

arxiv.jp, Google

Categories: cs.AI, cs.CV, cs.RO

Enhancing knowledge retention for continual learning with domain-specific adapters and features gating

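Posted: April 14, 2025 | Author: jarxiv

Summary

Continual learning lets models learn from a continuous stream of data while preserving previously acquired knowledge, addressing the challenge of catastrophic forgetting.
This study proposes a new approach that integrates adapters into the self-attention mechanisms of Vision Transformers to strengthen knowledge retention when datasets from different domains are added sequentially.
Unlike previous methods that continue learning with only one dataset, the approach introduces domain-specific output heads and feature gating, maintaining high accuracy on previously learned tasks while incorporating only the essential information from multiple domains.
The proposed method is compared with prominent state-of-the-art parameter-efficient fine-tuning methods.
The results provide evidence that the method effectively alleviates the limitations of previous work.
In addition, a comparative analysis using three datasets, CIFAR-100, Flowers102, and DTD, each representing a distinct domain, investigates the effect of task order on model performance.
The findings underscore the critical role of dataset sequencing in shaping learning outcomes: strategic ordering can significantly improve the model's ability to adapt to evolving data distributions over time while preserving previously learned knowledge.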
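
To make the combination of adapters, feature gating, and domain-specific heads more concrete, here is a minimal sketch. It is an illustration under assumptions, not the authors' architecture: it uses a generic bottleneck adapter with a sigmoid gate added residually to a feature vector, and one linear head per domain. All names (`gated_adapter`, `predict`, the `"cifar"` key) and the toy weights are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def gated_adapter(h, W_down, W_up, gate_logits):
    """Return h + sigmoid(gate) * W_up(relu(W_down h)), element-wise gated."""
    z = [max(0.0, x) for x in matvec(W_down, h)]   # bottleneck + ReLU
    delta = matvec(W_up, z)                         # project back up
    return [hi + sigmoid(g) * d for hi, g, d in zip(h, gate_logits, delta)]

def predict(h, heads, domain):
    """Classify with the output head that belongs to the given domain."""
    logits = matvec(heads[domain], h)
    return max(range(len(logits)), key=logits.__getitem__)

# Toy example (hypothetical weights).
h = [1.0, 2.0]
W_down = [[1.0, 0.0]]        # 2 -> 1
W_up = [[1.0], [1.0]]        # 1 -> 2
adapted = gated_adapter(h, W_down, W_up, gate_logits=[0.0, 0.0])
heads = {"cifar": [[1.0, 0.0], [0.0, 1.0]]}
label = predict(adapted, heads, "cifar")
```

The gate lets each feature dimension decide how much of the adapter's update to admit, while the frozen input `h` passes through unchanged when the gate is closed.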

Summary (original)

Continual learning empowers models to learn from a continuous stream of data while preserving previously acquired knowledge, effectively addressing the challenge of catastrophic forgetting. In this study, we propose a new approach that integrates adapters within the self-attention mechanisms of Vision Transformers to enhance knowledge retention when sequentially adding datasets from different domains. Unlike previous methods that continue learning with only one dataset, our approach introduces domain-specific output heads and feature gating, allowing the model to maintain high accuracy on previously learned tasks while incorporating only the essential information from multiple domains. The proposed method is compared to prominent parameter-efficient fine-tuning methods in the current state of the art. The results provide evidence that our method effectively alleviates the limitations of previous works. Furthermore, we conduct a comparative analysis using three datasets, CIFAR-100, Flowers102, and DTD, each representing a distinct domain, to investigate the impact of task order on model performance. Our findings underscore the critical role of dataset sequencing in shaping learning outcomes, demonstrating that strategic ordering can significantly improve the model’s ability to adapt to evolving data distributions over time while preserving the integrity of previously learned knowledge.

arXiv information

Authors: Mohamed Abbas Hedjazi, Oussama Hadjerci, Adel Hafiane
Published: 2025-04-11 15:20:08+00:00
arXiv site: arxiv_id(pdf)

Categories: cs.CV, eess.IV

Preserving Privacy Without Compromising Accuracy: Machine Unlearning for Handwritten Text Recognition

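Posted: April 14, 2025 | Author: jarxiv

Summary

Handwritten Text Recognition (HTR) is essential for document analysis and digitization.
However, handwritten data often contains user-identifiable information, such as unique handwriting styles and personal lexicon choices, which can compromise privacy and erode trust in AI services.
Legislation such as the "right to be forgotten" underscores the need for methods that can expunge sensitive information from trained models.
Machine unlearning addresses this by selectively removing specific data from a model without requiring complete retraining.
However, it frequently runs into a privacy-accuracy tradeoff, in which protecting privacy degrades model performance.
This paper introduces a novel two-stage unlearning strategy for a multi-head transformer-based HTR model that integrates pruning and random labeling.
The proposed method uses a writer classification head as both an indicator of and a trigger for unlearning, while maintaining the efficacy of the recognition head.
To the authors' knowledge, this is the first comprehensive exploration of machine unlearning within HTR tasks.
Membership Inference Attacks (MIA) are also used to evaluate how effectively user-identifiable information is unlearned.
Extensive experiments show that the approach preserves privacy while maintaining model accuracy, paving the way for new research directions in the document analysis community.
The code will be made publicly available upon acceptance.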
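
The random-labeling ingredient can be sketched as follows. This is a generic illustration that assumes writer identities are integer class labels and that samples from writers to be forgotten receive a random different writer label before fine-tuning; the abstract does not give the exact procedure, and the function name and toy labels are hypothetical.

```python
import random

def random_relabel(writer_labels, forget_writers, num_writers, seed=0):
    """Replace labels of forgotten writers with a random *different* writer ID."""
    rng = random.Random(seed)
    relabeled = []
    for label in writer_labels:
        if label in forget_writers:
            choices = [w for w in range(num_writers) if w != label]
            relabeled.append(rng.choice(choices))
        else:
            relabeled.append(label)  # retained writers keep their labels
    return relabeled

labels = [0, 1, 2, 1]
new_labels = random_relabel(labels, forget_writers={1}, num_writers=3)
```

Training the writer classification head on such labels would push it away from recognizing the forgotten writers, while the transcription targets for the recognition head are left untouched.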

Summary (original)

Handwritten Text Recognition (HTR) is essential for document analysis and digitization. However, handwritten data often contains user-identifiable information, such as unique handwriting styles and personal lexicon choices, which can compromise privacy and erode trust in AI services. Legislation like the “right to be forgotten” underscores the necessity for methods that can expunge sensitive information from trained models. Machine unlearning addresses this by selectively removing specific data from models without necessitating complete retraining. Yet, it frequently encounters a privacy-accuracy tradeoff, where safeguarding privacy leads to diminished model performance. In this paper, we introduce a novel two-stage unlearning strategy for a multi-head transformer-based HTR model, integrating pruning and random labeling. Our proposed method utilizes a writer classification head both as an indicator and a trigger for unlearning, while maintaining the efficacy of the recognition head. To our knowledge, this represents the first comprehensive exploration of machine unlearning within HTR tasks. We further employ Membership Inference Attacks (MIA) to evaluate the effectiveness of unlearning user-identifiable information. Extensive experiments demonstrate that our approach effectively preserves privacy while maintaining model accuracy, paving the way for new research directions in the document analysis community. Our code will be publicly available upon acceptance.

arXiv information

Authors: Lei Kang, Xuanshuo Fu, Lluis Gomez, Alicia Fornés, Ernest Valveny, Dimosthenis Karatzas
Published: 2025-04-11 15:21:12+00:00
arXiv site: arxiv_id(pdf)

Categories: cs.CV

Efficient Mixture of Geographical Species for On Device Wildlife Monitoring

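Posted: April 14, 2025 | Author: jarxiv

Summary

Efficient on-device models have become attractive for near-sensor insight generation, which is of particular interest to the ecological conservation community.
For this reason, deep learning researchers are proposing more approaches to developing lower-compute models.
However, because vision transformers are very new to the edge use case, some approaches remain unexplored, most notably conditional execution of subnetworks based on input data.
This work explores training a single species detector that uses conditional computation to bias structured subnetworks in a geographically aware manner.
It proposes a method for pruning the expert model per location and demonstrates conditional computation performance on two geographically distributed datasets: iNaturalist and iWildcam.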
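
Location-conditioned computation can be pictured with a toy layer that evaluates only the hidden units kept for a given region. This is a simplified sketch under assumptions: the per-location unit lists, region names, and weights are hypothetical, and the paper's actual routing and pruning criteria are not described in the abstract.

```python
def masked_layer(x, weights, active_units):
    """Compute ReLU outputs only for the hidden units kept for this location."""
    return {j: max(0.0, sum(w * xi for w, xi in zip(weights[j], x)))
            for j in active_units}

# Hypothetical pruning result: which hidden units each region retains.
location_masks = {"region_a": [0, 2], "region_b": [1]}

weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 hidden units, 2 inputs
x = [1.0, 2.0]
out_a = masked_layer(x, weights, location_masks["region_a"])
out_b = masked_layer(x, weights, location_masks["region_b"])
```

Units outside the active list are never evaluated, which is where the compute saving on an edge device would come from.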

Summary (original)

Efficient on-device models have become attractive for near-sensor insight generation, of particular interest to the ecological conservation community. For this reason, deep learning researchers are proposing more approaches to develop lower compute models. However, since vision transformers are very new to the edge use case, there are still unexplored approaches, most notably conditional execution of subnetworks based on input data. In this work, we explore the training of a single species detector which uses conditional computation to bias structured sub networks in a geographically-aware manner. We propose a method for pruning the expert model per location and demonstrate conditional computation performance on two geographically distributed datasets: iNaturalist and iWildcam.

arXiv information

Authors: Emmanuel Azuh Mensah, Joban Mand, Yueheng Ou, Min Jang, Kurtis Heimerl
Published: 2025-04-11 15:25:36+00:00
arXiv site: arxiv_id(pdf)

Categories: cs.CV

Task-conditioned Ensemble of Expert Models for Continuous Learning

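Posted: April 14, 2025 | Author: jarxiv

Summary

One of the major challenges in machine learning is maintaining the accuracy of a deployed model (e.g., a classifier) in a non-stationary environment.
A non-stationary environment causes distribution shifts and, consequently, a degradation in accuracy.
Continuously learning the deployed model with new data could be one remedy.
The question is how to update the model with new training data so that it retains its accuracy on the old data while adapting to the new data.
This work proposes a task-conditioned ensemble of models to maintain the performance of the existing model.
The method involves an ensemble of expert models based on task membership information.
In-domain models based on the local outlier concept (distinct from the expert models) provide task membership information dynamically at run time for each probe sample.
The proposed method is evaluated in three setups: the first represents distribution shift between tasks (LivDet-Iris-2017), the second represents distribution shift both between and within tasks (LivDet-Iris-2020), and the third represents disjoint distributions between tasks (Split MNIST).
The experiments highlight the benefits of the proposed method.
The source code is available at https://github.com/iPRoBe-lab/Continuous_Learning_FE_DM.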
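
The routing idea, in which an outlier-style score decides which task a probe sample belongs to and therefore which expert answers, can be sketched as below. This is an assumption-laden simplification: it uses the mean distance to the k nearest reference samples as the outlier score rather than a full local outlier factor, and the task names, reference points, and stand-in experts are hypothetical.

```python
import math

def knn_score(sample, reference, k=2):
    """Mean distance to the k nearest reference samples (lower = more in-domain)."""
    distances = sorted(math.dist(sample, r) for r in reference)
    return sum(distances[:k]) / k

def route(sample, task_refs, experts, k=2):
    """Pick the task whose reference set the sample fits best, then ask its expert."""
    task = min(task_refs, key=lambda t: knn_score(sample, task_refs[t], k))
    return task, experts[task](sample)

# Toy reference features per task and stand-in experts (hypothetical).
task_refs = {
    "task_A": [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)],
    "task_B": [(5.0, 5.0), (5.0, 6.0), (6.0, 5.0)],
}
experts = {"task_A": lambda s: "live", "task_B": lambda s: "spoof"}
decision = route((0.2, 0.1), task_refs, experts)
```

Because membership is decided per sample at run time, adding a new task in this sketch only means adding one more reference set and one more expert.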

Summary (original)

One of the major challenges in machine learning is maintaining the accuracy of the deployed model (e.g., a classifier) in a non-stationary environment. The non-stationary environment results in distribution shifts and, consequently, a degradation in accuracy. Continuous learning of the deployed model with new data could be one remedy. However, the question arises as to how we should update the model with new training data so that it retains its accuracy on the old data while adapting to the new data. In this work, we propose a task-conditioned ensemble of models to maintain the performance of the existing model. The method involves an ensemble of expert models based on task membership information. The in-domain models-based on the local outlier concept (different from the expert models) provide task membership information dynamically at run-time to each probe sample. To evaluate the proposed method, we experiment with three setups: the first represents distribution shift between tasks (LivDet-Iris-2017), the second represents distribution shift both between and within tasks (LivDet-Iris-2020), and the third represents disjoint distribution between tasks (Split MNIST). The experiments highlight the benefits of the proposed method. The source code is available at https://github.com/iPRoBe-lab/Continuous_Learning_FE_DM.

arXiv information

Authors: Renu Sharma, Debasmita Pal, Arun Ross
Published: 2025-04-11 15:27:29+00:00
arXiv site: arxiv_id(pdf)

Categories: cs.AI, cs.CV, cs.LG

Latent Diffusion Autoencoders: Toward Efficient and Meaningful Unsupervised Representation Learning in Medical Imaging

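Posted: April 14, 2025 | Author: jarxiv

Summary

This study presents the Latent Diffusion Autoencoder (LDAE), a novel encoder-decoder diffusion-based framework for efficient and meaningful unsupervised learning in medical imaging, focusing on Alzheimer's disease (AD) with brain MR from the ADNI database as a case study.
Unlike conventional diffusion autoencoders that operate in image space, LDAE applies the diffusion process in a compressed latent representation, improving computational efficiency and making 3D medical imaging representation learning tractable.
To validate the approach, two key hypotheses are explored: (i) LDAE effectively captures meaningful semantic representations of 3D brain MR associated with AD and ageing, and (ii) LDAE achieves high-quality image generation and reconstruction while being computationally efficient.
Experimental results support both hypotheses: (i) linear-probe evaluations show promising diagnostic performance for AD (ROC-AUC: 90%, ACC: 84%) and age prediction (MAE: 4.1 years, RMSE: 5.2 years).
(ii) The learned semantic representations enable attribute manipulation, yielding anatomically plausible modifications.
(iii) Semantic interpolation experiments show strong reconstruction of missing scans, with an SSIM of 0.969 (MSE: 0.0019) for a 6-month gap.
Even for longer gaps (24 months), the model maintains robust performance (SSIM > 0.93, MSE < 0.004), indicating an ability to capture temporal progression trends.
(iv) Compared with conventional diffusion autoencoders, LDAE significantly increases inference throughput (20x faster) while also improving reconstruction quality.
These findings position LDAE as a promising framework for scalable medical imaging applications, with the potential to serve as a foundation model for medical image analysis.
Code is available at https://github.com/GabrieleLozupone/LDAE.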
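
The semantic interpolation experiment can be illustrated with a short sketch. It assumes each scan is encoded into a semantic vector and that a missing intermediate scan is estimated by linearly interpolating between the codes of the two neighboring scans before decoding; whether LDAE uses linear or another form of interpolation is not stated in the abstract, and the vectors below are hypothetical.

```python
def lerp(z_a, z_b, t):
    """Linear interpolation between two semantic codes, with t in [0, 1]."""
    return [(1 - t) * a + t * b for a, b in zip(z_a, z_b)]

# Hypothetical semantic codes for scans at month 0 and month 12.
z_month_0 = [0.0, 2.0]
z_month_12 = [1.0, 4.0]

# Estimate the code of a missing month-6 scan (t = 6 / 12).
z_month_6 = lerp(z_month_0, z_month_12, 6 / 12)
```

The interpolated code would then condition the decoder to reconstruct the missing scan, whose quality is what the reported SSIM and MSE values measure.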

Summary (original)

This study presents Latent Diffusion Autoencoder (LDAE), a novel encoder-decoder diffusion-based framework for efficient and meaningful unsupervised learning in medical imaging, focusing on Alzheimer disease (AD) using brain MR from the ADNI database as a case study. Unlike conventional diffusion autoencoders operating in image space, LDAE applies the diffusion process in a compressed latent representation, improving computational efficiency and making 3D medical imaging representation learning tractable. To validate the proposed approach, we explore two key hypotheses: (i) LDAE effectively captures meaningful semantic representations on 3D brain MR associated with AD and ageing, and (ii) LDAE achieves high-quality image generation and reconstruction while being computationally efficient. Experimental results support both hypotheses: (i) linear-probe evaluations demonstrate promising diagnostic performance for AD (ROC-AUC: 90%, ACC: 84%) and age prediction (MAE: 4.1 years, RMSE: 5.2 years); (ii) the learned semantic representations enable attribute manipulation, yielding anatomically plausible modifications; (iii) semantic interpolation experiments show strong reconstruction of missing scans, with SSIM of 0.969 (MSE: 0.0019) for a 6-month gap. Even for longer gaps (24 months), the model maintains robust performance (SSIM > 0.93, MSE < 0.004), indicating an ability to capture temporal progression trends; (iv) compared to conventional diffusion autoencoders, LDAE significantly increases inference throughput (20x faster) while also enhancing reconstruction quality. These findings position LDAE as a promising framework for scalable medical imaging applications, with the potential to serve as a foundation model for medical image analysis. Code available at https://github.com/GabrieleLozupone/LDAE

arXiv information

Authors: Gabriele Lozupone, Alessandro Bria, Francesco Fontanella, Frederick J. A. Meijer, Claudio De Stefano, Henkjan Huisman
Published: 2025-04-11 15:37:46+00:00
arXiv site: arxiv_id(pdf)

Categories: (Primary), 41A05, 41A10, 65D05, 65D17, cs.CV

Training-free Guidance in Text-to-Video Generation via Multimodal Planning and Structured Noise Initialization

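Posted: April 14, 2025 | Author: jarxiv

Summary

Recent advances in text-to-video (T2V) diffusion models have significantly improved the visual quality of generated videos.
However, even recent T2V models struggle to follow text descriptions accurately, especially when the prompt requires precise control of spatial layouts or object trajectories.
A recent line of research uses layout guidance for T2V models, which requires fine-tuning or iterative manipulation of the attention map at inference time.
This significantly increases the memory requirement and makes it difficult to adopt a large T2V model as a backbone.
To address this, the paper introduces Video-MSG, a training-free guidance method for T2V generation based on multimodal planning and structured noise initialization.
Video-MSG consists of three steps. In the first two, it creates a Video Sketch: a fine-grained spatio-temporal plan for the final video, in the form of draft video frames, that specifies background, foreground, and object trajectories.
In the last step, Video-MSG guides a downstream T2V diffusion model with the Video Sketch through noise inversion and denoising.
Notably, Video-MSG needs neither fine-tuning nor attention manipulation with additional memory at inference time, which makes it easier to adopt large T2V models.
Video-MSG is shown to improve text alignment with multiple T2V backbones (VideoCrafter2 and CogVideoX-5B) on popular T2V generation benchmarks (T2VCompBench and VBench).
Comprehensive ablation studies cover the noise inversion ratio, different background generators, background object detection, and foreground object segmentation.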
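
The structured noise initialization can be sketched with the standard forward-diffusion formula, x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε. This is an assumption: the abstract only says the Video Sketch guides the model "through noise inversion and denoising", so the sketch below shows the generic mechanism of partially noising a draft and leaving the rest to the denoiser. The function name and the flat list standing in for a frame are hypothetical.

```python
import math
import random

def noise_to_step(x0, alpha_bar_t, rng):
    """Partially noise a draft: x_t = sqrt(abar) * x0 + sqrt(1 - abar) * eps."""
    signal = math.sqrt(alpha_bar_t)
    noise = math.sqrt(1.0 - alpha_bar_t)
    return [signal * x + noise * rng.gauss(0.0, 1.0) for x in x0]

rng = random.Random(0)
draft_frame = [0.5, -1.0, 0.25]   # stand-in for Video Sketch pixels or latents

# alpha_bar close to 1 keeps most of the sketch; close to 0 is almost pure noise.
mostly_sketch = noise_to_step(draft_frame, alpha_bar_t=0.9, rng=rng)
untouched = noise_to_step(draft_frame, alpha_bar_t=1.0, rng=rng)
```

Under this reading, the noise inversion ratio studied in the ablations would control how far along the noise schedule the draft is pushed, trading layout fidelity against the backbone's freedom to refine appearance.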

Summary (original)

Recent advancements in text-to-video (T2V) diffusion models have significantly enhanced the visual quality of the generated videos. However, even recent T2V models find it challenging to follow text descriptions accurately, especially when the prompt requires accurate control of spatial layouts or object trajectories. A recent line of research uses layout guidance for T2V models that require fine-tuning or iterative manipulation of the attention map during inference time. This significantly increases the memory requirement, making it difficult to adopt a large T2V model as a backbone. To address this, we introduce Video-MSG, a training-free Guidance method for T2V generation based on Multimodal planning and Structured noise initialization. Video-MSG consists of three steps, where in the first two steps, Video-MSG creates Video Sketch, a fine-grained spatio-temporal plan for the final video, specifying background, foreground, and object trajectories, in the form of draft video frames. In the last step, Video-MSG guides a downstream T2V diffusion model with Video Sketch through noise inversion and denoising. Notably, Video-MSG does not need fine-tuning or attention manipulation with additional memory during inference time, making it easier to adopt large T2V models. Video-MSG demonstrates its effectiveness in enhancing text alignment with multiple T2V backbones (VideoCrafter2 and CogVideoX-5B) on popular T2V generation benchmarks (T2VCompBench and VBench). We provide comprehensive ablation studies about noise inversion ratio, different background generators, background object detection, and foreground object segmentation.

arXiv information

Authors: Jialu Li, Shoubin Yu, Han Lin, Jaemin Cho, Jaehong Yoon, Mohit Bansal
Published: 2025-04-11 15:41:43+00:00
arXiv site: arxiv_id(pdf)

Categories: cs.AI, cs.CL, cs.CV

Title block detection and information extraction for enhanced building drawings search

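Posted: April 14, 2025 | Author: jarxiv

Summary

The architecture, engineering, and construction (AEC) industry still relies heavily on information stored in drawings for building construction, maintenance, compliance, and error checks.
However, information extraction (IE) from building drawings is often time-consuming and costly, especially for historical buildings.
Drawing search can be simplified by leveraging the information stored in the title block of a drawing, which can be seen as drawing metadata.
However, title block IE can be complex, especially for historical drawings that do not follow existing standards for uniformity.
This work compares existing methods for this kind of IE task and then proposes a novel title block detection and IE pipeline that outperforms them, particularly on complex, noisy historical drawings.
The pipeline combines a lightweight Convolutional Neural Network with GPT-4o. The proposed inference pipeline detects building engineering title blocks with high accuracy and then extracts structured drawing metadata from them, which can be used for drawing search, filtering, and grouping.
The work demonstrates high accuracy and efficiency in IE for both vector (CAD) and hand-drawn (historical) drawings.
A user interface (UI) that leverages the extracted metadata for drawing search was built and deployed on real projects, where it demonstrated significant time savings.
In addition, an extensible domain-expert-annotated dataset for title block detection was developed through an efficient AEC-friendly annotation workflow that lays the foundation for future work.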
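
Once title-block metadata has been extracted into structured records, drawing search reduces to filtering those records. The sketch below shows that last step only; the field names (`title`, `revision`), file names, and records are hypothetical and do not come from the paper, and the detection and extraction stages are assumed to have already run.

```python
def search_drawings(records, **criteria):
    """Return file names whose metadata contains every requested value."""
    def matches(record):
        return all(str(value).lower() in str(record.get(field, "")).lower()
                   for field, value in criteria.items())
    return [r["file"] for r in records if matches(r)]

# Hypothetical metadata as it might come out of a title-block extractor.
records = [
    {"file": "A-101.pdf", "title": "Ground Floor Plan", "revision": "B"},
    {"file": "S-201.pdf", "title": "Roof Framing Plan", "revision": "A"},
]
hits = search_drawings(records, title="floor plan")
```

Case-insensitive substring matching is a deliberately forgiving choice here, since metadata extracted from noisy historical drawings is unlikely to be perfectly normalized.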

Summary (original)

The architecture, engineering, and construction (AEC) industry still heavily relies on information stored in drawings for building construction, maintenance, compliance and error checks. However, information extraction (IE) from building drawings is often time-consuming and costly, especially when dealing with historical buildings. Drawing search can be simplified by leveraging the information stored in the title block portion of the drawing, which can be seen as drawing metadata. However, title block IE can be complex especially when dealing with historical drawings which do not follow existing standards for uniformity. This work performs a comparison of existing methods for this kind of IE task, and then proposes a novel title block detection and IE pipeline which outperforms existing methods, in particular when dealing with complex, noisy historical drawings. The pipeline is obtained by combining a lightweight Convolutional Neural Network and GPT-4o, the proposed inference pipeline detects building engineering title blocks with high accuracy, and then extract structured drawing metadata from the title blocks, which can be used for drawing search, filtering and grouping. The work demonstrates high accuracy and efficiency in IE for both vector (CAD) and hand-drawn (historical) drawings. A user interface (UI) that leverages the extracted metadata for drawing search is established and deployed on real projects, which demonstrates significant time savings. Additionally, an extensible domain-expert-annotated dataset for title block detection is developed, via an efficient AEC-friendly annotation workflow that lays the foundation for future work.

arXiv information

Authors: Alessio Lombardi, Li Duan, Ahmed Elnagar, Ahmed Zaalouk, Khalid Ismail, Edlira Vakaj
Published: 2025-04-11 15:45:17+00:00
arXiv site: arxiv_id(pdf)

Categories: cs.AI, cs.CV

MBE-ARI: A Multimodal Dataset Mapping Bi-directional Engagement in Animal-Robot Interaction

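Posted: April 14, 2025 | Author: jarxiv

Summary

Animal-robot interaction (ARI) remains an unexplored challenge in robotics, because robots struggle to interpret the complex, multimodal communication cues of animals, such as body language, movement, and vocalizations.
Unlike human-robot interaction, which benefits from established datasets and frameworks, animal-robot interaction lacks the foundational resources needed to facilitate meaningful bidirectional communication.
To bridge this gap, the paper presents MBE-ARI (Multimodal Bidirectional Engagement in Animal-Robot Interaction), a novel multimodal dataset that captures detailed interactions between a legged robot and cows.
The dataset includes synchronized RGB-D streams from multiple viewpoints, annotated with body pose and activity labels across interaction phases, offering an unprecedented level of detail for ARI research.
The paper also introduces a full-body pose estimation model tailored to quadruped animals, which tracks 39 keypoints with a mean average precision (mAP) of 92.7% and outperforms existing benchmarks in animal pose estimation.
The MBE-ARI dataset and pose estimation framework lay a robust foundation for advancing research in animal-robot interaction, providing essential tools for developing the perception, reasoning, and interaction frameworks needed for effective collaboration between robots and animals.
The dataset and resources are publicly available at https://github.com/RISELabPurdue/MBE-ARI/, inviting further exploration and development in this area.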

Summary (original)

Animal-robot interaction (ARI) remains an unexplored challenge in robotics, as robots struggle to interpret the complex, multimodal communication cues of animals, such as body language, movement, and vocalizations. Unlike human-robot interaction, which benefits from established datasets and frameworks, animal-robot interaction lacks the foundational resources needed to facilitate meaningful bidirectional communication. To bridge this gap, we present the MBE-ARI (Multimodal Bidirectional Engagement in Animal-Robot Interaction), a novel multimodal dataset that captures detailed interactions between a legged robot and cows. The dataset includes synchronized RGB-D streams from multiple viewpoints, annotated with body pose and activity labels across interaction phases, offering an unprecedented level of detail for ARI research. Additionally, we introduce a full-body pose estimation model tailored for quadruped animals, capable of tracking 39 keypoints with a mean average precision (mAP) of 92.7%, outperforming existing benchmarks in animal pose estimation. The MBE-ARI dataset and our pose estimation framework lay a robust foundation for advancing research in animal-robot interaction, providing essential tools for developing perception, reasoning, and interaction frameworks needed for effective collaboration between robots and animals. The dataset and resources are publicly available at https://github.com/RISELabPurdue/MBE-ARI/, inviting further exploration and development in this critical area.

arXiv information

Authors: Ian Noronha, Advait Prasad Jawaji, Juan Camilo Soto, Jiajun An, Yan Gu, Upinder Kaur
Published: 2025-04-11 15:45:23+00:00
arXiv site: arxiv_id(pdf)

Categories: cs.CV, cs.RO

A Multi-Modal AI System for Screening Mammography: Integrating 2D and 3D Imaging to Improve Breast Cancer Detection in a Prospective Clinical Study

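Posted: April 14, 2025 | Author: jarxiv

Summary

Although digital breast tomosynthesis (DBT) improves diagnostic performance over full-field digital mammography (FFDM), false-positive recalls remain a concern in breast cancer screening.
The authors developed a multi-modal artificial intelligence system that integrates FFDM, synthetic mammography, and DBT to provide breast-level predictions and bounding-box localizations of suspicious findings.
Trained on approximately 500,000 mammography exams, the AI system achieved 0.945 AUROC on an internal test set.
It demonstrated the capacity to reduce recalls by 31.7% and radiologist workload by 43.8% while maintaining 100% sensitivity, underscoring its potential to improve clinical workflows.
External validation confirmed strong generalizability, reducing the gap to a perfect AUROC by 35.31%-69.14% relative to strong baselines.
In prospective deployment across 18 sites, the system reduced recall rates for low-risk cases.
An improved version, trained on over 750,000 exams with additional labels, further reduced the gap by 18.86%-56.62% across large external datasets.
Overall, these results underscore the importance of using all available imaging modalities, demonstrate the potential for clinical impact, and indicate that test error can be reduced further by enlarging the training set when using large-capacity neural networks.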
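
The "gap to a perfect AUROC" figures are relative reductions in (1 − AUROC). A small helper makes the arithmetic explicit; the baseline and model values in the example are hypothetical and chosen only to show the calculation, not taken from the paper.

```python
def gap_reduction(auroc_baseline, auroc_model):
    """Percentage reduction of the gap (1 - AUROC) relative to a baseline."""
    gap_baseline = 1.0 - auroc_baseline
    gap_model = 1.0 - auroc_model
    return 100.0 * (gap_baseline - gap_model) / gap_baseline

# Hypothetical example: baseline AUROC 0.90, new model AUROC 0.95.
reduction = gap_reduction(0.90, 0.95)   # the gap shrinks from 0.10 to 0.05
```

Expressed this way, an improvement from 0.90 to 0.95 counts as halving the remaining error, which is why relative gap reduction can look large even when the absolute AUROC change is a few points.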

Summary (original)

Although digital breast tomosynthesis (DBT) improves diagnostic performance over full-field digital mammography (FFDM), false-positive recalls remain a concern in breast cancer screening. We developed a multi-modal artificial intelligence system integrating FFDM, synthetic mammography, and DBT to provide breast-level predictions and bounding-box localizations of suspicious findings. Our AI system, trained on approximately 500,000 mammography exams, achieved 0.945 AUROC on an internal test set. It demonstrated capacity to reduce recalls by 31.7% and radiologist workload by 43.8% while maintaining 100% sensitivity, underscoring its potential to improve clinical workflows. External validation confirmed strong generalizability, reducing the gap to a perfect AUROC by 35.31%-69.14% relative to strong baselines. In prospective deployment across 18 sites, the system reduced recall rates for low-risk cases. An improved version, trained on over 750,000 exams with additional labels, further reduced the gap by 18.86%-56.62% across large external datasets. Overall, these results underscore the importance of utilizing all available imaging modalities, demonstrate the potential for clinical impact, and indicate feasibility of further reduction of the test error with increased training set when using large-capacity neural networks.

arXiv information

Authors: Jungkyu Park, Jan Witowski, Yanqi Xu, Hari Trivedi, Judy Gichoya, Beatrice Brown-Mulry, Malte Westerhoff, Linda Moy, Laura Heacock, Alana Lewin, Krzysztof J. Geras
Published: 2025-04-11 15:53:20+00:00
arXiv site: arxiv_id(pdf)

Categories: cs.CV, cs.LG, eess.IV
