jarxiv | Japanese arxiv | Page 528

Deep Representation Learning for Unsupervised Clustering of Myocardial Fiber Trajectories in Cardiac Diffusion Tensor Imaging

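Posted: May 14, 2025 | Author: jarxiv

Summary

Understanding the complex architecture of the myocardium is essential for diagnosing and treating heart disease.
Existing methods, however, struggle to capture this intricate structure accurately from diffusion tensor imaging (DTI) data, mainly because ground-truth labels are lacking and fiber trajectories are ambiguous and intertwined.
We present a new deep learning framework for unsupervised clustering of myocardial fibers, offering a data-driven approach to identifying distinct fiber bundles.
We combine a bidirectional long short-term memory (BiLSTM) network, which captures local sequential information along each fiber, with a Transformer autoencoder, which learns global shape features, and incorporate essential anatomical context point by point.
Clustering these representations with a density-based algorithm identifies 33 to 62 robust clusters and captures subtle distinctions among fiber trajectories at varying levels of granularity.
Our framework offers a new, flexible, and quantitative way to analyze myocardial structure, achieving a level of delineation that, to our knowledge, has not been achieved before, with potential applications in improving surgical planning, characterizing disease-related remodeling, and ultimately advancing personalized cardiac care.

Summary (original)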

Understanding the complex myocardial architecture is critical for diagnosing and treating heart disease. However, existing methods often struggle to accurately capture this intricate structure from Diffusion Tensor Imaging (DTI) data, particularly due to the lack of ground truth labels and the ambiguous, intertwined nature of fiber trajectories. We present a novel deep learning framework for unsupervised clustering of myocardial fibers, providing a data-driven approach to identifying distinct fiber bundles. We uniquely combine a Bidirectional Long Short-Term Memory network to capture local sequential information along fibers, with a Transformer autoencoder to learn global shape features, with pointwise incorporation of essential anatomical context. Clustering these representations using a density-based algorithm identifies 33 to 62 robust clusters, successfully capturing the subtle distinctions in fiber trajectories with varying levels of granularity. Our framework offers a new, flexible, and quantitative way to analyze myocardial structure, achieving a level of delineation that, to our knowledge, has not been previously achieved, with potential applications in improving surgical planning, characterizing disease-related remodeling, and ultimately, advancing personalized cardiac care.

arXiv information

Authors	Mohini Anand, Xavier Tricoche
Published	2025-05-13 16:47:56+00:00

Source and services used

arxiv.jp, Google

Categories: cs.CV, cs.LG, eess.IV

Visual Imitation Enables Contextual Humanoid Control

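Posted: May 14, 2025 | Author: jarxiv

Summary

How can we teach humanoids to climb stairs and sit on chairs using the context of their surroundings?
Arguably, the simplest way is to show them: casually capture a video of human motion and feed it to the humanoid.
We introduce VIDEOMIMIC, a real-to-sim-to-real pipeline that mines everyday videos, jointly reconstructs the humans and the environment, and produces whole-body control policies for humanoid robots that perform the corresponding skills.
We demonstrate the pipeline on real humanoid robots, showing robust, repeatable contextual control, such as ascending and descending stairs and sitting down on and standing up from chairs and benches, along with other dynamic whole-body skills, all from a single policy conditioned on the environment and global root commands.
VIDEOMIMIC offers a scalable path toward teaching humanoids to operate in diverse real-world environments.

Summary (original)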

How can we teach humanoids to climb staircases and sit on chairs using the surrounding environment context? Arguably, the simplest way is to just show them: casually capture a human motion video and feed it to humanoids. We introduce VIDEOMIMIC, a real-to-sim-to-real pipeline that mines everyday videos, jointly reconstructs the humans and the environment, and produces whole-body control policies for humanoid robots that perform the corresponding skills. We demonstrate the results of our pipeline on real humanoid robots, showing robust, repeatable contextual control such as staircase ascents and descents, sitting and standing from chairs and benches, as well as other dynamic whole-body skills, all from a single policy, conditioned on the environment and global root commands. VIDEOMIMIC offers a scalable path towards teaching humanoids to operate in diverse real-world environments.

arXiv information

Authors	Arthur Allshire, Hongsuk Choi, Junyi Zhang, David McAllister, Anthony Zhang, Chung Min Kim, Trevor Darrell, Pieter Abbeel, Jitendra Malik, Angjoo Kanazawa
Published	2025-05-13 16:48:41+00:00

Categories: cs.CV, cs.RO

VideoUFO: A Million-Scale User-Focused Dataset for Text-to-Video Generation

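Posted: May 14, 2025 | Author: jarxiv

Summary

Text-to-video generative models turn text prompts into dynamic visual content, with broad applications in film production, gaming, and education.
However, their real-world performance often falls short of user expectations.
One key reason is that these models have not been trained on videos related to some of the topics users want to create.
In this paper, we propose VideoUFO, the first video dataset curated specifically to align with users' focus in real-world scenarios.
Beyond this, VideoUFO also features (1) minimal (0.29%) overlap with existing video datasets and (2) videos searched exclusively through YouTube's official API under the Creative Commons license.
These two attributes give future researchers greater freedom to broaden their training sources.
VideoUFO comprises over 1.09 million video clips, each paired with both a brief and a detailed caption (description).
Specifically, through clustering, we first identify 1,291 user-focused topics from VidProM, a million-scale dataset of real text-to-video prompts.
We then use these topics to retrieve videos from YouTube, split the retrieved videos into clips, and generate both brief and detailed captions for each clip.
After verifying the clips against the specified topics, about 1.09 million video clips remain.
Our experiments reveal that (1) 16 current text-to-video models do not achieve consistent performance across all user-focused topics, and (2) a simple model trained on VideoUFO outperforms the others on the worst-performing topics.
The dataset and code are publicly available at https://huggingface.co/datasets/WenhaoWang/VideoUFO and https://github.com/WangWenhao0716/BenchUFO under the CC BY 4.0 license.

Summary (original)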

Text-to-video generative models convert textual prompts into dynamic visual content, offering wide-ranging applications in film production, gaming, and education. However, their real-world performance often falls short of user expectations. One key reason is that these models have not been trained on videos related to some topics users want to create. In this paper, we propose VideoUFO, the first Video dataset specifically curated to align with Users’ FOcus in real-world scenarios. Beyond this, our VideoUFO also features: (1) minimal (0.29%) overlap with existing video datasets, and (2) videos searched exclusively via YouTube’s official API under the Creative Commons license. These two attributes provide future researchers with greater freedom to broaden their training sources. The VideoUFO comprises over 1.09 million video clips, each paired with both a brief and a detailed caption (description). Specifically, through clustering, we first identify 1,291 user-focused topics from the million-scale real text-to-video prompt dataset, VidProM. Then, we use these topics to retrieve videos from YouTube, split the retrieved videos into clips, and generate both brief and detailed captions for each clip. After verifying the clips with specified topics, we are left with about 1.09 million video clips. Our experiments reveal that (1) current 16 text-to-video models do not achieve consistent performance across all user-focused topics; and (2) a simple model trained on VideoUFO outperforms others on worst-performing topics. The dataset and code are publicly available at https://huggingface.co/datasets/WenhaoWang/VideoUFO and https://github.com/WangWenhao0716/BenchUFO under the CC BY 4.0 License.

arXiv information

Authors	Wenhao Wang, Yi Yang
Published	2025-05-13 16:54:08+00:00

Categories: cs.CV

Advancing Food Nutrition Estimation via Visual-Ingredient Feature Fusion

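Posted: May 14, 2025 | Author: jarxiv

Summary

Nutrition estimation is an important part of promoting healthy eating and reducing diet-related health risks.
Despite advances in tasks such as food classification and ingredient recognition, progress in nutrition estimation has been limited by the scarcity of datasets with nutritional annotations.
To address this, we introduce FastFood, a dataset of 84,446 images across 908 fast food categories with ingredient and nutritional annotations.
In addition, we propose a new model-agnostic Visual-Ingredient Feature Fusion (VIF$^2$) method that enhances nutrition estimation by integrating visual and ingredient features.
Ingredient robustness is improved during training through synonym replacement and resampling strategies.
The ingredient-aware visual feature fusion module combines ingredient features with visual representations to achieve accurate nutritional prediction.
At test time, ingredient predictions are refined with large multimodal models through data augmentation and majority voting.
Experiments on both the FastFood and Nutrition5k datasets validate the effectiveness of the proposed method when built into different backbones (e.g., ResNet, InceptionV3, and ViT), demonstrating the importance of ingredient information in nutrition estimation.
https://huiyanqi.github.io/fastfood-nutrition-estimation/

Summary (original)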

Nutrition estimation is an important component of promoting healthy eating and mitigating diet-related health risks. Despite advances in tasks such as food classification and ingredient recognition, progress in nutrition estimation is limited due to the lack of datasets with nutritional annotations. To address this issue, we introduce FastFood, a dataset with 84,446 images across 908 fast food categories, featuring ingredient and nutritional annotations. In addition, we propose a new model-agnostic Visual-Ingredient Feature Fusion (VIF$^2$) method to enhance nutrition estimation by integrating visual and ingredient features. Ingredient robustness is improved through synonym replacement and resampling strategies during training. The ingredient-aware visual feature fusion module combines ingredient features and visual representation to achieve accurate nutritional prediction. During testing, ingredient predictions are refined using large multimodal models by data augmentation and majority voting. Our experiments on both FastFood and Nutrition5k datasets validate the effectiveness of our proposed method built in different backbones (e.g., Resnet, InceptionV3 and ViT), which demonstrates the importance of ingredient information in nutrition estimation. https://huiyanqi.github.io/fastfood-nutrition-estimation/.
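
The test-time refinement mentioned above, in which ingredient predictions from several augmented views are combined by majority voting, can be illustrated with a minimal Python sketch. The function name majority_vote_ingredients, the min_share threshold, and the sample ingredient lists are hypothetical and not taken from the paper; the abstract does not say how the votes are tallied.

```python
from collections import Counter


def majority_vote_ingredients(predictions, min_share=0.5):
    """Keep ingredients predicted in more than `min_share` of the views.

    `predictions` is a list of ingredient lists, one per augmented view.
    This voting rule is an illustrative assumption, not the paper's method.
    """
    if not predictions:
        return []
    counts = Counter()
    for view in predictions:
        counts.update(set(view))  # count each ingredient once per view
    threshold = min_share * len(predictions)
    return sorted(name for name, n in counts.items() if n > threshold)


# Hypothetical predictions for three augmented views of one image.
views = [
    ["bun", "beef patty", "cheese"],
    ["bun", "beef patty", "lettuce"],
    ["bun", "cheese", "beef patty"],
]
print(majority_vote_ingredients(views))
```

In this toy input, "lettuce" appears in only one of three views and is dropped, while the ingredients seen in at least two views are kept.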

arXiv information

Authors	Huiyan Qi, Bin Zhu, Chong-Wah Ngo, Jingjing Chen, Ee-Peng Lim
Published	2025-05-13 17:01:21+00:00

Categories: cs.AI, cs.CV

Aya Vision: Advancing the Frontier of Multilingual Multimodality

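Posted: May 14, 2025 | Author: jarxiv

Summary

Building multimodal language models is fundamentally challenging: it requires aligning the vision and language modalities, curating high-quality instruction data, and avoiding degradation of existing text-only capabilities once vision is introduced.
These difficulties are magnified in the multilingual setting, where the need for multimodal data in different languages exacerbates existing data scarcity, machine translation often distorts meaning, and catastrophic forgetting is more pronounced.
To address these challenges, we introduce new techniques spanning both data and modeling.
First, we develop a synthetic annotation framework that curates high-quality, diverse multilingual multimodal instruction data, enabling Aya Vision models to produce natural, human-preferred responses to multimodal inputs across many languages.
Complementing this, we propose a cross-modal model merging technique that mitigates catastrophic forgetting, preserving text-only capabilities while enhancing multimodal generative performance.
Aya-Vision-8B achieves best-in-class performance compared with strong multimodal models such as Qwen-2.5-VL-7B, Pixtral-12B, and even the much larger Llama-3.2-90B-Vision.
We further scale this approach with Aya-Vision-32B, which outperforms models more than twice its size, such as Molmo-72B and Llama-3.2-90B-Vision.
Our work advances multilingual progress on the multimodal frontier and offers insights into techniques that effectively bend the need for compute while delivering extremely high performance.

Summary (original)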

Building multimodal language models is fundamentally challenging: it requires aligning vision and language modalities, curating high-quality instruction data, and avoiding the degradation of existing text-only capabilities once vision is introduced. These difficulties are further magnified in the multilingual setting, where the need for multimodal data in different languages exacerbates existing data scarcity, machine translation often distorts meaning, and catastrophic forgetting is more pronounced. To address the aforementioned challenges, we introduce novel techniques spanning both data and modeling. First, we develop a synthetic annotation framework that curates high-quality, diverse multilingual multimodal instruction data, enabling Aya Vision models to produce natural, human-preferred responses to multimodal inputs across many languages. Complementing this, we propose a cross-modal model merging technique that mitigates catastrophic forgetting, effectively preserving text-only capabilities while simultaneously enhancing multimodal generative performance. Aya-Vision-8B achieves best-in-class performance compared to strong multimodal models such as Qwen-2.5-VL-7B, Pixtral-12B, and even much larger Llama-3.2-90B-Vision. We further scale this approach with Aya-Vision-32B, which outperforms models more than twice its size, such as Molmo-72B and LLaMA-3.2-90B-Vision. Our work advances multilingual progress on the multi-modal frontier, and provides insights into techniques that effectively bend the need for compute while delivering extremely high performance.

arXiv information

Authors	Saurabh Dash, Yiyang Nan, John Dang, Arash Ahmadian, Shivalika Singh, Madeline Smith, Bharat Venkitesh, Vlad Shmyhlo, Viraat Aryabumi, Walter Beller-Morales, Jeremy Pekmez, Jason Ozuzu, Pierre Richemond, Acyr Locatelli, Nick Frosst, Phil Blunsom, Aidan Gomez, Ivan Zhang, Marzieh Fadaee, Manoj Govindassamy, Sudip Roy, Matthias Gallé, Beyza Ermis, Ahmet Üstün, Sara Hooker
Published	2025-05-13 17:03:48+00:00

Categories: cs.CL, cs.CV, cs.LG

Towards Autonomous UAV Visual Object Search in City Space: Benchmark and Agentic Methodology

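Posted: May 14, 2025 | Author: jarxiv

Summary

Aerial visual object search (AVOS) tasks in urban environments require unmanned aerial vehicles (UAVs) to autonomously search for and identify target objects using visual and textual cues, without external guidance.
Existing approaches struggle in complex urban environments because of redundant semantic processing, the difficulty of distinguishing similar objects, and the exploration-exploitation dilemma.
To bridge this gap and support the AVOS task, we introduce CityAVOS, the first benchmark dataset for autonomous search of common urban objects.
The dataset comprises 2,420 tasks across six object categories with varying difficulty levels, enabling comprehensive evaluation of UAV agents' search capabilities.
To solve AVOS tasks, we also propose PRPSearcher (Perception-Reasoning-Planning Searcher), a new agentic method powered by multimodal large language models (MLLMs) that mimics three-tier human cognition.
Specifically, PRPSearcher constructs three specialized maps: an object-centric dynamic semantic map that enhances spatial perception, a 3D cognitive map based on semantic attraction values for target reasoning, and a 3D uncertainty map for balanced exploration-exploitation search.
Our approach also incorporates a denoising mechanism to mitigate interference from similar objects and uses an Inspiration Promote Thought (IPT) prompting mechanism for adaptive action planning.
Experimental results on CityAVOS show that PRPSearcher surpasses existing baselines in both success rate and search efficiency (on average: +37.69% SR, +28.96% SPL, -30.69% MSS, and -46.40% NE).
While promising, the performance gap relative to humans highlights the need for better semantic reasoning and spatial exploration capabilities in AVOS tasks.
This work establishes a foundation for future advances in embodied target search.
The dataset and source code are available at https://anonymous.4open.science/r/CityAVOS-3DF8.

Summary (original)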

Aerial Visual Object Search (AVOS) tasks in urban environments require Unmanned Aerial Vehicles (UAVs) to autonomously search for and identify target objects using visual and textual cues without external guidance. Existing approaches struggle in complex urban environments due to redundant semantic processing, similar object distinction, and the exploration-exploitation dilemma. To bridge this gap and support the AVOS task, we introduce CityAVOS, the first benchmark dataset for autonomous search of common urban objects. This dataset comprises 2,420 tasks across six object categories with varying difficulty levels, enabling comprehensive evaluation of UAV agents’ search capabilities. To solve the AVOS tasks, we also propose PRPSearcher (Perception-Reasoning-Planning Searcher), a novel agentic method powered by multi-modal large language models (MLLMs) that mimics human three-tier cognition. Specifically, PRPSearcher constructs three specialized maps: an object-centric dynamic semantic map enhancing spatial perception, a 3D cognitive map based on semantic attraction values for target reasoning, and a 3D uncertainty map for balanced exploration-exploitation search. Also, our approach incorporates a denoising mechanism to mitigate interference from similar objects and utilizes an Inspiration Promote Thought (IPT) prompting mechanism for adaptive action planning. Experimental results on CityAVOS demonstrate that PRPSearcher surpasses existing baselines in both success rate and search efficiency (on average: +37.69% SR, +28.96% SPL, -30.69% MSS, and -46.40% NE). While promising, the performance gap compared to humans highlights the need for better semantic reasoning and spatial exploration capabilities in AVOS tasks. This work establishes a foundation for future advances in embodied target search. Dataset and source code are available at https://anonymous.4open.science/r/CityAVOS-3DF8.
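
The abstract reports SR and SPL without defining them. Assuming they denote the success rate and the success-weighted-by-path-length metric common in embodied navigation benchmarks, they could be computed roughly as in the sketch below. The episode tuple layout, the function names, and the sample values are illustrative assumptions; MSS and NE are left out because the abstract does not define them.

```python
def success_rate(episodes):
    """Fraction of episodes that ended in success."""
    if not episodes:
        raise ValueError("episodes must not be empty")
    return sum(1 for success, _, _ in episodes if success) / len(episodes)


def success_weighted_path_length(episodes):
    """Assumed SPL: mean of success * shortest / max(actual, shortest).

    Each episode is (success, shortest_path, actual_path), with
    shortest_path > 0. Failed episodes contribute zero.
    """
    if not episodes:
        raise ValueError("episodes must not be empty")
    total = 0.0
    for success, shortest, actual in episodes:
        if success:
            total += shortest / max(actual, shortest)
    return total / len(episodes)


# Hypothetical episodes: (success, shortest path, path actually flown).
episodes = [
    (True, 10.0, 10.0),   # optimal path, contributes 1.0
    (True, 10.0, 20.0),   # twice the optimal length, contributes 0.5
    (False, 10.0, 5.0),   # failure, contributes 0.0
]
print(success_rate(episodes))
print(success_weighted_path_length(episodes))
```

Under this definition, an agent is rewarded only for successful searches, and the reward shrinks as the flown path grows longer than the shortest one.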

arXiv information

Authors	Yatai Ji, Zhengqiu Zhu, Yong Zhao, Beidan Liu, Chen Gao, Yihao Zhao, Sihang Qiu, Yue Hu, Quanjun Yin, Yong Li
Published	2025-05-13 17:34:54+00:00

Categories: cs.AI, cs.CV

HarmoniCa: Harmonizing Training and Inference for Better Feature Caching in Diffusion Transformer Acceleration

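Posted: May 14, 2025 | Author: jarxiv

Summary

Diffusion Transformers (DiTs) excel at generative tasks but face practical deployment challenges because of their high inference cost.
Feature caching, which stores and retrieves redundant computations, offers the potential for acceleration.
Existing learning-based caching, though adaptive, overlooks the impact of the prior timestep.
It also suffers from misaligned objectives between training and inference (aligned predicted noise versus high-quality images).
These two discrepancies compromise both performance and efficiency.
To this end, we harmonize training and inference with a new learning-based caching framework called HarmoniCa.
It first incorporates Step-Wise Denoising Training (SDT) to ensure the continuity of the denoising process, so that prior steps can be leveraged.
In addition, an Image Error Proxy-Guided Objective (IEPO) balances image quality against cache utilization through an efficient proxy that approximates the image error.
Extensive experiments across $8$ models, $4$ samplers, and resolutions from $256\times256$ to $2K$ demonstrate the framework's superior performance and speedup.
For instance, it achieves over $40\%$ latency reduction (i.e., a $2.07\times$ theoretical speedup) and improved performance on PixArt-$\alpha$.
Remarkably, our image-free approach reduces training time by $25\%$ compared with the previous method.
Our code is available at https://github.com/ModelTC/HarmoniCa.

Summary (original)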

Diffusion Transformers (DiTs) excel in generative tasks but face practical deployment challenges due to high inference costs. Feature caching, which stores and retrieves redundant computations, offers the potential for acceleration. Existing learning-based caching, though adaptive, overlooks the impact of the prior timestep. It also suffers from misaligned objectives–aligned predicted noise vs. high-quality images–between training and inference. These two discrepancies compromise both performance and efficiency. To this end, we harmonize training and inference with a novel learning-based caching framework dubbed HarmoniCa. It first incorporates Step-Wise Denoising Training (SDT) to ensure the continuity of the denoising process, where prior steps can be leveraged. In addition, an Image Error Proxy-Guided Objective (IEPO) is applied to balance image quality against cache utilization through an efficient proxy to approximate the image error. Extensive experiments across $8$ models, $4$ samplers, and resolutions from $256\times256$ to $2K$ demonstrate superior performance and speedup of our framework. For instance, it achieves over $40\%$ latency reduction (i.e., $2.07\times$ theoretical speedup) and improved performance on PixArt-$\alpha$. Remarkably, our image-free approach reduces training time by $25\%$ compared with the previous method. Our code is available at https://github.com/ModelTC/HarmoniCa.

arXiv information

Authors	Yushi Huang, Zining Wang, Ruihao Gong, Jing Liu, Xinjie Zhang, Jinyang Guo, Xianglong Liu, Jun Zhang
Published	2025-05-13 17:43:47+00:00

Categories: cs.CV

Breast Cancer Histopathology Classification using CBAM-EfficientNetV2 with Transfer Learning

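Posted: May 14, 2025 | Author: jarxiv

Summary

Classifying breast cancer histopathology images is critical for early detection and improved patient outcomes.
This study introduces a new approach that leverages EfficientNetV2 models to improve feature extraction and focus on relevant tissue regions.
The proposed models were evaluated on the BreakHis dataset across multiple magnification scales (40X, 100X, 200X, and 400X).
Among them, EfficientNetV2-XL with CBAM achieved outstanding performance, reaching a peak accuracy of 99.01% and an F1-score of 98.31% at 400X magnification and outperforming state-of-the-art methods.
By integrating Contrast Limited Adaptive Histogram Equalization (CLAHE) for preprocessing and optimizing computational efficiency, the method demonstrates its suitability for real-time clinical deployment.
The results underscore the potential of attention-enhanced, scalable architectures for advancing diagnostic precision in breast cancer detection.

Summary (original)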

Breast cancer histopathology image classification is critical for early detection and improved patient outcomes. This study introduces a novel approach leveraging EfficientNetV2 models to improve feature extraction and focus on relevant tissue regions. The proposed models were evaluated on the BreakHis dataset across multiple magnification scales (40X, 100X, 200X, and 400X). Among them, the EfficientNetV2-XL with CBAM achieved outstanding performance, reaching a peak accuracy of 99.01 percent and an F1-score of 98.31 percent at 400X magnification, outperforming state-of-the-art methods. By integrating Contrast Limited Adaptive Histogram Equalization (CLAHE) for preprocessing and optimizing computational efficiency, this method demonstrates its suitability for real-time clinical deployment. The results underscore the potential of attention-enhanced scalable architectures in advancing diagnostic precision for breast cancer detection.

arXiv information

Authors	Naren Sengodan
Published	2025-05-13 17:49:51+00:00

Categories: cs.CV, cs.LG, eess.IV

Efficient Adaptation For Remote Sensing Visual Grounding

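Posted: May 14, 2025 | Author: jarxiv

Summary

Adapting pre-trained models has become an effective strategy in artificial intelligence, offering a scalable and efficient alternative to training models from scratch.
In remote sensing (RS), where visual grounding (VG) remains underexplored, this approach enables the deployment of powerful vision-language models that achieve robust cross-modal understanding while significantly reducing computational overhead.
To address this, we applied Parameter-Efficient Fine-Tuning (PEFT) techniques to adapt these models to RS-specific VG tasks.
Specifically, we evaluated LoRA placement across different modules of Grounding DINO and used BitFit and adapters to fine-tune the OFA foundation model pre-trained on general-purpose VG datasets.
This approach achieved performance comparable to or surpassing current state-of-the-art (SOTA) models while significantly reducing computational costs.
The study highlights the potential of PEFT techniques to advance efficient and precise multimodal analysis in RS, offering a practical and cost-effective alternative to full model training.

Summary (original)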

Adapting pre-trained models has become an effective strategy in artificial intelligence, offering a scalable and efficient alternative to training models from scratch. In the context of remote sensing (RS), where visual grounding (VG) remains underexplored, this approach enables the deployment of powerful vision-language models to achieve robust cross-modal understanding while significantly reducing computational overhead. To address this, we applied Parameter Efficient Fine Tuning (PEFT) techniques to adapt these models for RS-specific VG tasks. Specifically, we evaluated LoRA placement across different modules in Grounding DINO and used BitFit and adapters to fine-tune the OFA foundation model pre-trained on general-purpose VG datasets. This approach achieved performance comparable to or surpassing current State Of The Art (SOTA) models while significantly reducing computational costs. This study highlights the potential of PEFT techniques to advance efficient and precise multi-modal analysis in RS, offering a practical and cost-effective alternative to full model training.
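
The LoRA adaptation evaluated in the abstract can be illustrated with a small example. This is a minimal pure-Python sketch of the standard LoRA formulation, in which a frozen weight matrix W is augmented with a trainable low-rank product B A scaled by alpha / r. It is not the authors' Grounding DINO setup, and the function names, matrix values, and alpha are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(x, W, A, B, alpha):
    """Compute y = W x + (alpha / r) * B (A x).

    W (d_out x d_in) stays frozen; only A (r x d_in) and B (d_out x r)
    would be trained, so the trainable parameter count is r * (d_in + d_out).
    """
    r = len(A)  # LoRA rank
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


# Hypothetical toy values: d_in = d_out = 2, rank r = 1.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]
B = [[0.5], [0.0]]
x = [1.0, 2.0]

print(lora_forward(x, W, A, B, alpha=2.0))
# With B set to zero, as at LoRA initialization, the output equals W x.
print(lora_forward(x, W, A, [[0.0], [0.0]], alpha=2.0))
```

Because only A and B are updated, the number of trainable parameters scales with the rank r instead of with the full size of W, which is where the reduction in computational cost comes from.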

arXiv information

Authors	Hasan Moughnieh, Mohamad Chalhoub, Hasan Nasrallah, Cristiano Nattero, Paolo Campanella, Giovanni Nico, Ali J. Ghandour
Published	2025-05-13 17:53:11+00:00

Categories: cs.AI, cs.CL, cs.CV

UniSkill: Imitating Human Videos via Cross-Embodiment Skill Representations

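Posted: May 14, 2025 | Author: jarxiv

Summary

Mimicry is a fundamental learning mechanism in humans, enabling individuals to learn new tasks by observing and imitating experts.
Applying this ability to robots, however, presents significant challenges because human and robot embodiments differ inherently in both visual appearance and physical capabilities.
Previous methods bridge this gap using cross-embodiment datasets with shared scenes and tasks, but collecting such aligned human-robot data at scale is not trivial.
In this paper, we propose UniSkill, a new framework that learns embodiment-agnostic skill representations from large-scale cross-embodiment video data without any labels, so that skills extracted from human video prompts transfer effectively to robot policies trained only on robot data.
Experiments in both simulation and real-world environments show that our cross-embodiment skills successfully guide robots in selecting appropriate actions, even with unseen video prompts.
The project website is at https://kimhanjung.github.io/UniSkill.

Summary (original)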

Mimicry is a fundamental learning mechanism in humans, enabling individuals to learn new tasks by observing and imitating experts. However, applying this ability to robots presents significant challenges due to the inherent differences between human and robot embodiments in both their visual appearance and physical capabilities. While previous methods bridge this gap using cross-embodiment datasets with shared scenes and tasks, collecting such aligned data between humans and robots at scale is not trivial. In this paper, we propose UniSkill, a novel framework that learns embodiment-agnostic skill representations from large-scale cross-embodiment video data without any labels, enabling skills extracted from human video prompts to effectively transfer to robot policies trained only on robot data. Our experiments in both simulation and real-world environments show that our cross-embodiment skills successfully guide robots in selecting appropriate actions, even with unseen video prompts. The project website can be found at: https://kimhanjung.github.io/UniSkill.

arXiv information

Authors	Hanjung Kim, Jaehyun Kang, Hyolim Kang, Meedeum Cho, Seon Joo Kim, Youngwoon Lee
Published	2025-05-13 17:59:22+00:00

Categories: cs.CV, cs.RO
