jarxiv | Japanese arxiv

Real-time Seafloor Segmentation and Mapping

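Posted: June 17, 2025 | Author: jarxiv

Summary

Posidonia oceanica is a seagrass species whose meadows depend heavily on rocks for their survival and conservation.
The species has declined worryingly worldwide in recent years, which underscores the critical need for efficient monitoring and assessment tools.
Deep learning-based semantic segmentation and automated visual monitoring systems have shown promise in many applications, but achieving good performance underwater remains difficult because of complex water conditions and limited datasets.
This paper introduces a framework that combines machine learning and computer vision techniques so that an autonomous underwater vehicle (AUV) can autonomously inspect the boundaries of Posidonia oceanica meadows.
The framework incorporates an image segmentation module built on an existing Mask R-CNN model, together with a strategy for tracking the meadow boundary.
In addition, a new class dedicated to rocks is added to the existing model, with the aim of supporting a comprehensive monitoring approach and a deeper understanding of the interactions between the meadow and its surroundings.
The image segmentation model is validated on real underwater images, while the overall inspection framework is evaluated in a realistic simulation environment that replicates actual monitoring scenarios using real underwater images.
The results show that the proposed framework enables the AUV to carry out the main tasks of underwater inspection and rock segmentation autonomously.
The work therefore has significant potential for the conservation and protection of marine environments: it provides insight into the status of Posidonia oceanica meadows and supports targeted preservation efforts.

Summary (original)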

Posidonia oceanica meadows are a species of seagrass highly dependent on rocks for their survival and conservation. In recent years, there has been a concerning global decline in this species, emphasizing the critical need for efficient monitoring and assessment tools. While deep learning-based semantic segmentation and visual automated monitoring systems have shown promise in a variety of applications, their performance in underwater environments remains challenging due to complex water conditions and limited datasets. This paper introduces a framework that combines machine learning and computer vision techniques to enable an autonomous underwater vehicle (AUV) to inspect the boundaries of Posidonia oceanica meadows autonomously. The framework incorporates an image segmentation module using an existing Mask R-CNN model and a strategy for Posidonia oceanica meadow boundary tracking. Furthermore, a new class dedicated to rocks is introduced to enhance the existing model, aiming to contribute to a comprehensive monitoring approach and provide a deeper understanding of the intricate interactions between the meadow and its surrounding environment. The image segmentation model is validated using real underwater images, while the overall inspection framework is evaluated in a realistic simulation environment, replicating actual monitoring scenarios with real underwater images. The results demonstrate that the proposed framework enables the AUV to autonomously accomplish the main tasks of underwater inspection and segmentation of rocks. Consequently, this work holds significant potential for the conservation and protection of marine environments, providing valuable insights into the status of Posidonia oceanica meadows and supporting targeted preservation efforts.
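The abstract does not describe how the meadow boundary is derived from the segmentation output. As a rough illustration only, the sketch below assumes a binary mask (1 = meadow, 0 = anything else) and marks a meadow cell as boundary when one of its four neighbours inside the image is not meadow. The function name, the mask, and the 4-neighbour rule are all assumptions made for this example, not the authors' tracking strategy.

```python
def boundary_pixels(mask):
    """Return (row, col) meadow cells that touch a non-meadow cell.

    Assumption for illustration: mask is a list of equal-length rows,
    with 1 for meadow and 0 for everything else. Neighbours outside
    the image are ignored, so the image edge is not a boundary.
    """
    rows, cols = len(mask), len(mask[0])
    boundary = []
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] != 1:
                continue
            # Check the four direct neighbours (up, down, left, right).
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and mask[nr][nc] != 1:
                    boundary.append((r, c))
                    break
    return boundary


# Hypothetical 5x5 mask with a 3x3 meadow patch in the middle.
mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
edge = boundary_pixels(mask)
```

In this toy mask only the centre cell is surrounded by meadow on all four sides, so every other meadow cell is reported as boundary.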

arXiv information

Authors	Michele Grimaldi, Nouf Alkaabi, Francesco Ruscio, Sebastian Realpe Rua, Rafael Garcia, Nuno Gracias
Published	2025-06-16 11:32:37+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google

Categories: cs.CV, cs.RO

Generative Representational Learning of Foundation Models for Recommendation

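Posted: June 17, 2025 | Author: jarxiv

Summary

Developing a single foundation model that excels across diverse tasks has been a long-standing goal in artificial intelligence.
As the wave of general-purpose foundation models sweeps across domains, its influence has extended to recommender systems.
Recent work has explored recommendation foundation models for various generative tasks, but it often overlooks crucial embedding tasks and struggles with the complexities of multi-task learning, including knowledge sharing and conflict resolution and inconsistent convergence speeds.
To address these limitations, we introduce RecFound, a generative representational learning framework for recommendation foundation models.
We construct the first comprehensive dataset for recommendation foundation models, covering both generative and embedding tasks across diverse scenarios.
Based on this dataset, we propose a novel multi-task training scheme with three components: a Task-wise Mixture of Low-rank Experts (TMoLE) to handle knowledge sharing and conflict, a Step-wise Convergence-oriented Sample Scheduler (S2Sched) to address inconsistent convergence, and a Model Merge module to balance performance across tasks.
Experiments show that RecFound achieves state-of-the-art performance across various recommendation tasks, outperforming existing baselines.

Summary (original)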

Developing a single foundation model with the capability to excel across diverse tasks has been a long-standing objective in the field of artificial intelligence. As the wave of general-purpose foundation models sweeps across various domains, their influence has significantly extended to the field of recommendation systems. While recent efforts have explored recommendation foundation models for various generative tasks, they often overlook crucial embedding tasks and struggle with the complexities of multi-task learning, including knowledge sharing & conflict resolution, and convergence speed inconsistencies. To address these limitations, we introduce RecFound, a generative representational learning framework for recommendation foundation models. We construct the first comprehensive dataset for recommendation foundation models covering both generative and embedding tasks across diverse scenarios. Based on this dataset, we propose a novel multi-task training scheme featuring a Task-wise Mixture of Low-rank Experts (TMoLE) to handle knowledge sharing & conflict, a Step-wise Convergence-oriented Sample Scheduler (S2Sched) to address inconsistent convergence, and a Model Merge module to balance the performance across tasks. Experiments demonstrate that RecFound achieves state-of-the-art performance across various recommendation tasks, outperforming existing baselines.

arXiv information

Authors	Zheli Zhou, Chenxu Zhu, Jianghao Lin, Bo Chen, Ruiming Tang, Weinan Zhang, Yong Yu
Published	2025-06-16 03:10:31+00:00
arXiv site	arxiv_id(pdf)

Categories: cs.CL, cs.IR

Specification and Evaluation of Multi-Agent LLM Systems — Prototype and Cybersecurity Applications

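Posted: June 17, 2025 | Author: jarxiv

Summary

Recent advances in LLMs point to new applications, for example through the reasoning capabilities of the latest OpenAI and DeepSeek models.
To apply these models in specific domains beyond text generation, LLM-based multi-agent approaches can solve complex tasks by combining reasoning techniques, code generation, and software execution.
Applications may draw on these capabilities and on the knowledge of specialized LLM agents.
However, although LLMs, reasoning techniques, and applications are often evaluated individually, their joint specification and combined application have not been explored well.
Defined specifications for multi-agent LLM systems are needed to explore their potential and their suitability for specific applications, and they would allow systematic evaluation of LLMs, reasoning techniques, and related aspects.
This paper reports the results of exploratory research on specifying and evaluating these aspects through a multi-agent system.
The system architecture and prototype are extended from previous research, and a specification for multi-agent systems is introduced.
Test cases involving cybersecurity tasks indicate that the architecture and the evaluation approach are feasible.
In particular, the results cover the evaluation of question answering, server security, and network security tasks that agents using LLMs from OpenAI and DeepSeek completed correctly.

Summary (original)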

Recent advancements in LLMs indicate potential for novel applications, e.g., through reasoning capabilities in the latest OpenAI and DeepSeek models. For applying these models in specific domains beyond text generation, LLM-based multi-agent approaches can be utilized that solve complex tasks by combining reasoning techniques, code generation, and software execution. Applications might utilize these capabilities and the knowledge of specialized LLM agents. However, while many evaluations are performed on LLMs, reasoning techniques, and applications individually, their joint specification and combined application is not explored well. Defined specifications for multi-agent LLM systems are required to explore their potential and their suitability for specific applications, allowing for systematic evaluations of LLMs, reasoning techniques, and related aspects. This paper reports the results of exploratory research to specify and evaluate these aspects through a multi-agent system. The system architecture and prototype are extended from previous research and a specification is introduced for multi-agent systems. Test cases involving cybersecurity tasks indicate feasibility of the architecture and evaluation approach. In particular, the results show the evaluation of question answering, server security, and network security tasks that were completed correctly by agents with LLMs from OpenAI and DeepSeek.

arXiv information

Authors	Felix Härer
Published	2025-06-16 05:03:49+00:00
arXiv site	arxiv_id(pdf)

Categories: 68T01, cs.AI, cs.CR, I.2.1

Evaluating Sensitivity Parameters in Smartphone-Based Gaze Estimation: A Comparative Study of Appearance-Based and Infrared Eye Trackers

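Posted: June 17, 2025 | Author: jarxiv

Summary

This study evaluates a smartphone-based deep learning eye-tracking algorithm by comparing its performance with that of a commercial infrared eye tracker, the Tobii Pro Nano.
The aim is to investigate the feasibility of appearance-based gaze estimation under realistic mobile usage conditions.
Key sensitivity factors, including age, gender, vision correction, lighting conditions, device type, and head position, were analysed systematically.
The appearance-based algorithm combines a lightweight convolutional neural network (MobileNet-V3) with a recurrent structure (Long Short-Term Memory) to predict gaze coordinates from grayscale facial images.
Gaze data were collected from 51 participants using dynamic visual stimuli, and accuracy was measured as Euclidean distance.
The deep learning model produced a mean error of 17.76 mm, compared with 16.53 mm for the Tobii Pro Nano.
Although the overall difference in accuracy was small, the deep learning-based method was more sensitive to factors such as lighting, vision correction, and age, with higher failure rates observed under low-light conditions among participants using glasses and in older age groups.
Device-specific and positional factors also affected tracking performance.
These results highlight the potential of appearance-based approaches for mobile eye tracking and offer a reference framework for evaluating gaze estimation systems across varied usage conditions.

Summary (original)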

This study evaluates a smartphone-based, deep-learning eye-tracking algorithm by comparing its performance against a commercial infrared-based eye tracker, the Tobii Pro Nano. The aim is to investigate the feasibility of appearance-based gaze estimation under realistic mobile usage conditions. Key sensitivity factors, including age, gender, vision correction, lighting conditions, device type, and head position, were systematically analysed. The appearance-based algorithm integrates a lightweight convolutional neural network (MobileNet-V3) with a recurrent structure (Long Short-Term Memory) to predict gaze coordinates from grayscale facial images. Gaze data were collected from 51 participants using dynamic visual stimuli, and accuracy was measured using Euclidean distance. The deep learning model produced a mean error of 17.76 mm, compared to 16.53 mm for the Tobii Pro Nano. While overall accuracy differences were small, the deep learning-based method was more sensitive to factors such as lighting, vision correction, and age, with higher failure rates observed under low-light conditions among participants using glasses and in older age groups. Device-specific and positional factors also influenced tracking performance. These results highlight the potential of appearance-based approaches for mobile eye tracking and offer a reference framework for evaluating gaze estimation systems across varied usage conditions.
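The accuracy measure named in the abstract, Euclidean distance between estimated and reference gaze points, can be sketched in a few lines. This is a minimal illustration that assumes both sets of points are (x, y) screen coordinates already expressed in millimetres; the function name and the sample points are made up for the example and are not data from the study.

```python
import math


def mean_euclidean_error_mm(predicted, actual):
    """Mean Euclidean distance between paired (x, y) points.

    The result is in the same unit as the inputs (assumed here: mm).
    """
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    distances = [math.dist(p, a) for p, a in zip(predicted, actual)]
    return sum(distances) / len(distances)


# Hypothetical gaze estimates and stimulus positions in millimetres.
predicted = [(3.0, 4.0), (10.0, 10.0), (0.0, 0.0)]
actual = [(0.0, 0.0), (10.0, 16.0), (0.0, 5.0)]
error = mean_euclidean_error_mm(predicted, actual)
```

For these three pairs the individual distances are 5, 6, and 5 mm, so the mean error is 16/3 mm. The figures of 17.76 mm and 16.53 mm in the abstract are means of this kind, although how the study aggregated per-participant results is not stated here.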

arXiv information

Authors	Nishan Gunawardena, Gough Yumu Lui, Bahman Javadi, Jeewani Anupama Ginige
Published	2025-06-16 05:38:38+00:00
arXiv site	arxiv_id(pdf)

Categories: cs.CV, cs.HC

VGR: Visual Grounded Reasoning

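Posted: June 17, 2025 | Author: jarxiv

Summary

In multimodal chain-of-thought (CoT) reasoning, existing approaches rely predominantly on reasoning in pure language space, which inherently suffers from language bias and is largely confined to math or science domains.
This narrow focus limits their ability to handle complex visual reasoning tasks that require a comprehensive understanding of image details.
To address these limitations, this paper introduces VGR, a novel reasoning multimodal large language model (MLLM) with enhanced fine-grained visual perception.
Unlike traditional MLLMs, which answer questions or reason solely in language space, VGR first detects relevant regions that may help solve the problem and then gives a precise answer based on the replayed image regions.
To achieve this, we build a large-scale SFT dataset called VGR-SFT, which contains reasoning data that mixes visual grounding and language deduction.
The VGR inference pipeline lets the model choose bounding boxes for visual reference, and a replay stage integrates the corresponding regions into the reasoning process, enhancing multimodal comprehension.
Experiments on the LLaVA-NeXT-7B baseline show that VGR achieves superior performance on multimodal benchmarks that require comprehensive understanding of image details.
Compared with the baseline, VGR uses only 30% of the image token count while improving scores by +4.1 on MMStar, +7.1 on AI2D, and +12.9 on ChartQA.

Summary (original)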

In the field of multimodal chain-of-thought (CoT) reasoning, existing approaches predominantly rely on reasoning on pure language space, which inherently suffers from language bias and is largely confined to math or science domains. This narrow focus limits their ability to handle complex visual reasoning tasks that demand comprehensive understanding of image details. To address these limitations, this paper introduces VGR, a novel reasoning multimodal large language model (MLLM) with enhanced fine-grained visual perception capabilities. Unlike traditional MLLMs that answer the question or reasoning solely on the language space, our VGR first detects relevant regions that may help to solve problems, and then provides precise answers based on replayed image regions. To achieve this, we construct a large-scale SFT dataset called VGR-SFT that contains reasoning data with mixed vision grounding and language deduction. The inference pipeline of VGR allows the model to choose bounding boxes for visual reference and a replay stage is introduced to integrate the corresponding regions into the reasoning process, enhancing multimodal comprehension. Experiments on the LLaVA-NeXT-7B baseline show that VGR achieves superior performance on multi-modal benchmarks requiring comprehensive image detail understanding. Compared to the baseline, VGR uses only 30% of the image token count while delivering scores of +4.1 on MMStar, +7.1 on AI2D, and a +12.9 improvement on ChartQA.

arXiv information

Authors	Jiacong Wang, Zijian Kang, Haochen Wang, Haiyong Jiang, Jiawen Li, Bohong Wu, Ya Wang, Jiao Ran, Xiao Liang, Chao Feng, Jun Xiao
Published	2025-06-16 07:35:52+00:00
arXiv site	arxiv_id(pdf)

Categories: cs.AI, cs.CL, cs.CV

AgentSense: Virtual Sensor Data Generation Using LLM Agents in Simulated Home Environments

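Posted: June 17, 2025 | Author: jarxiv

Summary

A major obstacle to developing robust and generalizable smart home-based Human Activity Recognition (HAR) systems is the lack of large-scale, diverse labeled datasets.
Variability in home layouts, sensor configurations, and user behavior adds further complexity, because individuals follow varied routines and perform activities in distinct ways.
Building HAR systems that generalize well requires training data that captures the diversity across users and environments.
To address these challenges, we introduce AgentSense, a virtual data generation pipeline in which diverse personas are generated with Large Language Models.
These personas are used to create daily routines, which are then decomposed into low-level action sequences.
The actions are then executed in a simulated home environment called VirtualHome, which we extended with virtual ambient sensors that record the agents' activities as they unfold.
Overall, AgentSense enables the generation of rich virtual sensor datasets that represent a wide range of users and home settings.
Across five benchmark HAR datasets, we show that using our virtual sensor data substantially improves performance, particularly when real data are limited.
Notably, models trained on a combination of virtual data and just a few days of real data achieve performance comparable to models trained on the entire real datasets.
These results demonstrate the potential of virtual data to address one of the most pressing challenges in ambient sensing, the lack of large-scale annotated datasets, without requiring any manual data collection.

Summary (original)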

A major obstacle in developing robust and generalizable smart home-based Human Activity Recognition (HAR) systems is the lack of large-scale, diverse labeled datasets. Variability in home layouts, sensor configurations, and user behavior adds further complexity, as individuals follow varied routines and perform activities in distinct ways. Building HAR systems that generalize well requires training data that captures the diversity across users and environments. To address these challenges, we introduce AgentSense, a virtual data generation pipeline where diverse personas are generated by leveraging Large Language Models. These personas are used to create daily routines, which are then decomposed into low-level action sequences. Subsequently, the actions are executed in a simulated home environment called VirtualHome that we extended with virtual ambient sensors capable of recording the agents' activities as they unfold. Overall, AgentSense enables the generation of rich, virtual sensor datasets that represent a wide range of users and home settings. Across five benchmark HAR datasets, we show that leveraging our virtual sensor data substantially improves performance, particularly when real data are limited. Notably, models trained on a combination of virtual data and just a few days of real data achieve performance comparable to those trained on the entire real datasets. These results demonstrate and prove the potential of virtual data to address one of the most pressing challenges in ambient sensing, which is the distinct lack of large-scale, annotated datasets without requiring any manual data collection efforts.

arXiv information

Authors	Zikang Leng, Megha Thukral, Yaqi Liu, Hrudhai Rajasekhar, Shruthi K. Hiremath, Thomas Plötz
Published	2025-06-16 01:17:53+00:00
arXiv site	arxiv_id(pdf)

Categories: cs.CV, cs.HC

Learning Multimodal Latent Dynamics for Human-Robot Interaction

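Posted: June 16, 2025 | Author: jarxiv

Summary

This article presents a method for learning well-coordinated Human-Robot Interaction (HRI) from Human-Human Interactions (HHI).
We devise a hybrid approach that uses Hidden Markov Models (HMMs) as the latent space priors of a Variational Autoencoder to model a joint distribution over the interacting agents.
We leverage the interaction dynamics learned from HHI to learn HRI and incorporate the conditional generation of robot motions from human observations into training, which yields more accurate predicted robot trajectories.
The generated robot motions are further adapted with Inverse Kinematics to ensure the desired physical proximity to the human, combining the ease of joint-space learning with accurate task-space reachability.
For contact-rich interactions, we modulate the robot's stiffness using HMM segmentation to obtain a compliant interaction.
We verify the effectiveness of our approach, deployed on a humanoid robot, in a user study.
Our method generalizes well to different humans despite being trained on data from just two people.
We find that users perceive our method as more human-like, timely, and accurate, and rank it above the other baselines in preference.
We additionally show that our approach can generate successful interactions in the more complex scenario of bimanual robot-to-human handovers.

Summary (original)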

This article presents a method for learning well-coordinated Human-Robot Interaction (HRI) from Human-Human Interactions (HHI). We devise a hybrid approach using Hidden Markov Models (HMMs) as the latent space priors for a Variational Autoencoder to model a joint distribution over the interacting agents. We leverage the interaction dynamics learned from HHI to learn HRI and incorporate the conditional generation of robot motions from human observations into the training, thereby predicting more accurate robot trajectories. The generated robot motions are further adapted with Inverse Kinematics to ensure the desired physical proximity with a human, combining the ease of joint space learning and accurate task space reachability. For contact-rich interactions, we modulate the robot’s stiffness using HMM segmentation for a compliant interaction. We verify the effectiveness of our approach deployed on a Humanoid robot via a user study. Our method generalizes well to various humans despite being trained on data from just two humans. We find that users perceive our method as more human-like, timely, and accurate and rank our method with a higher degree of preference over other baselines. We additionally show the ability of our approach to generate successful interactions in a more complex scenario of Bimanual Robot-to-Human Handovers.

arXiv information

Authors	Vignesh Prasad, Lea Heitlinger, Dorothea Koert, Ruth Stock-Homburg, Jan Peters, Georgia Chalvatzaki
Published	2025-06-12 18:59:57+00:00
arXiv site	arxiv_id(pdf)

Categories: cs.HC, cs.LG, cs.RO

Poutine: Vision-Language-Trajectory Pre-Training and Reinforcement Learning Post-Training Enable Robust End-to-End Autonomous Driving

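Posted: June 16, 2025 | Author: jarxiv

Summary

We present Poutine, a 3B-parameter vision-language model (VLM) tailored for end-to-end autonomous driving in long-tail driving scenarios.
Poutine is trained in two stages.
To obtain strong base driving capabilities, we train Poutine-Base with self-supervised vision-language-trajectory (VLT) next-token prediction on 83 hours of CoVLA nominal driving and 11 hours of Waymo long-tail driving.
The accompanying language annotations are auto-generated with a 72B-parameter VLM.
Poutine is obtained by fine-tuning Poutine-Base with Group Relative Policy Optimization (GRPO) using fewer than 500 preference-labeled frames from the Waymo validation set.
We show that both VLT pretraining and RL fine-tuning are critical for strong driving performance in the long tail.
Poutine-Base achieves a rater-feedback score (RFS) of 8.12 on the validation set, nearly matching Waymo's expert ground-truth RFS.
The final Poutine model achieves an RFS of 7.99 on the official Waymo test set, placing 1st in the 2025 Waymo Vision-Based End-to-End Driving Challenge by a significant margin.
These results highlight the promise of scalable VLT pretraining and lightweight RL fine-tuning for robust and generalizable autonomy.

Summary (original)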

We present Poutine, a 3B-parameter vision-language model (VLM) tailored for end-to-end autonomous driving in long-tail driving scenarios. Poutine is trained in two stages. To obtain strong base driving capabilities, we train Poutine-Base in a self-supervised vision-language-trajectory (VLT) next-token prediction fashion on 83 hours of CoVLA nominal driving and 11 hours of Waymo long-tail driving. Accompanying language annotations are auto-generated with a 72B-parameter VLM. Poutine is obtained by fine-tuning Poutine-Base with Group Relative Policy Optimization (GRPO) using less than 500 preference-labeled frames from the Waymo validation set. We show that both VLT pretraining and RL fine-tuning are critical to attain strong driving performance in the long-tail. Poutine-Base achieves a rater-feedback score (RFS) of 8.12 on the validation set, nearly matching Waymo’s expert ground-truth RFS. The final Poutine model achieves an RFS of 7.99 on the official Waymo test set, placing 1st in the 2025 Waymo Vision-Based End-to-End Driving Challenge by a significant margin. These results highlight the promise of scalable VLT pre-training and lightweight RL fine-tuning to enable robust and generalizable autonomy.
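The abstract names GRPO without giving its details. As a general illustration of the group-relative idea, the sketch below scores each sampled output against the other outputs in its group by subtracting the group mean reward and dividing by the group standard deviation. The function name, the reward values, the `eps` term, and the choice of population standard deviation are assumptions made for this example; this is not Poutine's training code, and the abstract does not say how rewards are derived from the preference-labeled frames.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled outputs.

    Each advantage is (reward - group mean) / (group std + eps).
    eps (an assumed value) avoids division by zero when all rewards
    in the group are equal.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std, an assumption
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four trajectories sampled for one frame.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Outputs that beat the group average receive a positive advantage and the rest a negative one, so no separate value network is needed to provide a baseline.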

arXiv information

Authors	Luke Rowe, Rodrigue de Schaetzen, Roger Girgis, Christopher Pal, Liam Paull
Published	2025-06-12 19:14:00+00:00
arXiv site	arxiv_id(pdf)

Categories: cs.CV, cs.RO

Gondola: Grounded Vision Language Planning for Generalizable Robotic Manipulation

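Posted: June 16, 2025 | Author: jarxiv

Summary

Robotic manipulation faces a significant challenge in generalizing across unseen objects, environments, and tasks specified by diverse language instructions.
To improve generalization, recent research has incorporated large language models (LLMs) for planning and action execution.
While promising, these methods often fall short in generating plans that are grounded in the visual environment.
Efforts have been made to perform visual instruction tuning on LLMs for robotic manipulation, but existing methods are typically constrained by single-view image input and struggle with precise object grounding.
In this work, we introduce Gondola, a novel grounded vision-language planning model based on LLMs for generalizable robotic manipulation.
Gondola takes multi-view images and history plans as input and produces the next action plan with interleaved text and segmentation masks of target objects and locations.
To support the training of Gondola, we construct three types of datasets using the RLBench simulator: robot grounded planning, multi-view referring expression, and pseudo long-horizon task datasets.
Gondola outperforms the state-of-the-art LLM-based method across all four generalization levels of the GemBench dataset: novel placements, rigid objects, articulated objects, and long-horizon tasks.

Summary (original)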

Robotic manipulation faces a significant challenge in generalizing across unseen objects, environments and tasks specified by diverse language instructions. To improve generalization capabilities, recent research has incorporated large language models (LLMs) for planning and action execution. While promising, these methods often fall short in generating grounded plans in visual environments. Although efforts have been made to perform visual instructional tuning on LLMs for robotic manipulation, existing methods are typically constrained by single-view image input and struggle with precise object grounding. In this work, we introduce Gondola, a novel grounded vision-language planning model based on LLMs for generalizable robotic manipulation. Gondola takes multi-view images and history plans to produce the next action plan with interleaved texts and segmentation masks of target objects and locations. To support the training of Gondola, we construct three types of datasets using the RLBench simulator, namely robot grounded planning, multi-view referring expression and pseudo long-horizon task datasets. Gondola outperforms the state-of-the-art LLM-based method across all four generalization levels of the GemBench dataset, including novel placements, rigid objects, articulated objects and long-horizon tasks.

arXiv information

Authors	Shizhe Chen, Ricardo Garcia, Paul Pacaud, Cordelia Schmid
Published	2025-06-12 20:04:31+00:00
arXiv site	arxiv_id(pdf)

Categories: cs.AI, cs.CV, cs.RO

Demonstration Sidetracks: Categorizing Systematic Non-Optimality in Human Demonstrations

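Posted: June 16, 2025 | Author: jarxiv

Summary

Learning from Demonstration (LfD) is a popular approach for robots to acquire new skills, but most LfD methods suffer from imperfections in human demonstrations.
Prior work typically treats these suboptimalities as random noise.
In this paper, we study non-optimal behaviors in non-expert demonstrations and show that they are systematic, forming what we call demonstration sidetracks.
Using a public space study in which 40 participants performed a long-horizon robot task, we recreated the setup in simulation and annotated all demonstrations.
We identify four types of sidetracks (Exploration, Mistake, Alignment, Pause) and one control pattern (one-dimension control).
Sidetracks appear frequently across participants, and their temporal and spatial distribution is tied to task context.
We also find that users' control patterns depend on the control interface.
These insights point to the need for better models of suboptimal demonstrations in order to improve LfD algorithms and bridge the gap between lab training and real-world deployment.
All demonstrations, infrastructure, and annotations are available at https://github.com/AABL-Lab/Human-Demonstration-Sidetracks.

Summary (original)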

Learning from Demonstration (LfD) is a popular approach for robots to acquire new skills, but most LfD methods suffer from imperfections in human demonstrations. Prior work typically treats these suboptimalities as random noise. In this paper we study non-optimal behaviors in non-expert demonstrations and show that they are systematic, forming what we call demonstration sidetracks. Using a public space study with 40 participants performing a long-horizon robot task, we recreated the setup in simulation and annotated all demonstrations. We identify four types of sidetracks (Exploration, Mistake, Alignment, Pause) and one control pattern (one-dimension control). Sidetracks appear frequently across participants, and their temporal and spatial distribution is tied to task context. We also find that users’ control patterns depend on the control interface. These insights point to the need for better models of suboptimal demonstrations to improve LfD algorithms and bridge the gap between lab training and real-world deployment. All demonstrations, infrastructure, and annotations are available at https://github.com/AABL-Lab/Human-Demonstration-Sidetracks.

arXiv information

Authors	Shijie Fang, Hang Yu, Qidi Fang, Reuben M. Aronson, Elaine S. Short
Published	2025-06-12 20:04:55+00:00
arXiv site	arxiv_id(pdf)

Categories: cs.LG, cs.RO
