jarxiv | Japanese arxiv | Page 1069

Autonomous Human-Robot Interaction via Operator Imitation

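Posted: April 4, 2025 | Author: jarxiv

Summary

Teleoperated robot characters can interact expressively with humans by drawing on the operator's experience and social intuition. In this work, we propose creating autonomous interactive robots by training a model to imitate operator data. Our model is trained on a dataset of human-robot interactions in which an expert operator is asked to vary the robot's interactions and mood while the operator's commands and the poses of the human and robot are recorded. Our approach learns to predict continuous operator commands through a diffusion process and discrete commands through a classifier, both unified in a single transformer architecture. We evaluate the resulting model in simulation and in a user study on the real system. We show that the method enables simple autonomous human-robot interactions comparable to an expert-operator baseline, and that users can recognize the different robot moods the model generates. Finally, we demonstrate zero-shot transfer of the model to a different robot platform with the same operator interface.

Summary (original)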

Teleoperated robotic characters can perform expressive interactions with humans, relying on the operators’ experience and social intuition. In this work, we propose to create autonomous interactive robots, by training a model to imitate operator data. Our model is trained on a dataset of human-robot interactions, where an expert operator is asked to vary the interactions and mood of the robot, while the operator commands as well as the pose of the human and robot are recorded. Our approach learns to predict continuous operator commands through a diffusion process and discrete commands through a classifier, all unified within a single transformer architecture. We evaluate the resulting model in simulation and with a user study on the real system. We show that our method enables simple autonomous human-robot interactions that are comparable to the expert-operator baseline, and that users can recognize the different robot moods as generated by our model. Finally, we demonstrate a zero-shot transfer of our model onto a different robotic platform with the same operator interface.

arXiv information

Authors	Sammy Christen, David Müller, Agon Serifi, Ruben Grandia, Georg Wiedebach, Michael A. Hopkins, Espen Knoop, Moritz Bächer
Published	2025-04-03 16:06:44+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, DeepL

Categories: cs.AI, cs.RO

MI-HGNN: Morphology-Informed Heterogeneous Graph Neural Network for Legged Robot Contact Perception

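Posted: April 4, 2025 | Author: jarxiv

Summary

We propose a Morphology-Informed Heterogeneous Graph Neural Network (MI-HGNN) for learning-based contact perception. The architecture and connectivity of MI-HGNN are built from the robot's morphology, with nodes and edges corresponding to the robot's joints and links, respectively. Incorporating morphology-informed constraints into the neural network improves a learning-based approach with model-based knowledge. We apply MI-HGNN to two contact perception problems and run extensive experiments on both real-world and simulated data collected with two quadruped robots. The experiments demonstrate the advantage of the method in effectiveness, generalization ability, model efficiency, and sample efficiency. MI-HGNN improves on the performance of a state-of-the-art model that exploits the robot's morphological symmetry by 8.4% while using only 0.21% of its parameters. Although this work applies MI-HGNN to contact perception for legged robots, it can be applied seamlessly to other types of multi-body dynamical systems and may improve other robot learning frameworks. Our code is publicly available at https://github.com/lunarlab-gatech/Morphology-Informed-HGNN.

Summary (original)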

We present a Morphology-Informed Heterogeneous Graph Neural Network (MI-HGNN) for learning-based contact perception. The architecture and connectivity of the MI-HGNN are constructed from the robot morphology, in which nodes and edges are robot joints and links, respectively. By incorporating the morphology-informed constraints into a neural network, we improve a learning-based approach using model-based knowledge. We apply the proposed MI-HGNN to two contact perception problems, and conduct extensive experiments using both real-world and simulated data collected using two quadruped robots. Our experiments demonstrate the superiority of our method in terms of effectiveness, generalization ability, model efficiency, and sample efficiency. Our MI-HGNN improved the performance of a state-of-the-art model that leverages robot morphological symmetry by 8.4% with only 0.21% of its parameters. Although MI-HGNN is applied to contact perception problems for legged robots in this work, it can be seamlessly applied to other types of multi-body dynamical systems and has the potential to improve other robot learning frameworks. Our code is made publicly available at https://github.com/lunarlab-gatech/Morphology-Informed-HGNN.
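To make the "nodes are joints, edges are links" construction concrete, here is a minimal sketch of building a typed graph from a toy quadruped. It is not the authors' implementation: the class `MorphologyGraph`, the leg names, the two-joints-per-leg layout, and the extra `base` and `foot` node types are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class MorphologyGraph:
    """Toy heterogeneous graph: typed nodes, edges grouped by (source type, target type)."""
    node_types: dict = field(default_factory=dict)  # node name -> node type
    edges: dict = field(default_factory=dict)       # (src type, dst type) -> [(src, dst)]

    def add_node(self, name, node_type):
        self.node_types[name] = node_type

    def add_link(self, a, b):
        """A physical link connects two nodes; store it as two directed edges."""
        for src, dst in ((a, b), (b, a)):
            key = (self.node_types[src], self.node_types[dst])
            self.edges.setdefault(key, []).append((src, dst))


LEGS = ("FL", "FR", "RL", "RR")  # hypothetical leg names


def build_quadruped_graph():
    """Assumed layout: one base, and per leg a hip joint, a knee joint, and a foot."""
    g = MorphologyGraph()
    g.add_node("base", "base")
    for leg in LEGS:
        hip, knee, foot = f"{leg}_hip", f"{leg}_knee", f"{leg}_foot"
        g.add_node(hip, "joint")
        g.add_node(knee, "joint")
        g.add_node(foot, "foot")
        g.add_link("base", hip)  # link between trunk and hip joint
        g.add_link(hip, knee)    # upper-leg link
        g.add_link(knee, foot)   # lower-leg link
    return g


graph = build_quadruped_graph()
```

Grouping edges by the types of their endpoints is what makes the graph heterogeneous: a message-passing layer could then use separate weights per edge type, so the network's structure mirrors the kinematic tree rather than being a generic fully connected model.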

arXiv information

Authors	Daniel Butterfield, Sandilya Sai Garimella, Nai-Jen Cheng, Lu Gan
Published	2025-04-03 16:23:01+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, DeepL

Categories: cs.RO, I.2.6

Robot-Led Vision Language Model Wellbeing Assessment of Children

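Posted: April 4, 2025 | Author: jarxiv

Summary

This study presents a new robot-led approach to assessing children's mental wellbeing with a Vision Language Model (VLM). Inspired by the Child Apperception Test (CAT), the social robot NAO showed children pictorial stimuli and elicited their verbal narratives about the images, which a VLM then evaluated according to CAT assessment guidelines. The VLM's assessments were systematically compared with those of a trained psychologist. The results show that the VLM is moderately reliable at identifying cases with no wellbeing concerns, but its ability to correctly classify assessments with clinical concern remains limited. In addition, the model's performance was broadly consistent when it was prompted with varying demographic factors such as age and gender, yet a significantly higher false positive rate was observed for girls, suggesting possible sensitivity to the gender attribute. These findings highlight both the promise and the challenges of integrating VLMs into robot-led assessments of children's wellbeing.

Summary (original)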

This study presents a novel robot-led approach to assessing children’s mental wellbeing using a Vision Language Model (VLM). Inspired by the Child Apperception Test (CAT), the social robot NAO presented children with pictorial stimuli to elicit their verbal narratives of the images, which were then evaluated by a VLM in accordance with CAT assessment guidelines. The VLM’s assessments were systematically compared to those provided by a trained psychologist. The results reveal that while the VLM demonstrates moderate reliability in identifying cases with no wellbeing concerns, its ability to accurately classify assessments with clinical concern remains limited. Moreover, although the model’s performance was generally consistent when prompted with varying demographic factors such as age and gender, a significantly higher false positive rate was observed for girls, indicating potential sensitivity to gender attribute. These findings highlight both the promise and the challenges of integrating VLMs into robot-led assessments of children’s wellbeing.
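The per-group false positive rate mentioned above is the fraction of children without a concern (according to the reference label) whom the model nevertheless flags. The sketch below shows that calculation only; the function name, the record layout, and the toy records are invented for illustration and are not data or results from the study.

```python
from collections import defaultdict


def false_positive_rate_by_group(records):
    """records: iterable of (group, reference_concern, predicted_concern) with bool labels.

    Returns {group: FP / (FP + TN)}, computed over reference-negative cases only.
    """
    fp = defaultdict(int)
    tn = defaultdict(int)
    for group, reference, predicted in records:
        if reference:
            continue  # reference-positive cases do not enter the false positive rate
        if predicted:
            fp[group] += 1
        else:
            tn[group] += 1
    groups = set(fp) | set(tn)
    return {g: fp[g] / (fp[g] + tn[g]) for g in groups}


# Made-up toy records, purely to exercise the function.
toy_records = [
    ("group_a", False, True), ("group_a", False, True),
    ("group_a", False, False), ("group_a", False, False),
    ("group_a", True, True),
    ("group_b", False, True), ("group_b", False, False),
    ("group_b", False, False), ("group_b", False, False),
    ("group_b", True, False),
]
rates = false_positive_rate_by_group(toy_records)
```

Comparing these rates across groups is one simple way to surface the kind of attribute sensitivity the abstract reports; whether a difference is significant would need a statistical test, which this sketch does not perform.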

arXiv information

Authors	Nida Itrat Abbasi, Fethiye Irmak Dogan, Guy Laban, Joanna Anderson, Tamsin Ford, Peter B. Jones, Hatice Gunes
Published	2025-04-03 17:02:09+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, DeepL

Categories: cs.RO

Online Hybrid-Belief POMDP with Coupled Semantic-Geometric Models and Semantic Safety Awareness

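Posted: April 4, 2025 | Author: jarxiv

Summary

Robots operating in complex, unknown environments frequently need a geometric-semantic representation of the environment to perform their tasks safely. While inferring the environment, they must account for many possible scenarios when planning future actions. Because object class types are discrete while the robot's own pose and the objects' poses are continuous, the environment can be represented by a hybrid discrete-continuous belief that is updated according to models and incoming data. Prior probabilities and observation models representing the environment can be learned from data with deep learning algorithms. Such models often couple the semantic and geometric properties of the environment. As a result, the semantic variables become interconnected and the dimensionality of the semantic state space grows exponentially. This paper considers planning under uncertainty using partially observable Markov decision processes (POMDPs) with hybrid semantic-geometric beliefs. The models and priors account for the coupling between semantic and geometric variables. Within the POMDP, we introduce the concept of semantically aware safety. Obtaining representative samples of the theoretical hybrid belief, which are needed to estimate the value function, is very difficult. As a key contribution, we develop a new form of the hybrid belief and use it to draw representative samples. We show that, under certain conditions, the value function and the probability of safety can be computed efficiently with an explicit expectation over all possible semantic mappings. Our simulations show that our estimators of the objective function and the probability of safety reach accuracy similar to that of estimators that run exhaustively over the entire semantic state space using samples from the theoretical hybrid belief. Nevertheless, the complexity of our estimators is polynomial rather than exponential.

Summary (original)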

Robots operating in complex and unknown environments frequently require geometric-semantic representations of the environment to safely perform their tasks. While inferring the environment, they must account for many possible scenarios when planning future actions. Since objects’ class types are discrete and the robot’s self-pose and the objects’ poses are continuous, the environment can be represented by a hybrid discrete-continuous belief which is updated according to models and incoming data. Prior probabilities and observation models representing the environment can be learned from data using deep learning algorithms. Such models often couple environmental semantic and geometric properties. As a result, semantic variables are interconnected, causing semantic state space dimensionality to increase exponentially. In this paper, we consider planning under uncertainty using partially observable Markov decision processes (POMDPs) with hybrid semantic-geometric beliefs. The models and priors consider the coupling between semantic and geometric variables. Within POMDP, we introduce the concept of semantically aware safety. Obtaining representative samples of the theoretical hybrid belief, required for estimating the value function, is very challenging. As a key contribution, we develop a novel form of the hybrid belief and leverage it to sample representative samples. We show that under certain conditions, the value function and probability of safety can be calculated efficiently with an explicit expectation over all possible semantic mappings. Our simulations show that our estimates of the objective function and probability of safety achieve similar levels of accuracy compared to estimators that run exhaustively on the entire semantic state-space using samples from the theoretical hybrid belief. Nevertheless, the complexity of our estimators is polynomial rather than exponential.
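The gap between exhaustive and explicit expectations can be illustrated with a toy case. The abstract does not state the paper's conditions, so this sketch makes its own strong assumptions: object classes are independent given a fixed geometric sample, the reward is a sum of per-object terms, and safety is a product of per-object terms. All function names and numbers are hypothetical, and this is not the authors' estimator.

```python
import itertools
import math


def expected_reward_bruteforce(p, r):
    """Enumerate every class assignment: cost grows as C**n for n objects, C classes."""
    total = 0.0
    for assignment in itertools.product(*(range(len(pi)) for pi in p)):
        prob = math.prod(p[i][c] for i, c in enumerate(assignment))
        total += prob * sum(r[i][c] for i, c in enumerate(assignment))
    return total


def expected_reward_factorized(p, r):
    """Linearity of expectation: one pass over n * C terms."""
    return sum(pc * rc for pi, ri in zip(p, r) for pc, rc in zip(pi, ri))


def safety_bruteforce(p, s):
    """P(all objects safe), summing over every class assignment."""
    total = 0.0
    for assignment in itertools.product(*(range(len(pi)) for pi in p)):
        prob = math.prod(p[i][c] for i, c in enumerate(assignment))
        total += prob * math.prod(s[i][c] for i, c in enumerate(assignment))
    return total


def safety_factorized(p, s):
    """With independent objects the sum over assignments factorizes into a product."""
    return math.prod(sum(pc * sc for pc, sc in zip(pi, si)) for pi, si in zip(p, s))


# Hypothetical numbers: 3 objects, 2 classes each.
p = [[0.7, 0.3], [0.5, 0.5], [0.1, 0.9]]      # class probabilities per object
r = [[1.0, -2.0], [0.0, 3.0], [2.0, 2.0]]     # reward per object and class
s = [[1.0, 0.5], [0.9, 0.2], [1.0, 0.8]]      # P(safe | class) per object
```

Both pairs of functions return the same values, but the brute-force versions visit every joint assignment while the factorized versions touch each object-class pair once. The coupled models described in the abstract are harder than this independent toy case, which is why the paper's result requires its own conditions.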

arXiv information

Authors	Tuvy Lemberg, Vadim Indelman
Published	2025-04-03 17:14:51+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, DeepL

Categories: cs.RO, none

BT-ACTION: A Test-Driven Approach for Modular Understanding of User Instruction Leveraging Behaviour Trees and LLMs

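Posted: April 4, 2025 | Author: jarxiv

Summary

Natural language instructions are often abstract and complex, so a robot may have to carry out several subtasks even for a seemingly simple query. For example, when a user asks a robot to prepare avocado toast, the task involves several sequential steps. Such instructions can also be ambiguous or infeasible for the robot, or go beyond the robot's existing knowledge. Large Language Models (LLMs) provide strong language reasoning capabilities for handling these challenges, but integrating them effectively into robotic systems remains a key challenge. To address this, we propose BT-ACTION, a test-driven approach that combines the modular structure of Behavior Trees (BT) with LLMs to generate coherent sequences of robot actions for following complex user instructions, specifically in the context of preparing recipes in a kitchen-assistance setting. We evaluated BT-ACTION in a comprehensive user study with 45 participants and compared its performance with direct LLM prompting. The results show that BT-ACTION's modular design helped the robot make fewer mistakes and increased user trust, and that participants significantly preferred the robot using BT-ACTION. The code is publicly available at https://github.com/1Eggbert7/BT_LLM.

Summary (original)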

Natural language instructions are often abstract and complex, requiring robots to execute multiple subtasks even for seemingly simple queries. For example, when a user asks a robot to prepare avocado toast, the task involves several sequential steps. Moreover, such instructions can be ambiguous or infeasible for the robot or may exceed the robot’s existing knowledge. While Large Language Models (LLMs) offer strong language reasoning capabilities to handle these challenges, effectively integrating them into robotic systems remains a key challenge. To address this, we propose BT-ACTION, a test-driven approach that combines the modular structure of Behavior Trees (BT) with LLMs to generate coherent sequences of robot actions for following complex user instructions, specifically in the context of preparing recipes in a kitchen-assistance setting. We evaluated BT-ACTION in a comprehensive user study with 45 participants, comparing its performance to direct LLM prompting. Results demonstrate that the modular design of BT-ACTION helped the robot make fewer mistakes and increased user trust, and participants showed a significant preference for the robot leveraging BT-ACTION. The code is publicly available at https://github.com/1Eggbert7/BT_LLM.
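A behaviour tree makes the "check first, then act" structure explicit. The sketch below is a generic minimal behaviour tree, not BT-ACTION itself: the node classes, the `KNOWN_RECIPES` table, and the keyword match standing in for an LLM call are all hypothetical.

```python
from enum import Enum


class Status(Enum):
    SUCCESS = 1
    FAILURE = 2


class Leaf:
    """Wraps a function that reads/writes a shared blackboard dict and returns a bool."""
    def __init__(self, fn):
        self.fn = fn

    def tick(self, blackboard):
        return Status.SUCCESS if self.fn(blackboard) else Status.FAILURE


class Sequence:
    """Succeeds only if every child succeeds, in order."""
    def __init__(self, *children):
        self.children = children

    def tick(self, blackboard):
        for child in self.children:
            if child.tick(blackboard) is Status.FAILURE:
                return Status.FAILURE
        return Status.SUCCESS


class Fallback:
    """Tries children in order and succeeds at the first one that succeeds."""
    def __init__(self, *children):
        self.children = children

    def tick(self, blackboard):
        for child in self.children:
            if child.tick(blackboard) is Status.SUCCESS:
                return Status.SUCCESS
        return Status.FAILURE


# Hypothetical knowledge base; a real system would query an LLM or a planner.
KNOWN_RECIPES = {
    "avocado toast": ["toast bread", "mash avocado", "spread avocado on toast"],
}


def is_known_recipe(bb):
    """Test node: keyword match standing in for an LLM-based check."""
    text = bb["instruction"].lower()
    for name in KNOWN_RECIPES:
        if name in text:
            bb["recipe"] = name
            return True
    return False


def plan_steps(bb):
    bb["actions"] = list(KNOWN_RECIPES[bb["recipe"]])
    return True


def ask_for_clarification(bb):
    bb["actions"] = ["ask user to clarify the request"]
    return True


TREE = Fallback(
    Sequence(Leaf(is_known_recipe), Leaf(plan_steps)),
    Leaf(ask_for_clarification),
)


def run(instruction):
    blackboard = {"instruction": instruction}
    TREE.tick(blackboard)
    return blackboard["actions"]
```

Calling `run("Please make me avocado toast")` yields the three recipe steps, while an instruction that matches no known recipe falls through to the clarification branch. Keeping each check in its own node is what lets failures be handled locally instead of relying on a single monolithic prompt.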

arXiv information

Authors	Alexander Leszczynski, Sarah Gillet, Iolanda Leite, Fethiye Irmak Dogan
Published	2025-04-03 17:19:52+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, DeepL

Categories: cs.RO

GRACE: Generating Socially Appropriate Robot Actions Leveraging LLMs and Human Explanations

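Posted: April 4, 2025 | Author: jarxiv

Summary

In human environments, robots need to handle complex tasks while adhering to social norms and accommodating individual preferences. For example, a household robot can predict from common-sense knowledge that it should avoid vacuuming during a social gathering, yet it may still not know whether to vacuum before or after guests arrive. In such cases, integrating common-sense knowledge with human preferences, which are often conveyed through human explanations, is fundamental but remains a challenge for existing systems. This paper introduces GRACE, a new approach that addresses this while generating socially appropriate robot actions. GRACE uses common-sense knowledge obtained from LLMs and integrates it with human explanations through a generative network. GRACE's bidirectional structure lets robots refine and enhance LLM predictions by using human explanations, and also lets them generate such explanations for human-specified actions. Our evaluation shows that integrating human explanations improves GRACE's performance: it outperforms several baselines and provides sensible explanations.

Summary (original)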

When operating in human environments, robots need to handle complex tasks while both adhering to social norms and accommodating individual preferences. For instance, based on common sense knowledge, a household robot can predict that it should avoid vacuuming during a social gathering, but it may still be uncertain whether it should vacuum before or after having guests. In such cases, integrating common-sense knowledge with human preferences, often conveyed through human explanations, is fundamental yet a challenge for existing systems. In this paper, we introduce GRACE, a novel approach addressing this while generating socially appropriate robot actions. GRACE leverages common sense knowledge from LLMs, and it integrates this knowledge with human explanations through a generative network. The bidirectional structure of GRACE enables robots to refine and enhance LLM predictions by utilizing human explanations and makes robots capable of generating such explanations for human-specified actions. Our evaluations show that integrating human explanations boosts GRACE’s performance, where it outperforms several baselines and provides sensible explanations.

arXiv information

Authors	Fethiye Irmak Dogan, Umut Ozyurt, Gizem Cinar, Hatice Gunes
Published	2025-04-03 17:31:57+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, DeepL

Categories: cs.RO

Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets

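Posted: April 4, 2025 | Author: jarxiv

Summary

Imitation learning has emerged as a promising approach to building generalist robots. However, scaling imitation learning to large robot foundation models remains difficult because it depends on high-quality expert demonstrations. Meanwhile, large amounts of video data showing a wide range of environments and diverse behaviors are readily available. This data is a rich source of information about real-world dynamics and agent-environment interactions. Using it directly for imitation learning has nevertheless proven difficult, because it lacks the action annotations that most contemporary methods require. In this work, we present Unified World Models (UWM), a framework that makes it possible to use both video and action data for policy learning. Specifically, a UWM integrates an action diffusion process and a video diffusion process within a unified transformer architecture, with an independent diffusion timestep governing each modality. We show that, simply by controlling each diffusion timestep, a UWM can flexibly represent a policy, a forward dynamics model, an inverse dynamics model, and a video generator. Through simulated and real-world experiments, we show that (1) UWM enables effective pretraining on large-scale multitask robot datasets with both dynamics and action predictions, yielding policies that are more generalizable and robust than those from imitation learning, and (2) UWM naturally supports learning from action-free video data through independent control of the modality-specific diffusion timesteps, which further improves the performance of finetuned policies. Our results suggest that UWM is a promising step toward using large, heterogeneous datasets for scalable robot learning, and that it offers a simple unification of the often disparate paradigms of imitation learning and world modeling. Videos and code are available at https://weirdlabuw.github.io/uwm/.

Summary (original)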

Imitation learning has emerged as a promising approach towards building generalist robots. However, scaling imitation learning for large robot foundation models remains challenging due to its reliance on high-quality expert demonstrations. Meanwhile, large amounts of video data depicting a wide range of environments and diverse behaviors are readily available. This data provides a rich source of information about real-world dynamics and agent-environment interactions. Leveraging this data directly for imitation learning, however, has proven difficult due to the lack of action annotation required for most contemporary methods. In this work, we present Unified World Models (UWM), a framework that allows for leveraging both video and action data for policy learning. Specifically, a UWM integrates an action diffusion process and a video diffusion process within a unified transformer architecture, where independent diffusion timesteps govern each modality. We show that by simply controlling each diffusion timestep, UWM can flexibly represent a policy, a forward dynamics, an inverse dynamics, and a video generator. Through simulated and real-world experiments, we show that: (1) UWM enables effective pretraining on large-scale multitask robot datasets with both dynamics and action predictions, resulting in more generalizable and robust policies than imitation learning, (2) UWM naturally facilitates learning from action-free video data through independent control of modality-specific diffusion timesteps, further improving the performance of finetuned policies. Our results suggest that UWM offers a promising step toward harnessing large, heterogeneous datasets for scalable robot learning, and provides a simple unification between the often disparate paradigms of imitation learning and world modeling. Videos and code are available at https://weirdlabuw.github.io/uwm/.
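One way to picture "controlling each diffusion timestep" is as a table that says, for each mode, whether a modality is sampled, given as clean conditioning, or left as pure noise. The sketch below encodes one plausible reading of that idea; the abstract does not spell out the mapping, so the `MODES` table, the convention that t = 0 is clean and t = T is pure noise, and the value of `T` are all assumptions.

```python
from enum import Enum

T = 1000  # hypothetical number of diffusion steps; t = 0 is clean, t = T is pure noise


class Role(Enum):
    SAMPLE = "sample"            # denoised from t = T down to t = 0 (the output)
    CONDITION = "condition"      # held at t = 0: supplied as a clean input
    MARGINALIZE = "marginalize"  # held at t = T: pure noise, carries no information


# Assumed mapping from mode to (action role, video role).
MODES = {
    "policy": (Role.SAMPLE, Role.MARGINALIZE),
    "video_generator": (Role.MARGINALIZE, Role.SAMPLE),
    "forward_dynamics": (Role.CONDITION, Role.SAMPLE),
    "inverse_dynamics": (Role.SAMPLE, Role.CONDITION),
}


def _timestep(role, t):
    """Timestep fed to the model for one modality at sweep position t."""
    if role is Role.SAMPLE:
        return t
    if role is Role.CONDITION:
        return 0
    return T


def timestep_schedule(mode, num_steps=4):
    """Return the (t_action, t_video) pairs visited while sampling in the given mode."""
    action_role, video_role = MODES[mode]
    sweep = [T - (T * k) // num_steps for k in range(num_steps + 1)]  # T, ..., 0
    return [(_timestep(action_role, t), _timestep(video_role, t)) for t in sweep]
```

For example, `timestep_schedule("policy")` sweeps the action timestep from `T` to 0 while the video timestep stays at `T`, and `timestep_schedule("forward_dynamics")` keeps the action at 0 while the video timestep sweeps. Under this reading, a single trained network serves all four modes because only its timestep inputs change.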

arXiv information

Authors	Chuning Zhu, Raymond Yu, Siyuan Feng, Benjamin Burchfiel, Paarth Shah, Abhishek Gupta
Published	2025-04-03 17:38:59+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, DeepL

Categories: cs.AI, cs.LG, cs.RO

Scaling Laws in Scientific Discovery with AI and Robot Scientists

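Posted: April 4, 2025 | Author: jarxiv

Summary

Scientific discovery is poised to advance rapidly through advanced robotics and artificial intelligence. Current scientific practice faces substantial limitations: manual experimentation is time-consuming and resource-intensive, while multidisciplinary research demands integrating knowledge beyond the boundaries of any individual researcher's expertise. Here we envision the concept of an autonomous generalist scientist (AGS) that combines agentic AI and embodied robotics to automate the entire research lifecycle. Such a system could interact dynamically with both physical and virtual environments while facilitating the integration of knowledge across diverse scientific disciplines. By deploying these technologies at every research stage, including literature review, hypothesis generation, experimentation, and manuscript writing, and by incorporating internal reflection alongside external feedback, the system aims to significantly reduce the time and resources needed for scientific discovery. Building on the evolution from virtual AI scientists to versatile, generalist AI-based robot scientists, AGS promises groundbreaking potential. As these autonomous systems become increasingly integrated into the research process, we hypothesize that scientific discovery may come to follow new scaling laws, potentially shaped by the number and capabilities of these systems, offering new perspectives on how knowledge is generated and evolves. The adaptability of embodied robots to extreme environments, combined with the flywheel effect of accumulating scientific knowledge, has the potential to continually push beyond both physical and intellectual frontiers.

Summary (original)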

Scientific discovery is poised for rapid advancement through advanced robotics and artificial intelligence. Current scientific practices face substantial limitations as manual experimentation remains time-consuming and resource-intensive, while multidisciplinary research demands knowledge integration beyond individual researchers’ expertise boundaries. Here, we envision an autonomous generalist scientist (AGS) concept that combines agentic AI and embodied robotics to automate the entire research lifecycle. This system could dynamically interact with both physical and virtual environments while facilitating the integration of knowledge across diverse scientific disciplines. By deploying these technologies throughout every research stage (spanning literature review, hypothesis generation, experimentation, and manuscript writing) and incorporating internal reflection alongside external feedback, this system aims to significantly reduce the time and resources needed for scientific discovery. Building on the evolution from virtual AI scientists to versatile generalist AI-based robot scientists, AGS promises groundbreaking potential. As these autonomous systems become increasingly integrated into the research process, we hypothesize that scientific discovery might adhere to new scaling laws, potentially shaped by the number and capabilities of these autonomous systems, offering novel perspectives on how knowledge is generated and evolves. The adaptability of embodied robots to extreme environments, paired with the flywheel effect of accumulating scientific knowledge, holds the promise of continually pushing beyond both physical and intellectual frontiers.

arXiv information

Authors	Pengsong Zhang, Heng Zhang, Huazhe Xu, Renjun Xu, Zhenting Wang, Cong Wang, Animesh Garg, Zhibin Li, Arash Ajoudani, Xinyu Liu
Published	2025-04-03 17:55:11+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, DeepL

Categories: cs.CL, cs.RO

SAT: Dynamic Spatial Aptitude Training for Multimodal Language Models

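Posted: April 4, 2025 | Author: jarxiv

Summary

Reasoning about motion and space is a fundamental cognitive capability required by many real-world applications. Many studies point out that large multimodal language models (MLMs) struggle to reason about space, but they focus only on static spatial relationships and not on dynamic awareness of motion and space, that is, reasoning about how egocentric and object motion affect spatial relationships. Manually annotating such object and camera movements is expensive. We therefore introduce SAT, a simulated spatial aptitude training dataset that covers both static and dynamic spatial reasoning across 175K question-answer (QA) pairs and 20K scenes. To complement it, we also build a small (150 image-QAs) but challenging dynamic spatial test set from real-world images. Using our SAT datasets and six existing static spatial benchmarks, we systematically investigate what improves both static and dynamic spatial awareness. The results show that simulation is surprisingly effective at giving MLMs spatial aptitude that carries over to real images. We show that perfect annotations in simulation are more effective than existing approaches that pseudo-annotate real images. For example, SAT training improves a LLaVA-13B model by 11% on average and a LLaVA-Video-7B model by 8% on average across multiple spatial benchmarks, including our real-image dynamic test set and spatial reasoning on long videos, even outperforming some large proprietary models. Reasoning over static relationships improves with synthetic training data, but considerable room for improvement remains on dynamic reasoning questions.

Summary (original)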

Reasoning about motion and space is a fundamental cognitive capability that is required by multiple real-world applications. While many studies highlight that large multimodal language models (MLMs) struggle to reason about space, they only focus on static spatial relationships, and not dynamic awareness of motion and space, i.e., reasoning about the effect of egocentric and object motions on spatial relationships. Manually annotating such object and camera movements is expensive. Hence, we introduce SAT, a simulated spatial aptitude training dataset comprising both static and dynamic spatial reasoning across 175K question-answer (QA) pairs and 20K scenes. Complementing this, we also construct a small (150 image-QAs) yet challenging dynamic spatial test set using real-world images. Leveraging our SAT datasets and 6 existing static spatial benchmarks, we systematically investigate what improves both static and dynamic spatial awareness. Our results reveal that simulations are surprisingly effective at imparting spatial aptitude to MLMs that translate to real images. We show that perfect annotations in simulation are more effective than existing approaches of pseudo-annotating real images. For instance, SAT training improves a LLaVA-13B model by an average 11% and a LLaVA-Video-7B model by an average 8% on multiple spatial benchmarks, including our real-image dynamic test set and spatial reasoning on long videos — even outperforming some large proprietary models. While reasoning over static relationships improves with synthetic training data, there is still considerable room for improvement for dynamic reasoning questions.

arXiv information

Authors	Arijit Ray, Jiafei Duan, Ellis Brown, Reuben Tan, Dina Bashkirova, Rose Hendrix, Kiana Ehsani, Aniruddha Kembhavi, Bryan A. Plummer, Ranjay Krishna, Kuo-Hao Zeng, Kate Saenko
Published	2025-04-03 17:59:24+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, DeepL

Categories: cs.AI, cs.CV, cs.GR, cs.RO

Bridging the Linguistic Divide: A Survey on Leveraging Large Language Models for Machine Translation

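Posted: April 4, 2025 | Author: jarxiv

Summary

The advent of Large Language Models (LLMs) has significantly reshaped the landscape of machine translation (MT), particularly for low-resource languages and domains that lack sufficient parallel corpora, linguistic tools, and computational infrastructure. This survey gives a comprehensive overview of recent progress in using LLMs for MT. It analyzes techniques such as few-shot prompting, cross-lingual transfer, and parameter-efficient fine-tuning that allow effective adaptation to under-resourced settings. It also examines strategies for generating synthetic data with LLMs, including back-translation and lexical augmentation. In addition, it compares LLM-based translation with traditional encoder-decoder models across diverse language pairs and identifies the strengths and limitations of each. It discusses persistent challenges such as hallucinations, evaluation inconsistencies, and inherited biases, and evaluates emerging LLM-driven metrics for translation quality. The survey offers practical insights and outlines future directions for building robust, inclusive, and scalable MT systems in the era of large-scale generative models.

Summary (original)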

The advent of Large Language Models (LLMs) has significantly reshaped the landscape of machine translation (MT), particularly for low-resource languages and domains that lack sufficient parallel corpora, linguistic tools, and computational infrastructure. This survey presents a comprehensive overview of recent progress in leveraging LLMs for MT. We analyze techniques such as few-shot prompting, cross-lingual transfer, and parameter-efficient fine-tuning that enable effective adaptation to under-resourced settings. The paper also explores synthetic data generation strategies using LLMs, including back-translation and lexical augmentation. Additionally, we compare LLM-based translation with traditional encoder-decoder models across diverse language pairs, highlighting the strengths and limitations of each. We discuss persistent challenges such as hallucinations, evaluation inconsistencies, and inherited biases while also evaluating emerging LLM-driven metrics for translation quality. This survey offers practical insights and outlines future directions for building robust, inclusive, and scalable MT systems in the era of large-scale generative models.
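Few-shot prompting, one of the techniques the survey lists, amounts to placing a handful of source-target example pairs before the sentence to translate. The sketch below only assembles such a prompt as a string; the function name, the prompt layout, and the example sentences are assumptions, and no particular model or API is implied.

```python
def build_few_shot_prompt(examples, source_text, src_lang, tgt_lang):
    """Assemble a translation prompt from (source, target) example pairs."""
    lines = [f"Translate from {src_lang} to {tgt_lang}."]
    for src, tgt in examples:
        lines.append(f"{src_lang}: {src}")
        lines.append(f"{tgt_lang}: {tgt}")
    # The final target line is left open for the model to complete.
    lines.append(f"{src_lang}: {source_text}")
    lines.append(f"{tgt_lang}:")
    return "\n".join(lines)


# Illustrative example pairs only.
demo_examples = [("Guten Morgen", "Good morning"), ("Danke", "Thank you")]
prompt = build_few_shot_prompt(demo_examples, "Wie geht es dir?", "German", "English")
```

In a low-resource setting, the example pairs would typically be drawn from whatever small parallel data exists, which is why the choice and number of examples matters more than the template itself.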

arXiv information

Authors	Baban Gain, Dibyanayan Bandyopadhyay, Asif Ekbal
Published	2025-04-03 13:30:35+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, DeepL

Categories: cs.AI, cs.CL
