jarxiv | Japanese arxiv | Page 820

Non-Adversarial Inverse Reinforcement Learning via Successor Feature Matching

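Posted: April 23, 2025 | Author: jarxiv

Summary

In inverse reinforcement learning (IRL), an agent tries to reproduce expert demonstrations by interacting with the environment.
Traditionally, IRL is framed as an adversarial game in which an adversary searches over reward models while the learner optimizes the reward through repeated RL procedures.
Solving this game is computationally expensive and hard to stabilize.
This work proposes a new approach to IRL based on direct policy optimization. Exploiting the linear factorization of the return as the inner product of successor features and a reward vector, we design an IRL algorithm that performs policy gradient descent on the gap between learner and expert features.
Our non-adversarial method does not require learning a reward function and can be solved seamlessly with existing actor-critic RL algorithms.
Remarkably, the approach works in state-only settings without expert action labels, a setting that behavior cloning (BC) cannot solve.
Empirical results show that the method learns from as few as a single expert demonstration and improves performance on a range of control tasks.

Summary (original)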

In inverse reinforcement learning (IRL), an agent seeks to replicate expert demonstrations through interactions with the environment. Traditionally, IRL is treated as an adversarial game, where an adversary searches over reward models, and a learner optimizes the reward through repeated RL procedures. This game-solving approach is both computationally expensive and difficult to stabilize. In this work, we propose a novel approach to IRL by direct policy optimization: exploiting a linear factorization of the return as the inner product of successor features and a reward vector, we design an IRL algorithm by policy gradient descent on the gap between the learner and expert features. Our non-adversarial method does not require learning a reward function and can be solved seamlessly with existing actor-critic RL algorithms. Remarkably, our approach works in state-only settings without expert action labels, a setting which behavior cloning (BC) cannot solve. Empirical results demonstrate that our method learns from as few as a single expert demonstration and achieves improved performance on various control tasks.
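
The linear factorization mentioned in the abstract can be illustrated with a small sketch. It assumes a hypothetical per-state feature vector phi(s) and a reward of the form r(s) = w · phi(s); under that assumption the discounted return of a trajectory equals w · psi, where psi is the discounted sum of features (the successor features). The function names, toy trajectories, and squared-distance gap below are illustrative assumptions, not the authors' implementation, and no policy optimization is shown.

```python
from typing import List, Sequence

Vector = List[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def successor_features(features: Sequence[Vector], gamma: float) -> Vector:
    """Discounted feature sum along one trajectory: psi = sum_t gamma^t * phi(s_t)."""
    psi = [0.0] * len(features[0])
    discount = 1.0
    for phi in features:
        for i, value in enumerate(phi):
            psi[i] += discount * value
        discount *= gamma
    return psi


def discounted_return(features: Sequence[Vector], w: Vector, gamma: float) -> float:
    """Return computed step by step, with reward r_t = w . phi(s_t)."""
    total = 0.0
    discount = 1.0
    for phi in features:
        total += discount * dot(w, phi)
        discount *= gamma
    return total


def feature_gap(learner_psi: Vector, expert_psi: Vector) -> float:
    """Squared Euclidean distance between learner and expert successor features."""
    return sum((l - e) ** 2 for l, e in zip(learner_psi, expert_psi))


# Toy state-only trajectories (hypothetical 2-dimensional features per state).
expert_traj = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
learner_traj = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
gamma = 0.5
w = [2.0, -1.0]  # arbitrary reward vector, used only to check the factorization

expert_psi = successor_features(expert_traj, gamma)
learner_psi = successor_features(learner_traj, gamma)

# The return factorizes as w . psi, so matching psi matches the return for any w.
print(dot(w, expert_psi), discounted_return(expert_traj, w, gamma))
print(feature_gap(learner_psi, expert_psi))
```

Because the gap is computed from state features alone, this quantity needs no expert action labels, which is consistent with the state-only setting the abstract describes.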

arXiv information

Authors	Arnav Kumar Jain, Harley Wiltzer, Jesse Farebrother, Irina Rish, Glen Berseth, Sanjiban Choudhury
Published	2025-04-22 17:59:03+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google

Categories: cs.AI, cs.LG

Expanding the Generative AI Design Space through Structured Prompting and Multimodal Interfaces

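Posted: April 23, 2025 | Author: jarxiv

Summary

Text-based prompting remains the dominant interaction paradigm in generative AI, yet it often creates friction for novice users such as small business owners (SBOs), who struggle to articulate creative goals in domain-specific contexts like advertising.
Through a formative study with six SBOs in the United Kingdom, we identify three key challenges: difficulty expressing brand intuition through prompts, limited opportunities for fine-grained adjustment and refinement during and after content generation, and frequent production of generic content that lacks brand specificity.
In response, we present ACAI (AI Co-Creation for Advertising and Inspiration), a multimodal generative AI tool designed to support novice designers by moving beyond traditional prompt interfaces.
ACAI features a structured input system made up of three panels: Branding, Audience and Goals, and the Inspiration Board.
These inputs let users convey brand-relevant context and visual preferences.
This work contributes to HCI research on generative systems by showing how structured interfaces can foreground user-defined context, improve alignment, and strengthen co-creative control in novice creative workflows.

Summary (original)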

Text-based prompting remains the predominant interaction paradigm in generative AI, yet it often introduces friction for novice users such as small business owners (SBOs), who struggle to articulate creative goals in domain-specific contexts like advertising. Through a formative study with six SBOs in the United Kingdom, we identify three key challenges: difficulties in expressing brand intuition through prompts, limited opportunities for fine-grained adjustment and refinement during and after content generation, and the frequent production of generic content that lacks brand specificity. In response, we present ACAI (AI Co-Creation for Advertising and Inspiration), a multimodal generative AI tool designed to support novice designers by moving beyond traditional prompt interfaces. ACAI features a structured input system composed of three panels: Branding, Audience and Goals, and the Inspiration Board. These inputs allow users to convey brand-relevant context and visual preferences. This work contributes to HCI research on generative systems by showing how structured interfaces can foreground user-defined context, improve alignment, and enhance co-creative control in novice creative workflows.

arXiv information

Authors	Nimisha Karnatak, Adrien Baranes, Rob Marchant, Huinan Zeng, Tríona Butler, Kristen Olson
Published	2025-04-22 17:59:41+00:00
arXiv site	arxiv_id(pdf)

Categories: cs.AI, cs.HC

Bayesian Cross-Modal Alignment Learning for Few-Shot Out-of-Distribution Generalization

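Posted: April 23, 2025 | Author: jarxiv

Summary

Recent advances in large pre-trained models have shown promising results in few-shot learning.
However, their generalization ability on two-dimensional out-of-distribution (OoD) data, that is, under correlation shift and diversity shift, has not been thoroughly investigated.
Studies have shown that, even with a substantial amount of training data, few methods outperform standard empirical risk minimization (ERM) in OoD generalization.
This few-shot OoD generalization dilemma has emerged as a challenging direction in research on deep neural network generalization, where performance suffers from both overfitting to the few-shot examples and OoD generalization error.
In this paper, we leverage a broader source of supervision and explore a novel Bayesian cross-modal image-text alignment learning method (Bayes-CAL) to address this problem.
Specifically, the model is designed so that only the text representations are fine-tuned, via a Bayesian modeling approach with a gradient orthogonalization loss and an invariant risk minimization (IRM) loss.
The Bayesian approach is introduced to avoid overfitting to the base classes observed during training and to improve generalization to a broader set of unseen classes.
The dedicated loss is introduced to achieve better image-text alignment by disentangling the causal and non-causal parts of the image features.
Numerical experiments show that Bayes-CAL achieves state-of-the-art OoD generalization performance under two-dimensional distribution shifts.
Moreover, compared with CLIP-like models, Bayes-CAL yields more stable generalization performance on unseen classes.
Our code is available at https://github.com/LinLLLL/BayesCAL.

Summary (original)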

Recent advances in large pre-trained models showed promising results in few-shot learning. However, their generalization ability on two-dimensional Out-of-Distribution (OoD) data, i.e., correlation shift and diversity shift, has not been thoroughly investigated. Researches have shown that even with a significant amount of training data, few methods can achieve better performance than the standard empirical risk minimization method (ERM) in OoD generalization. This few-shot OoD generalization dilemma emerges as a challenging direction in deep neural network generalization research, where the performance suffers from overfitting on few-shot examples and OoD generalization errors. In this paper, leveraging a broader supervision source, we explore a novel Bayesian cross-modal image-text alignment learning method (Bayes-CAL) to address this issue. Specifically, the model is designed as only text representations are fine-tuned via a Bayesian modelling approach with gradient orthogonalization loss and invariant risk minimization (IRM) loss. The Bayesian approach is essentially introduced to avoid overfitting the base classes observed during training and improve generalization to broader unseen classes. The dedicated loss is introduced to achieve better image-text alignment by disentangling the causal and non-casual parts of image features. Numerical experiments demonstrate that Bayes-CAL achieved state-of-the-art OoD generalization performances on two-dimensional distribution shifts. Moreover, compared with CLIP-like models, Bayes-CAL yields more stable generalization performances on unseen classes. Our code is available at https://github.com/LinLLLL/BayesCAL.

arXiv information

Authors	Lin Zhu, Xinbing Wang, Chenghu Zhou, Nanyang Ye
Published	2025-04-22 10:59:00+00:00
arXiv site	arxiv_id(pdf)

Categories: cs.CV, cs.LG

MObI: Multimodal Object Inpainting Using Diffusion Models

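Posted: April 23, 2025 | Author: jarxiv

Summary

Safety-critical applications such as autonomous driving require extensive multimodal data for rigorous testing.
Methods based on synthetic data are gaining prominence because collecting real-world data is costly and complex, but they need a high degree of realism and controllability to be useful.
This paper introduces MObI, a novel framework for Multimodal Object Inpainting that uses a diffusion model to create realistic and controllable object inpaintings across perceptual modalities, demonstrated for camera and lidar simultaneously.
Using a single reference RGB image, MObI lets objects be inserted seamlessly into existing multimodal scenes at a 3D location specified by a bounding box, while maintaining semantic consistency and multimodal coherence.
Unlike traditional inpainting methods that rely only on edit masks, the 3D bounding box conditioning gives objects accurate spatial positioning and realistic scaling.
As a result, the approach can be used to insert novel objects flexibly into multimodal scenes, which offers significant advantages for testing perception models.

Summary (original)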

Safety-critical applications, such as autonomous driving, require extensive multimodal data for rigorous testing. Methods based on synthetic data are gaining prominence due to the cost and complexity of gathering real-world data but require a high degree of realism and controllability in order to be useful. This paper introduces MObI, a novel framework for Multimodal Object Inpainting that leverages a diffusion model to create realistic and controllable object inpaintings across perceptual modalities, demonstrated for both camera and lidar simultaneously. Using a single reference RGB image, MObI enables objects to be seamlessly inserted into existing multimodal scenes at a 3D location specified by a bounding box, while maintaining semantic consistency and multimodal coherence. Unlike traditional inpainting methods that rely solely on edit masks, our 3D bounding box conditioning gives objects accurate spatial positioning and realistic scaling. As a result, our approach can be used to insert novel objects flexibly into multimodal scenes, providing significant advantages for testing perception models.

arXiv information

Authors	Alexandru Buburuzan, Anuj Sharma, John Redford, Puneet K. Dokania, Romain Mueller
Published	2025-04-22 11:09:48+00:00
arXiv site	arxiv_id(pdf)

Categories: cs.CV

Development and evaluation of a deep learning algorithm for German word recognition from lip movements

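Posted: April 23, 2025 | Author: jarxiv

Summary

Many people benefit from the additional visual information in a speaker's lip movements, but lip reading is very error-prone.
Lip-reading algorithms based on artificial neural networks significantly improve word recognition, but none are available for the German language.
A total of 1806 video clips, each showing only one German-speaking person, were selected, split into word segments, and assigned to word classes using speech-recognition software.
From 38,391 video segments with 32 speakers, 18 polysyllabic, visually distinguishable words were used to train and validate a neural network.
A 3D convolutional neural network, a gated recurrent unit model, and a combination of the two (GRUConv) were compared, as were different image sections and color spaces of the videos.
Accuracy was determined over 5000 training epochs.
Comparing color spaces revealed no relevant differences in correct classification rates, which ranged from 69% to 72%.
Cropping to the lips achieved a significantly higher accuracy (70%) than cropping to the speaker's entire face (34%).
With the GRUConv model, the maximum accuracy was 87% with known speakers and 63% in validation with unknown speakers.
This neural network for lip reading, the first developed for the German language, shows a very high level of accuracy, comparable to English-language algorithms.
It also works with unknown speakers and can be generalized to more word classes.

Summary (original)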

When reading lips, many people benefit from additional visual information from the lip movements of the speaker, which is, however, very error prone. Algorithms for lip reading with artificial intelligence based on artificial neural networks significantly improve word recognition but are not available for the German language. A total of 1806 video clips with only one German-speaking person each were selected, split into word segments, and assigned to word classes using speech-recognition software. In 38,391 video segments with 32 speakers, 18 polysyllabic, visually distinguishable words were used to train and validate a neural network. The 3D Convolutional Neural Network and Gated Recurrent Units models and a combination of both models (GRUConv) were compared, as were different image sections and color spaces of the videos. The accuracy was determined in 5000 training epochs. Comparison of the color spaces did not reveal any relevant different correct classification rates in the range from 69% to 72%. With a cut to the lips, a significantly higher accuracy of 70% was achieved than when cut to the entire speaker’s face (34%). With the GRUConv model, the maximum accuracies were 87% with known speakers and 63% in the validation with unknown speakers. The neural network for lip reading, which was first developed for the German language, shows a very high level of accuracy, comparable to English-language algorithms. It works with unknown speakers as well and can be generalized with more word classes.

arXiv information

Authors	Dinh Nam Pham, Torsten Rahne
Published	2025-04-22 11:12:00+00:00
arXiv site	arxiv_id(pdf)

Categories: cs.CV

Locating and Mitigating Gradient Conflicts in Point Cloud Domain Adaptation via Saliency Map Skewness

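Posted: April 23, 2025 | Author: jarxiv

Summary

Object classification models that use point cloud data are fundamental to 3D media understanding, yet they often struggle in unseen or out-of-distribution (OOD) scenarios.
Existing point cloud unsupervised domain adaptation (UDA) methods typically adopt a multi-task learning (MTL) framework that combines the primary classification task with auxiliary self-supervision tasks to bridge the gap between cross-domain feature distributions.
However, further experiments show that not all gradients from the self-supervision tasks are beneficial, and some may harm classification performance.
In this paper, we propose a new solution, the Saliency Map-based Data Sampling Block (SM-DSB), to mitigate these gradient conflicts.
Specifically, our method designs a new scoring mechanism based on the skewness of 3D saliency maps to estimate gradient conflicts without requiring target labels.
Building on this, we develop a sample selection strategy that dynamically filters out samples whose self-supervision gradients do not benefit classification.
Our approach is scalable, adds modest computational overhead, and can be integrated into any point cloud UDA MTL framework.
Extensive evaluations show that our method outperforms state-of-the-art approaches.
In addition, we offer a new perspective on understanding the UDA problem through back-propagation analysis.

Summary (original)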

Object classification models utilizing point cloud data are fundamental for 3D media understanding, yet they often struggle with unseen or out-of-distribution (OOD) scenarios. Existing point cloud unsupervised domain adaptation (UDA) methods typically employ a multi-task learning (MTL) framework that combines primary classification tasks with auxiliary self-supervision tasks to bridge the gap between cross-domain feature distributions. However, our further experiments demonstrate that not all gradients from self-supervision tasks are beneficial and some may negatively impact the classification performance. In this paper, we propose a novel solution, termed Saliency Map-based Data Sampling Block (SM-DSB), to mitigate these gradient conflicts. Specifically, our method designs a new scoring mechanism based on the skewness of 3D saliency maps to estimate gradient conflicts without requiring target labels. Leveraging this, we develop a sample selection strategy that dynamically filters out samples whose self-supervision gradients are not beneficial for the classification. Our approach is scalable, introducing modest computational overhead, and can be integrated into all the point cloud UDA MTL frameworks. Extensive evaluations demonstrate that our method outperforms state-of-the-art approaches. In addition, we provide a new perspective on understanding the UDA problem through back-propagation analysis.
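
As a rough illustration of a skewness-based score, the sketch below computes the skewness of per-point saliency values for each sample and keeps the samples whose score passes a threshold. The abstract does not say how saliency is computed, which direction of skewness indicates a gradient conflict, or what threshold is used; the `select_samples` rule, the threshold value, and the toy saliency values here are all assumptions made for the example.

```python
from typing import List, Sequence


def skewness(values: Sequence[float]) -> float:
    """Population skewness: third central moment divided by variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    if m2 == 0.0:
        return 0.0  # constant saliency has no asymmetry
    return m3 / m2 ** 1.5


def select_samples(saliency_maps: Sequence[Sequence[float]], threshold: float) -> List[int]:
    """Indices of samples whose saliency skewness is at least `threshold`.

    Keeping high-skew samples is a hypothetical rule chosen for illustration.
    """
    return [i for i, saliency in enumerate(saliency_maps) if skewness(saliency) >= threshold]


# Toy per-point saliency values for three point clouds (four points each).
saliency_maps = [
    [0.1, 0.1, 0.1, 0.9],  # a single dominant point: positive skew
    [0.2, 0.4, 0.6, 0.8],  # evenly spread: skew close to zero
    [0.9, 0.9, 0.9, 0.1],  # mostly high saliency: negative skew
]

scores = [skewness(s) for s in saliency_maps]
kept = select_samples(saliency_maps, threshold=0.5)
print(scores, kept)
```

The score depends only on the saliency values of each sample, so no target-domain labels are needed to compute it.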

arXiv information

Authors	Jiaqi Tang, Yinsong Xu, Qingchao Chen
Published	2025-04-22 11:16:19+00:00
arXiv site	arxiv_id(pdf)

Categories: cs.CV, cs.LG

Localization Meets Uncertainty: Uncertainty-Aware Multi-Modal Localization

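Posted: April 23, 2025 | Author: jarxiv

Summary

Reliable localization is critical for robot navigation in complex indoor environments.
This paper proposes an uncertainty-aware localization method that makes localization outputs more reliable without modifying the prediction model itself.
The study introduces a percentile-based rejection strategy that filters out unreliable 3-DoF pose predictions based on the aleatoric and epistemic uncertainties estimated by the network.
We apply this approach to multi-modal end-to-end localization that fuses RGB images and 2D LiDAR data, and evaluate it on three real-world datasets collected with a commercial serving robot.
Experimental results show that applying stricter uncertainty thresholds consistently improves pose accuracy.
Specifically, with thresholds of 90%, 80%, and 70%, the mean position error is reduced by 41.0%, 56.7%, and 69.4%, and the mean orientation error by 55.6%, 65.7%, and 73.3%, respectively.
Furthermore, the rejection strategy effectively removes extreme outliers, giving better alignment with ground-truth trajectories.
To the best of our knowledge, this is the first study to quantitatively demonstrate the benefits of percentile-based uncertainty rejection in multi-modal end-to-end localization tasks.
Our approach offers a practical way to improve the reliability and accuracy of localization systems in real-world deployments.

Summary (original)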

Reliable localization is critical for robot navigation in complex indoor environments. In this paper, we propose an uncertainty-aware localization method that enhances the reliability of localization outputs without modifying the prediction model itself. This study introduces a percentile-based rejection strategy that filters out unreliable 3-DoF pose predictions based on aleatoric and epistemic uncertainties the network estimates. We apply this approach to a multi-modal end-to-end localization that fuses RGB images and 2D LiDAR data, and we evaluate it across three real-world datasets collected using a commercialized serving robot. Experimental results show that applying stricter uncertainty thresholds consistently improves pose accuracy. Specifically, the mean position error is reduced by 41.0%, 56.7%, and 69.4%, and the mean orientation error by 55.6%, 65.7%, and 73.3%, when applying 90%, 80%, and 70% thresholds, respectively. Furthermore, the rejection strategy effectively removes extreme outliers, resulting in better alignment with ground truth trajectories. To the best of our knowledge, this is the first study to quantitatively demonstrate the benefits of percentile-based uncertainty rejection in multi-modal end-to-end localization tasks. Our approach provides a practical means to enhance the reliability and accuracy of localization systems in real-world deployments.
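
A percentile-based rejection step of the kind described above might look like the following sketch. It assumes that the two uncertainty estimates are combined by simple addition and that the cutoff is a nearest-rank percentile of the combined value; the abstract specifies neither, so both choices, along with the function names and toy numbers, are illustrative assumptions rather than the authors' procedure.

```python
import math
from typing import List, Sequence, Tuple

Pose = Tuple[float, float, float]  # (x, y, yaw), a 3-DoF pose


def percentile_nearest_rank(values: Sequence[float], pct: int) -> float:
    """Smallest value such that at least pct percent of the data is at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def reject_uncertain(
    poses: Sequence[Pose],
    aleatoric: Sequence[float],
    epistemic: Sequence[float],
    keep_pct: int,
) -> List[Pose]:
    """Keep poses whose combined uncertainty is at or below the keep_pct percentile."""
    # Assumption: total uncertainty is the sum of the two estimates.
    total = [a + e for a, e in zip(aleatoric, epistemic)]
    cutoff = percentile_nearest_rank(total, keep_pct)
    return [pose for pose, u in zip(poses, total) if u <= cutoff]


# Toy predictions: ten poses with made-up uncertainty values.
poses = [(float(i), float(i), 0.0) for i in range(10)]
aleatoric = [1.0, 4.0, 2.0, 8.0, 3.0, 9.0, 5.0, 2.5, 7.0, 6.0]
epistemic = [0.5] * 10

kept = reject_uncertain(poses, aleatoric, epistemic, keep_pct=70)
print(len(kept), [p[0] for p in kept])
```

Lowering `keep_pct` makes the filter stricter, which matches the abstract's observation that tighter thresholds trade coverage for accuracy. The prediction model is untouched; only its outputs are filtered.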

arXiv information

Authors	Hye-Min Won, Jieun Lee, Jiyong Oh
Published	2025-04-22 11:34:10+00:00
arXiv site	arxiv_id(pdf)

Categories: cs.CV, cs.RO

Normal-guided Detail-Preserving Neural Implicit Function for High-Fidelity 3D Surface Reconstruction

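Posted: April 23, 2025 | Author: jarxiv

Summary

Neural implicit representations have emerged as a powerful paradigm for 3D reconstruction.
Despite their success, however, existing methods fail to capture fine geometric details and thin structures, especially when only sparse multi-view RGB images of the objects of interest are available.
This paper shows that training neural representations with first-order differential properties (surface normals) leads to highly accurate 3D surface reconstruction, even with as few as two RGB images.
From the input RGB images, we compute approximate ground-truth surface normals from depth maps produced by an off-the-shelf monocular depth estimator.
During training, we directly locate the surface point of the SDF network and supervise its normal with the one estimated from the depth map.
Extensive experiments show that our method achieves state-of-the-art reconstruction accuracy with a minimal number of views, capturing intricate geometric details and thin structures that were previously hard to capture.

Summary (original)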

Neural implicit representations have emerged as a powerful paradigm for 3D reconstruction. However, despite their success, existing methods fail to capture fine geometric details and thin structures, especially in scenarios where only sparse multi-view RGB images of the objects of interest are available. This paper shows that training neural representations with first-order differential properties (surface normals) leads to highly accurate 3D surface reconstruction, even with as few as two RGB images. Using input RGB images, we compute approximate ground-truth surface normals from depth maps produced by an off-the-shelf monocular depth estimator. During training, we directly locate the surface point of the SDF network and supervise its normal with the one estimated from the depth map. Extensive experiments demonstrate that our method achieves state-of-the-art reconstruction accuracy with a minimal number of views, capturing intricate geometric details and thin structures that were previously challenging to capture.
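
One common way to approximate surface normals from a depth map is to take finite differences of the depth and normalize the vector (-dz/dx, -dz/dy, 1). The sketch below does this under simplifying assumptions that are not taken from the paper: an orthographic camera, unit pixel spacing, and central differences on interior pixels only. The function name and toy depth maps are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Normal = Tuple[float, float, float]


def normals_from_depth(depth: Sequence[Sequence[float]]) -> List[List[Normal]]:
    """Unit normals for interior pixels of a depth map, via central differences."""
    height, width = len(depth), len(depth[0])
    normals: List[List[Normal]] = []
    for y in range(1, height - 1):
        row: List[Normal] = []
        for x in range(1, width - 1):
            dzdx = (depth[y][x + 1] - depth[y][x - 1]) / 2.0
            dzdy = (depth[y + 1][x] - depth[y - 1][x]) / 2.0
            # The surface z = f(x, y) has an (unnormalized) normal (-f_x, -f_y, 1).
            nx, ny, nz = -dzdx, -dzdy, 1.0
            length = math.sqrt(nx * nx + ny * ny + nz * nz)
            row.append((nx / length, ny / length, nz / length))
        normals.append(row)
    return normals


# A flat surface faces the camera; a plane tilted along x leans its normal away.
flat_depth = [[2.0] * 3 for _ in range(3)]
tilted_depth = [[float(x) for x in range(3)] for _ in range(3)]

flat_normals = normals_from_depth(flat_depth)
tilted_normals = normals_from_depth(tilted_depth)
print(flat_normals[0][0], tilted_normals[0][0])
```

Normals obtained this way could then serve as the supervision targets the abstract mentions, with the caveat that a real pipeline would account for camera intrinsics and the scale ambiguity of monocular depth.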

arXiv information

Authors	Aarya Patel, Hamid Laga, Ojaswa Sharma
Published	2025-04-22 11:40:51+00:00
arXiv site	arxiv_id(pdf)

Categories: cs.CV, cs.GR, I.3.5

Human-Imperceptible Physical Adversarial Attack for NIR Face Recognition Models

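Posted: April 23, 2025 | Author: jarxiv

Summary

Near-infrared (NIR) face recognition systems, which operate effectively in low light or in the presence of makeup, are vulnerable to physical adversarial attacks.
To further demonstrate the potential risks in real-world applications, we design a novel, stealthy, and practical adversarial patch that attacks NIR face recognition systems in a black-box setting.
To achieve this, we use human-imperceptible infrared-absorbing ink to generate multiple patches whose shapes and positions are digitally optimized for infrared images.
To address the optimization mismatch between digital and real-world NIR imaging, we develop a light reflection model for human skin that minimizes pixel-level discrepancies by simulating NIR light reflection.
Compared with state-of-the-art (SOTA) physical attacks on NIR face recognition systems, experimental results show that our method improves the attack success rate in both the digital and physical domains, and in particular remains effective across various face postures.
Notably, the proposed approach outperforms SOTA methods, achieving an average attack success rate of 82.46% in the physical domain across different models, compared with 64.18% for existing methods.
The artifact is available at https://anonymous.4open.science/r/Human-imperceptible-adversarial-patch-0703/.

Summary (original)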

Near-infrared (NIR) face recognition systems, which can operate effectively in low-light conditions or in the presence of makeup, exhibit vulnerabilities when subjected to physical adversarial attacks. To further demonstrate the potential risks in real-world applications, we design a novel, stealthy, and practical adversarial patch to attack NIR face recognition systems in a black-box setting. We achieved this by utilizing human-imperceptible infrared-absorbing ink to generate multiple patches with digitally optimized shapes and positions for infrared images. To address the optimization mismatch between digital and real-world NIR imaging, we develop a light reflection model for human skin to minimize pixel-level discrepancies by simulating NIR light reflection. Compared to state-of-the-art (SOTA) physical attacks on NIR face recognition systems, the experimental results show that our method improves the attack success rate in both digital and physical domains, particularly maintaining effectiveness across various face postures. Notably, the proposed approach outperforms SOTA methods, achieving an average attack success rate of 82.46% in the physical domain across different models, compared to 64.18% for existing methods. The artifact is available at https://anonymous.4open.science/r/Human-imperceptible-adversarial-patch-0703/.

arXiv information

Authors	Songyan Xie, Jinghang Wen, Encheng Su, Qiucheng Yu
Published	2025-04-22 12:10:25+00:00
arXiv site	arxiv_id(pdf)

Categories: cs.AI, cs.CV

FocusedAD: Character-centric Movie Audio Description

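Posted: April 23, 2025 | Author: jarxiv

Summary

Movie Audio Description (AD) aims to narrate visual content during dialogue-free segments.
Compared with general video captioning, AD demands plot-relevant narration with explicit character name references, which poses unique challenges for movie understanding.
To identify active main characters and focus on storyline-relevant regions, we propose FocusedAD, a novel framework that delivers character-centric movie audio descriptions. It includes:
(i) a Character Perception Module (CPM) for tracking character regions and linking them to names;
(ii) a Dynamic Prior Module (DPM) that injects contextual cues from prior ADs and subtitles via learnable soft prompts; and
(iii) a Focused Caption Module (FCM) that generates narrations enriched with plot-relevant details and named characters.
To overcome limitations in character identification, we also introduce an automated pipeline for building character query banks.
FocusedAD achieves state-of-the-art performance on multiple benchmarks, including strong zero-shot results on MAD-eval-Named and the newly proposed Cinepile-AD dataset.
Code and data will be released at https://github.com/Thorin215/FocusedAD.

Summary (original)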

Movie Audio Description (AD) aims to narrate visual content during dialogue-free segments, particularly benefiting blind and visually impaired (BVI) audiences. Compared with general video captioning, AD demands plot-relevant narration with explicit character name references, posing unique challenges in movie understanding.To identify active main characters and focus on storyline-relevant regions, we propose FocusedAD, a novel framework that delivers character-centric movie audio descriptions. It includes: (i) a Character Perception Module(CPM) for tracking character regions and linking them to names; (ii) a Dynamic Prior Module(DPM) that injects contextual cues from prior ADs and subtitles via learnable soft prompts; and (iii) a Focused Caption Module(FCM) that generates narrations enriched with plot-relevant details and named characters. To overcome limitations in character identification, we also introduce an automated pipeline for building character query banks. FocusedAD achieves state-of-the-art performance on multiple benchmarks, including strong zero-shot results on MAD-eval-Named and our newly proposed Cinepile-AD dataset. Code and data will be released at https://github.com/Thorin215/FocusedAD .

arXiv information

Authors	Xiaojun Ye, Chun Wang, Yiren Song, Sheng Zhou, Liangcheng Li, Jiajun Bu
Published	2025-04-22 12:25:51+00:00
arXiv site	arxiv_id(pdf)

Categories: cs.CV, I.2.10
