jarxiv | Japanese arxiv | Page 225

Deep Learning for Retinal Degeneration Assessment: A Comprehensive Analysis of the MARIO AMD Progression Challenge

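Posted: June 4, 2025 | Author: jarxiv

Summary

The MARIO challenge, held at MICCAI 2024, focused on advancing the automated detection and monitoring of age-related macular degeneration (AMD) through analysis of optical coherence tomography (OCT) images. Designed to evaluate how well algorithms detect changes in neovascular activity in AMD, the challenge incorporated unique multimodal datasets. The primary dataset, from Brest, France, was used by participating teams to train and test their models, and the final ranking was based on performance on this dataset. An auxiliary dataset from Algeria was used after the challenge to evaluate how the submitted solutions handle population and device shifts. The challenge comprised two tasks. The first was classifying the evolution between two consecutive 2D OCT B-scans. The second was predicting future AMD evolution over three months for patients receiving anti-vascular endothelial growth factor (VEGF) therapy. Thirty-five teams participated, and the top 12 finalists presented their methods. This paper outlines the challenge's structure, tasks, data characteristics, and winning methods, and sets a benchmark for AMD monitoring using OCT, infrared imaging, and clinical data (such as number of visits, age, and sex). The results indicate that artificial intelligence (AI) performs on par with physicians at measuring AMD progression (Task 1) but cannot yet predict future evolution (Task 2).

Summary (original)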

The MARIO challenge, held at MICCAI 2024, focused on advancing the automated detection and monitoring of age-related macular degeneration (AMD) through the analysis of optical coherence tomography (OCT) images. Designed to evaluate algorithmic performance in detecting neovascular activity changes within AMD, the challenge incorporated unique multi-modal datasets. The primary dataset, sourced from Brest, France, was used by participating teams to train and test their models. The final ranking was determined based on performance on this dataset. An auxiliary dataset from Algeria was used post-challenge to evaluate population and device shifts from submitted solutions. Two tasks were involved in the MARIO challenge. The first one was the classification of evolution between two consecutive 2D OCT B-scans. The second one was the prediction of future AMD evolution over three months for patients undergoing anti-vascular endothelial growth factor (VEGF) therapy. Thirty-five teams participated, with the top 12 finalists presenting their methods. This paper outlines the challenge’s structure, tasks, data characteristics, and winning methodologies, setting a benchmark for AMD monitoring using OCT, infrared imaging, and clinical data (such as the number of visits, age, gender, etc.). The results of this challenge indicate that artificial intelligence (AI) performs as well as a physician in measuring AMD progression (Task 1) but is not yet capable of predicting future evolution (Task 2).

arXiv information

Authors	Rachid Zeghlache,Ikram Brahim,Pierre-Henri Conze,Mathieu Lamard,Mohammed El Amine Lazouni,Zineb Aziza Elaouaber,Leila Ryma Lazouni,Christopher Nielsen,Ahmad O. Ahsan,Matthias Wilms,Nils D. Forkert,Lovre Antonio Budimir,Ivana Matovinović,Donik Vršnak,Sven Lončarić,Philippe Zhang,Weili Jiang,Yihao Li,Yiding Hao,Markus Frohmann,Patrick Binder,Marcel Huber,Taha Emre,Teresa Finisterra Araújo,Marzieh Oghbaie,Hrvoje Bogunović,Amerens A. Bekkers,Nina M. van Liebergen,Hugo J. Kuijf,Abdul Qayyum,Moona Mazher,Steven A. Niederer,Alberto J. Beltrán-Carrero,Juan J. Gómez-Valverde,Javier Torresano-Rodríquez,Álvaro Caballero-Sastre,María J. Ledesma Carbayo,Yosuke Yamagishi,Yi Ding,Robin Peretzke,Alexandra Ertl,Maximilian Fischer,Jessica Kächele,Sofiane Zehar,Karim Boukli Hacene,Thomas Monfort,Béatrice Cochener,Mostafa El Habib Daho,Anas-Alexis Benyoussef,Gwenolé Quellec
Published	2025-06-03 15:14:10+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, DeepL

Categories: cs.AI, cs.CV

HunyuanVideo-Avatar: High-Fidelity Audio-Driven Human Animation for Multiple Characters

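Posted: June 4, 2025 | Author: jarxiv

Summary

Audio-driven human animation has advanced considerably in recent years. However, critical challenges remain in (i) generating highly dynamic videos while preserving character consistency, (ii) achieving precise emotion alignment between characters and audio, and (iii) enabling multi-character audio-driven animation. To address these challenges, we propose HunyuanVideo-Avatar, a model based on a multimodal diffusion transformer (MM-DiT) that can simultaneously generate dynamic, emotion-controllable, multi-character dialogue videos. Specifically, HunyuanVideo-Avatar introduces three key innovations. (i) A character image injection module replaces the conventional addition-based character conditioning scheme, eliminating the inherent condition mismatch between training and inference. (ii) An Audio Emotion Module (AEM) extracts emotional cues from an emotion reference image and transfers them to the target generated video, enabling fine-grained and accurate emotion style control. (iii) A Face-Aware Audio Adapter (FAA) isolates the audio-driven character with a latent-level face mask, enabling independent audio injection via cross-attention in multi-character scenarios. With these innovations, HunyuanVideo-Avatar surpasses state-of-the-art methods on benchmark datasets and a newly proposed in-the-wild dataset, generating realistic avatars in dynamic, immersive scenarios.

Summary (original)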

Recent years have witnessed significant progress in audio-driven human animation. However, critical challenges remain in (i) generating highly dynamic videos while preserving character consistency, (ii) achieving precise emotion alignment between characters and audio, and (iii) enabling multi-character audio-driven animation. To address these challenges, we propose HunyuanVideo-Avatar, a multimodal diffusion transformer (MM-DiT)-based model capable of simultaneously generating dynamic, emotion-controllable, and multi-character dialogue videos. Concretely, HunyuanVideo-Avatar introduces three key innovations: (i) A character image injection module is designed to replace the conventional addition-based character conditioning scheme, eliminating the inherent condition mismatch between training and inference. This ensures the dynamic motion and strong character consistency; (ii) An Audio Emotion Module (AEM) is introduced to extract and transfer the emotional cues from an emotion reference image to the target generated video, enabling fine-grained and accurate emotion style control; (iii) A Face-Aware Audio Adapter (FAA) is proposed to isolate the audio-driven character with latent-level face mask, enabling independent audio injection via cross-attention for multi-character scenarios. These innovations empower HunyuanVideo-Avatar to surpass state-of-the-art methods on benchmark datasets and a newly proposed wild dataset, generating realistic avatars in dynamic, immersive scenarios.

arXiv information

Authors	Yi Chen,Sen Liang,Zixiang Zhou,Ziyao Huang,Yifeng Ma,Junshu Tang,Qin Lin,Yuan Zhou,Qinglin Lu
Published	2025-06-03 15:15:31+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, DeepL

Categories: cs.CV

Astrophotography turbulence mitigation via generative models

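Posted: June 4, 2025 | Author: jarxiv

Summary

Photography is a cornerstone of modern astronomy and space research. However, most astronomical images captured by ground-based telescopes suffer from atmospheric turbulence, which degrades image quality. Multi-frame strategies such as lucky imaging can mitigate some of these effects, but they require intensive data acquisition and complex manual processing. In this paper, we propose AstroDiff, a generative restoration method that leverages both the high-quality generative priors and the restoration capabilities of diffusion models to mitigate atmospheric turbulence. Extensive experiments demonstrate that AstroDiff outperforms existing state-of-the-art learning-based methods for turbulence mitigation in astronomical images, providing higher perceptual quality and better structural fidelity under severe turbulence. Our code and additional results are available at https://web-six-kappa-66.vercel.app/.

Summary (original)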

Photography is the cornerstone of modern astronomical and space research. However, most astronomical images captured by ground-based telescopes suffer from atmospheric turbulence, resulting in degraded imaging quality. While multi-frame strategies like lucky imaging can mitigate some effects, they involve intensive data acquisition and complex manual processing. In this paper, we propose AstroDiff, a generative restoration method that leverages both the high-quality generative priors and restoration capabilities of diffusion models to mitigate atmospheric turbulence. Extensive experiments demonstrate that AstroDiff outperforms existing state-of-the-art learning-based methods in astronomical image turbulence mitigation, providing higher perceptual quality and better structural fidelity under severe turbulence conditions. Our code and additional results are available at https://web-six-kappa-66.vercel.app/
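For context on the multi-frame baseline the abstract mentions, lucky imaging can be sketched as "score every short exposure for sharpness, keep the best few, and average them". The snippet below is only an illustrative sketch of that idea, not AstroDiff and not any specific astronomy pipeline: the gradient-energy sharpness score, the function names, and the toy 2x2 frames are all assumptions, and real pipelines also register (align) frames before stacking, which is omitted here.

```python
def sharpness(frame):
    """Gradient energy: sum of squared differences between neighboring pixels."""
    height, width = len(frame), len(frame[0])
    score = 0.0
    for y in range(height):
        for x in range(width):
            if x + 1 < width:
                score += (frame[y][x + 1] - frame[y][x]) ** 2
            if y + 1 < height:
                score += (frame[y + 1][x] - frame[y][x]) ** 2
    return score


def lucky_stack(frames, keep):
    """Average the `keep` sharpest frames pixel by pixel (no alignment step)."""
    if keep < 1 or keep > len(frames):
        raise ValueError("keep must be between 1 and len(frames)")
    best = sorted(frames, key=sharpness, reverse=True)[:keep]
    height, width = len(best[0]), len(best[0][0])
    return [
        [sum(f[y][x] for f in best) / len(best) for x in range(width)]
        for y in range(height)
    ]


# Toy data (hypothetical): one crisp frame and one frame blurred flat by turbulence.
sharp_frame = [[0.0, 1.0], [1.0, 0.0]]
blurry_frame = [[0.5, 0.5], [0.5, 0.5]]
stacked = lucky_stack([blurry_frame, sharp_frame], keep=1)
print(stacked)
```

Selecting frames this way needs many exposures to find a few good ones, which is the acquisition cost the abstract points to as motivation for a single-image generative approach.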

arXiv information

Authors	Joonyeoup Kim,Yu Yuan,Xingguang Zhang,Xijun Wang,Stanley Chan
Published	2025-06-03 15:18:48+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, DeepL

Categories: cs.CV, eess.IV

Learning on Model Weights using Tree Experts

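Posted: June 4, 2025 | Author: jarxiv

Summary

The number of publicly available models is growing rapidly, yet most are undocumented. Users looking for a model suited to their task must first determine what each model does. Training machine learning models to infer the missing documentation directly from model weights is difficult. Here, we identify a key property of real-world models: most public models belong to a small set of Model Trees, where all models within a tree are fine-tuned from a common ancestor (e.g., a foundation model). Importantly, within each tree there is less nuisance variation between models. Concretely, while learning across Model Trees requires complex architectures, even a linear classifier trained on a single model layer often works within a tree. Although effective, such linear classifiers are computationally expensive, especially for large models with many parameters. To address this, we introduce Probing Experts (ProbeX), a theoretically motivated, lightweight method. ProbeX is the first probing method designed specifically to learn from the weights of a single hidden model layer. We demonstrate its effectiveness by predicting the categories in a model's training dataset from its weights alone. Interestingly, ProbeX can map the weights of Stable Diffusion into a weight-language embedding space, enabling model search via text, i.e., zero-shot model classification.

Summary (original)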

The number of publicly available models is rapidly increasing, yet most remain undocumented. Users looking for suitable models for their tasks must first determine what each model does. Training machine learning models to infer missing documentation directly from model weights is challenging, as these weights often contain significant variation unrelated to model functionality (denoted nuisance). Here, we identify a key property of real-world models: most public models belong to a small set of Model Trees, where all models within a tree are fine-tuned from a common ancestor (e.g., a foundation model). Importantly, we find that within each tree there is less nuisance variation between models. Concretely, while learning across Model Trees requires complex architectures, even a linear classifier trained on a single model layer often works within trees. While effective, these linear classifiers are computationally expensive, especially when dealing with larger models that have many parameters. To address this, we introduce Probing Experts (ProbeX), a theoretically motivated and lightweight method. Notably, ProbeX is the first probing method specifically designed to learn from the weights of a single hidden model layer. We demonstrate the effectiveness of ProbeX by predicting the categories in a model’s training dataset based only on its weights. Excitingly, ProbeX can map the weights of Stable Diffusion into a weight-language embedding space, enabling model search via text, i.e., zero-shot model classification.
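The last sentence of the abstract describes model search via text in a shared weight-language embedding space. A minimal sketch of what such a lookup could look like is given below, assuming the embeddings already exist: the three-dimensional vectors, the label names, and the function `zero_shot_classify` are hypothetical placeholders, and the code does not reproduce how ProbeX actually produces its embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(weight_embedding, text_embeddings):
    """Return the text label whose embedding is closest to the model's embedding."""
    scores = {
        label: cosine(weight_embedding, emb)
        for label, emb in text_embeddings.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Hypothetical text embeddings for candidate training-set categories.
text_embeddings = {
    "dog": [1.0, 0.0, 0.0],
    "car": [0.0, 1.0, 0.0],
    "flower": [0.0, 0.0, 1.0],
}
# Hypothetical embedding obtained from one fine-tuned model's weights.
model_embedding = [0.2, 0.9, 0.1]

best, scores = zero_shot_classify(model_embedding, text_embeddings)
print(best, scores)
```

Because classification reduces to a nearest-neighbor lookup, new categories can be queried by adding a text embedding, without retraining a classifier for each label.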

arXiv information

Authors	Eliahu Horwitz,Bar Cavia,Jonathan Kahana,Yedid Hoshen
Published	2025-06-03 15:42:42+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, DeepL

Categories: cs.CV, cs.LG

PartComposer: Learning and Composing Part-Level Concepts from Single-Image Examples

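Posted: June 4, 2025 | Author: jarxiv

Summary

We present PartComposer, a framework for learning part-level concepts from single-image examples that enables text-to-image diffusion models to compose novel objects from meaningful components. Existing methods either struggle to learn fine-grained concepts effectively or require large datasets as input. To address one-shot data scarcity, we propose a dynamic data synthesis pipeline that generates diverse part compositions. Most importantly, we propose maximizing the mutual information between denoised latents and structured concept codes via a concept predictor, which enables direct regulation of concept disentanglement and supervision of re-composition. Our method achieves strong disentanglement and controllable composition, outperforming subject-level and part-level baselines when mixing concepts from the same or different object categories.

Summary (original)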

We present PartComposer: a framework for part-level concept learning from single-image examples that enables text-to-image diffusion models to compose novel objects from meaningful components. Existing methods either struggle with effectively learning fine-grained concepts or require a large dataset as input. We propose a dynamic data synthesis pipeline generating diverse part compositions to address one-shot data scarcity. Most importantly, we propose to maximize the mutual information between denoised latents and structured concept codes via a concept predictor, enabling direct regulation on concept disentanglement and re-composition supervision. Our method achieves strong disentanglement and controllable composition, outperforming subject and part-level baselines when mixing concepts from the same, or different, object categories.

arXiv information

Authors	Junyu Liu,R. Kenny Jones,Daniel Ritchie
Published	2025-06-03 15:43:28+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, DeepL

Categories: cs.CV, cs.GR

DFBench: Benchmarking Deepfake Image Detection Capability of Large Multimodal Models

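Posted: June 4, 2025 | Author: jarxiv

Summary

With the rapid advancement of generative models, the realism of AI-generated images has improved markedly, posing critical challenges for verifying the authenticity of digital content. Current deepfake detection methods often rely on datasets with limited diversity of generative models and content, and fail to keep pace with the evolving complexity and realism of AI-generated content. Large multimodal models (LMMs), widely adopted across vision tasks, have demonstrated strong zero-shot capabilities, yet their potential for deepfake detection remains largely unexplored. To bridge this gap, we present DFBench, a large-scale DeepFake Benchmark featuring (i) broad diversity, with 540,000 images spanning real, AI-edited, and AI-generated content; (ii) up-to-date models, with fake images generated by 12 state-of-the-art generative models; and (iii) bidirectional benchmarking that evaluates both the detection accuracy of deepfake detectors and the evasion capability of generative models. Building on DFBench, we propose MoA-DF, a Mixture of Agents for DeepFake detection that leverages a combined probability strategy across multiple LMMs. MoA-DF achieves state-of-the-art performance, further demonstrating the effectiveness of LMMs for deepfake detection. The database and code are publicly available at https://github.com/IntMeGroup/DFBench.

Summary (original)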

With the rapid advancement of generative models, the realism of AI-generated images has significantly improved, posing critical challenges for verifying digital content authenticity. Current deepfake detection methods often depend on datasets with limited generation models and content diversity that fail to keep pace with the evolving complexity and increasing realism of the AI-generated content. Large multimodal models (LMMs), widely adopted in various vision tasks, have demonstrated strong zero-shot capabilities, yet their potential in deepfake detection remains largely unexplored. To bridge this gap, we present DFBench, a large-scale DeepFake Benchmark featuring (i) broad diversity, including 540,000 images across real, AI-edited, and AI-generated content, (ii) latest model, the fake images are generated by 12 state-of-the-art generation models, and (iii) bidirectional benchmarking and evaluating for both the detection accuracy of deepfake detectors and the evasion capability of generative models. Based on DFBench, we propose MoA-DF, Mixture of Agents for DeepFake detection, leveraging a combined probability strategy from multiple LMMs. MoA-DF achieves state-of-the-art performance, further proving the effectiveness of leveraging LMMs for deepfake detection. Database and codes are publicly available at https://github.com/IntMeGroup/DFBench.
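The abstract says MoA-DF uses "a combined probability strategy from multiple LMMs" but does not spell out the rule. One plausible reading, sketched here purely as an assumption, is a weighted mean of each model's estimated probability that an image is fake, followed by a threshold. The function names, the 0.5 threshold, and the example probabilities are all hypothetical and are not taken from the DFBench repository.

```python
def combine_probabilities(agent_probs, weights=None):
    """Weighted mean of per-agent P(fake) estimates (assumed combination rule)."""
    if not agent_probs:
        raise ValueError("need at least one agent probability")
    if weights is None:
        weights = [1.0] * len(agent_probs)
    if len(weights) != len(agent_probs):
        raise ValueError("weights and agent_probs must have the same length")
    total_weight = sum(weights)
    return sum(w * p for w, p in zip(weights, agent_probs)) / total_weight


def classify(agent_probs, threshold=0.5, weights=None):
    """Label an image 'fake' when the combined probability reaches the threshold."""
    p_fake = combine_probabilities(agent_probs, weights)
    label = "fake" if p_fake >= threshold else "real"
    return label, p_fake


# Hypothetical P(fake) outputs from three LMM agents for one image.
label, p_fake = classify([0.9, 0.6, 0.3])
print(label, round(p_fake, 3))

# Trusting the second agent three times as much flips a borderline case.
weighted_label, weighted_p = classify([0.9, 0.1], weights=[1.0, 3.0])
print(weighted_label, round(weighted_p, 3))
```

Averaging probabilities rather than hard votes keeps each agent's confidence in the decision, so one very confident model can outweigh two marginal ones.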

arXiv information

Authors	Jiarui Wang,Huiyu Duan,Juntong Wang,Ziheng Jia,Woo Yi Yang,Xiaorong Zhu,Yu Zhao,Jiaying Qian,Yuke Xing,Guangtao Zhai,Xiongkuo Min
Published	2025-06-03 15:45:41+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, DeepL

Categories: cs.CV

Smartflow: Enabling Scalable Spatiotemporal Geospatial Research

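Posted: June 4, 2025 | Author: jarxiv

Summary

BlackSky introduces Smartflow, a cloud-based framework built on open-source tools and technologies that enables scalable spatiotemporal geospatial research. Using STAC-compliant catalogs as a common input, heterogeneous geospatial data can be processed into standardized datacubes for analysis and model training. Model experimentation is managed with a combination of tools, including ClearML, Tensorboard, and Apache Superset. Kubernetes underpins Smartflow, orchestrating the provisioning and execution of workflows and supporting both horizontal and vertical scalability. This combination of features makes Smartflow well suited to geospatial model development and analysis over large geographic areas, time scales, and expansive image archives. We also present a novel neural architecture, built using Smartflow, for monitoring large geographic areas for heavy construction. Qualitative results based on data from the IARPA Space-based Machine Automated Recognition Technique (SMART) program show that the model can detect heavy construction throughout all major phases of development.

Summary (original)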

BlackSky introduces Smartflow, a cloud-based framework enabling scalable spatiotemporal geospatial research built on open-source tools and technologies. Using STAC-compliant catalogs as a common input, heterogeneous geospatial data can be processed into standardized datacubes for analysis and model training. Model experimentation is managed using a combination of tools, including ClearML, Tensorboard, and Apache Superset. Underpinning Smartflow is Kubernetes, which orchestrates the provisioning and execution of workflows to support both horizontal and vertical scalability. This combination of features makes Smartflow well-suited for geospatial model development and analysis over large geographic areas, time scales, and expansive image archives. We also present a novel neural architecture, built using Smartflow, to monitor large geographic areas for heavy construction. Qualitative results based on data from the IARPA Space-based Machine Automated Recognition Technique (SMART) program are presented that show the model is capable of detecting heavy construction throughout all major phases of development.

arXiv information

Authors	David McVicar,Brian Avant,Adrian Gould,Diego Torrejon,Charles Della Porta,Ryan Mukherjee
Published	2025-06-03 15:58:52+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, DeepL

Categories: cs.AI, cs.CV

We Should Chart an Atlas of All the World’s Models

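Posted: June 4, 2025 | Author: jarxiv

Summary

Public model repositories now contain millions of models, yet most remain undocumented and are effectively lost. In this position paper, we advocate charting the world's model population in a unified structure we call the Model Atlas. The Model Atlas enables applications in model forensics, meta-ML research, and model discovery. However, because most models lack documentation, large regions of the atlas remain uncharted. Addressing this gap motivates new machine learning methods that treat models themselves as data, inferring properties such as functionality, performance, and lineage directly from their weights. We argue that a scalable path forward is to bypass the unique parameter symmetries that plague model weights. Charting all the world's models will require a community effort.

Summary (original)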

Public model repositories now contain millions of models, yet most models remain undocumented and effectively lost. In this position paper, we advocate for charting the world’s model population in a unified structure we call the Model Atlas: a graph that captures models, their attributes, and the weight transformations that connect them. The Model Atlas enables applications in model forensics, meta-ML research, and model discovery, challenging tasks given today’s unstructured model repositories. However, because most models lack documentation, large atlas regions remain uncharted. Addressing this gap motivates new machine learning methods that treat models themselves as data, inferring properties such as functionality, performance, and lineage directly from their weights. We argue that a scalable path forward is to bypass the unique parameter symmetries that plague model weights. Charting all the world’s models will require a community effort, and we hope its broad utility will rally researchers toward this goal.

arXiv information

Authors	Eliahu Horwitz,Nitzan Kurer,Jonathan Kahana,Liel Amar,Yedid Hoshen
Published	2025-06-03 16:28:07+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, DeepL

Categories: cs.CL, cs.CV, cs.LG

Adversarial Robustness of AI-Generated Image Detectors in the Real World

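Posted: June 4, 2025 | Author: jarxiv

Summary

The rapid advancement of Generative Artificial Intelligence (GenAI) capabilities is accompanied by a rise in its misuse. In particular, generating credible misinformation in the form of images poses a significant threat to public trust in democratic processes. Consequently, there is an urgent need for tools that reliably distinguish authentic content from AI-generated content. Most detection methods are based on neural networks trained to recognize forensic artifacts. In this work, we demonstrate that current state-of-the-art classifiers are vulnerable to adversarial examples under real-world conditions. Through extensive experiments covering four detection methods and five attack algorithms, we show that an attacker can dramatically reduce classification performance without internal knowledge of the detector's architecture. Notably, most attacks remain effective even when images are degraded, for example during upload to social media platforms. In a case study, we demonstrate that these robustness challenges also affect commercial tools by conducting black-box attacks on HIVE, a proprietary online GenAI media detector. In addition, we evaluate robustness when using features generated by a robust pre-trained model and show that this improves robustness, although performance does not reach that on benign inputs. These results, together with GenAI's growing potential to erode public trust, underscore the need for further research and new perspectives on methods to prevent its misuse.

Summary (original)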

The rapid advancement of Generative Artificial Intelligence (GenAI) capabilities is accompanied by a concerning rise in its misuse. In particular, the generation of credible misinformation in the form of images poses a significant threat to the public trust in democratic processes. Consequently, there is an urgent need to develop tools to reliably distinguish between authentic and AI-generated content. The majority of detection methods are based on neural networks that are trained to recognize forensic artifacts. In this work, we demonstrate that current state-of-the-art classifiers are vulnerable to adversarial examples under real-world conditions. Through extensive experiments, comprising four detection methods and five attack algorithms, we show that an attacker can dramatically decrease classification performance, without internal knowledge of the detector’s architecture. Notably, most attacks remain effective even when images are degraded during the upload to, e.g., social media platforms. In a case study, we demonstrate that these robustness challenges are also found in commercial tools by conducting black-box attacks on HIVE, a proprietary online GenAI media detector. In addition, we evaluate the robustness of using generated features of a robust pre-trained model and show that this increases the robustness, while not reaching the performance on benign inputs. These results, along with the increasing potential of GenAI to erode public trust, underscore the need for more research and new perspectives on methods to prevent its misuse.

arXiv information

Authors	Sina Mavali,Jonas Ricker,David Pape,Asja Fischer,Lea Schönherr
Published	2025-06-03 16:40:40+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, DeepL

Categories: cs.CV, cs.LG

Sparse-vDiT: Unleashing the Power of Sparse Attention to Accelerate Video Diffusion Transformers

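Posted: June 4, 2025 | Author: jarxiv

Summary

Diffusion Transformers (DiTs) have achieved breakthroughs in video generation, but this long-sequence generation task remains constrained by the quadratic complexity of attention mechanisms, resulting in high inference latency. Through detailed analysis of attention maps in the Video Diffusion Transformer (vDiT), we identify three recurring sparsity patterns. In addition, 3-6% of attention heads can be skipped entirely. Crucially, these patterns correlate strongly with layer depth and head position but depend only weakly on the input content. Leveraging these findings, we propose Sparse-vDiT, a sparsity acceleration framework for vDiT comprising: 1) pattern-optimized sparse kernels that replace dense attention with computationally efficient implementations for each identified sparsity pattern; and 2) an offline sparse diffusion search algorithm that selects the optimal sparse computation strategy for each layer and head via hardware-aware cost modeling. After determining the optimal configuration, we fuse heads within the same layer that share the same attention strategy, improving inference efficiency. Integrated into state-of-the-art vDiT models (CogVideoX1.5, HunyuanVideo, and Wan2.1), Sparse-vDiT achieves theoretical FLOP reductions of 2.09×, 2.38×, and 1.67× and actual inference speedups of 1.76×, 1.85×, and 1.58×, respectively, while maintaining high visual fidelity, with PSNR values of 24.13, 27.09, and 22.59. Our work shows that the latent structural sparsity in vDiTs can be exploited systematically for long video synthesis.

Summary (original)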

While Diffusion Transformers (DiTs) have achieved breakthroughs in video generation, this long sequence generation task remains constrained by the quadratic complexity of attention mechanisms, resulting in significant inference latency. Through detailed analysis of attention maps in Video Diffusion Transformer (vDiT), we identify three recurring sparsity patterns: diagonal, multi-diagonal, and vertical-stripe structures. Moreover, 3-6% of attention heads can be skipped. Crucially, these patterns exhibit strong layer-depth and head-position correlations but show limited dependence on the input content. Leveraging these findings, we propose Sparse-vDiT, a sparsity acceleration framework for vDiT comprising: 1) Pattern-optimized sparse kernels that replace dense attention with computationally efficient implementations for each identified sparsity pattern. 2) An offline sparse diffusion search algorithm that selects the optimal sparse computation strategy per layer and head via hardware-aware cost modeling. After determining the optimal configuration, we fuse heads within the same layer that share the same attention strategy, enhancing inference efficiency. Integrated into state-of-the-art vDiT models (CogVideoX1.5, HunyuanVideo, and Wan2.1), Sparse-vDiT achieves 2.09×, 2.38×, and 1.67× theoretical FLOP reduction, and actual inference speedups of 1.76×, 1.85×, and 1.58×, respectively, while maintaining high visual fidelity, with PSNR values reaching 24.13, 27.09, and 22.59. Our work demonstrates that latent structural sparsity in vDiTs can be systematically exploited for long video synthesis.
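To make the three named sparsity patterns concrete, the sketch below builds a boolean attention mask for each one and reports what fraction of query-key pairs would still be computed. It is a toy illustration under stated assumptions: the band width, diagonal stride, stripe columns, and sequence length of 8 are invented parameters, and the paper's actual pattern definitions and kernels may differ.

```python
def diagonal_mask(n, width):
    """Keep query i -> key j only within a band of `width` around the main diagonal."""
    return [[abs(i - j) <= width for j in range(n)] for i in range(n)]


def multi_diagonal_mask(n, stride, width):
    """Keep bands around every diagonal whose offset is a multiple of `stride`."""
    def near_multiple(offset):
        remainder = offset % stride
        return min(remainder, stride - remainder) <= width
    return [[near_multiple(abs(i - j)) for j in range(n)] for i in range(n)]


def vertical_stripe_mask(n, columns):
    """Keep the same fixed set of key columns for every query row."""
    kept = set(columns)
    return [[j in kept for j in range(n)] for _ in range(n)]


def density(mask):
    """Fraction of attention entries that are still computed."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total


n = 8  # toy sequence length (assumption)
masks = {
    "diagonal": diagonal_mask(n, width=1),
    "multi_diagonal": multi_diagonal_mask(n, stride=4, width=0),
    "vertical_stripe": vertical_stripe_mask(n, columns=[0, 3]),
}
densities = {name: density(mask) for name, mask in masks.items()}
for name, value in densities.items():
    # 1 / density is the ideal FLOP reduction for this head if only kept entries are computed.
    print(f"{name}: density={value:.3f}, ideal FLOP reduction={1 / value:.2f}x")
```

The reciprocal of the density is only an upper bound for a single head. The abstract's gap between theoretical FLOP reduction (for example 2.09×) and measured speedup (1.76×) reflects overheads that a count of masked entries does not capture.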

arXiv information

Authors	Pengtao Chen,Xianfang Zeng,Maosen Zhao,Peng Ye,Mingzhu Shen,Wei Cheng,Gang Yu,Tao Chen
Published	2025-06-03 16:42:37+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, DeepL

Categories: cs.AI, cs.CV, cs.LG
