jarxiv | Japanese arxiv

Semi-Automated Quality Assurance in Digital Pathology: Tile Classification Approach

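Posted: June 13, 2025 | Author: jarxiv

Summary

Quality assurance is a critical but underexplored area of digital pathology, where even minor artifacts can have significant effects.
Artifacts have been shown to degrade the performance of AI diagnostic models.
In current practice, trained staff manually review digitized slide images before releasing them to pathologists, who then use them to render a diagnosis.
Conventional image processing approaches provide a foundation for detecting artifacts on digital pathology slides.
However, current tools do not leverage deep learning, which has the potential to improve detection accuracy and scalability.
Despite these advancements, methods for quality assurance in digital pathology remain limited, leaving a gap for innovation.
We propose an AI algorithm designed to screen digital pathology slides by analyzing tiles and classifying each one as one of 10 predefined artifact types or as background.
The algorithm identifies and localizes artifacts, creating a map that highlights regions of interest.
By directing human operators to the specific tiles affected by artifacts, the algorithm minimizes the time and effort required to manually review entire slides for quality issues.
From internal archives and The Cancer Genome Atlas, 133 whole slide images were selected, and 10 artifacts were annotated using ZAPP, an internally developed software tool (Mayo Clinic, Jacksonville, FL).
An ablation study of multiple models at different tile sizes and magnifications was performed.
InceptionResNet was selected.
Single-artifact models were trained and tested, followed by a limited multiple-instance model covering artifacts that performed well together (chatter, fold, and pen).
Based on the results of this study, we suggest a hybrid design for artifact screening composed of both single-artifact binary models and multiple-instance models, to optimize detection of each artifact.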
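
The tile-screening idea described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `classify_tile` is a hypothetical brightness rule standing in for the trained InceptionResNet, the label list is truncated to the three artifact names the abstract mentions, and the "slide" is a synthetic array rather than a whole slide image.

```python
import numpy as np

# Hypothetical label set: index 0 is background. The abstract names only
# chatter, fold, and pen among its 10 artifact types.
LABELS = ["background", "chatter", "fold", "pen"]

def classify_tile(tile: np.ndarray) -> int:
    """Placeholder classifier (an assumption, not the paper's model).

    Flags very dark tiles as 'pen' and everything else as background.
    """
    return LABELS.index("pen") if tile.mean() < 50 else LABELS.index("background")

def artifact_map(slide: np.ndarray, tile_size: int) -> np.ndarray:
    """Split a slide into non-overlapping tiles and return a grid of label indices."""
    rows = slide.shape[0] // tile_size
    cols = slide.shape[1] // tile_size
    grid = np.zeros((rows, cols), dtype=int)
    for r in range(rows):
        for c in range(cols):
            tile = slide[r * tile_size:(r + 1) * tile_size,
                         c * tile_size:(c + 1) * tile_size]
            grid[r, c] = classify_tile(tile)
    return grid

# Synthetic 4x4-tile "slide": bright everywhere except one dark tile.
slide = np.full((256, 256), 200, dtype=np.uint8)
slide[64:128, 128:192] = 10
grid = artifact_map(slide, tile_size=64)

# Grid coordinates (row, column) of tiles a reviewer should look at.
flagged = [(int(r), int(c)) for r, c in zip(*np.nonzero(grid))]
```

The `flagged` list plays the role of the map that directs a human operator to specific tiles instead of the whole slide.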

Summary (original)

Quality assurance is a critical but underexplored area in digital pathology, where even minor artifacts can have significant effects. Artifacts have been shown to negatively impact the performance of AI diagnostic models. In current practice, trained staff manually review digitized images prior to release of these slides to pathologists which are then used to render a diagnosis. Conventional image processing approaches, provide a foundation for detecting artifacts on digital pathology slides. However, current tools do not leverage deep learning, which has the potential to improve detection accuracy and scalability. Despite these advancements, methods for quality assurance in digital pathology remain limited, presenting a gap for innovation. We propose an AI algorithm designed to screen digital pathology slides by analyzing tiles and categorizing them into one of 10 predefined artifact types or as background. This algorithm identifies and localizes artifacts, creating a map that highlights regions of interest. By directing human operators to specific tiles affected by artifacts, the algorithm minimizes the time and effort required to manually review entire slides for quality issues. From internal archives and The Cancer Genome Atlas, 133 whole slide images were selected and 10 artifacts were annotated using an internally developed software ZAPP (Mayo Clinic, Jacksonville, FL). Ablation study of multiple models at different tile sizes and magnification was performed. InceptionResNet was selected. Single artifact models were trained and tested, followed by a limited multiple instance model with artifacts that performed well together (chatter, fold, and pen). From the results of this study we suggest a hybrid design for artifact screening composed of both single artifact binary models as well as multiple instance models to optimize detection of each artifact.

arXiv information

Authors	Meredith VandeHaar, M. Clinch, I. Yilmaz, M. A. Rahman, Y. Xiao, F. Dogany, H. M. Alazab, A. Nassar, Z. Akkus, B. Dangott
Published	2025-06-12 17:30:34+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google

Categories: cs.CV, eess.IV

Visually Descriptive Language Model for Vector Graphics Reasoning

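Posted: June 13, 2025 | Author: jarxiv

Summary

Despite significant advances, large multimodal models (LMMs) still struggle to bridge the gap between low-level visual perception (shapes, sizes, and layouts) and high-level language reasoning such as semantics and logic.
This limitation is evident in tasks that require precise visual perception, such as comparing geometric properties or solving visual reasoning problems.
To study this failure mode, we focus on vector graphics: images composed of 2D objects and shapes, which are prevalent in LMM-based tasks in web, design, and OS environments.
We identify two key research questions: how can we enable precise visual perception, and how can we facilitate high-level reasoning based on such low-level perception?
To capture fine visual details, we use Scalable Vector Graphics (SVG) for accurate encoding of visual scenes.
However, SVGs are not readily interpretable by LMMs in a zero-shot manner.
To tackle this, we propose the Visually Descriptive Language Model (VDLM), which introduces a Primal Visual Description (PVD) as an intermediate textual representation.
PVD translates SVGs into a text-based abstraction consisting of primitive attributes (e.g., shape, position, measurement) and their corresponding values.
PVD can be learned from task-agnostic synthesized data and represents visual primitives that are universal across vector graphics.
This abstraction is more structured, allowing direct interpretation by foundation models for zero-shot generalization.
Without human-annotated data, empirical results show that VDLM significantly improves state-of-the-art LMMs such as GPT-4o on a variety of multimodal perception and reasoning tasks.
Extensive analyses of VDLM show improved interpretability thanks to its disentangled perception and reasoning.
We also demonstrate a positive correlation between PVD quality and task performance.
Project page: https://mikewangwzhl.github.io/VDLM/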
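
To make the idea of a text-based abstraction of primitive attributes concrete, here is a minimal sketch. It is an assumption-laden stand-in: the paper learns the SVG-to-PVD mapping from synthesized data, whereas this example simply parses a clean, hand-written SVG with rules, and the record fields (`type`, `center`, `top_left`, and so on) are hypothetical names rather than the PVD schema.

```python
import json
import xml.etree.ElementTree as ET

SVG = """<svg xmlns="http://www.w3.org/2000/svg" width="100" height="100">
  <circle cx="30" cy="40" r="10" fill="red"/>
  <rect x="50" y="50" width="20" height="30" fill="blue"/>
</svg>"""

def describe(svg_text: str) -> list[dict]:
    """Turn simple SVG elements into primitive attribute/value records."""
    records = []
    for el in ET.fromstring(svg_text).iter():
        tag = el.tag.split("}")[-1]  # drop the XML namespace prefix
        if tag == "circle":
            records.append({"type": "circle",
                            "center": [float(el.get("cx")), float(el.get("cy"))],
                            "radius": float(el.get("r")),
                            "color": el.get("fill")})
        elif tag == "rect":
            records.append({"type": "rectangle",
                            "top_left": [float(el.get("x")), float(el.get("y"))],
                            "width": float(el.get("width")),
                            "height": float(el.get("height")),
                            "color": el.get("fill")})
    return records

pvd_like = describe(SVG)
# A structured string like this could be placed in a language model prompt.
prompt_text = json.dumps(pvd_like)
```

The point of the sketch is only that shape, position, and measurement become explicit text that a language model can reason over without reading raw SVG.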

Summary (original)

Despite significant advancements, large multimodal models (LMMs) still struggle to bridge the gap between low-level visual perception — focusing on shapes, sizes, and layouts — and high-level language reasoning, such as semantics and logic. This limitation is evident in tasks that require precise visual perception, like comparing geometric properties or solving visual reasoning problems. To study this failure mode, we focus on vector graphics — images composed of 2D objects and shapes, prevalent in LMM-based tasks in web, design, and OS environments. We identify two key research questions: how can we enable precise visual perception, and how can we facilitate high-level reasoning based on such low-level perceptions? To capture fine visual details, we use Scalable Vector Graphics (SVG) for accurate encoding of visual scenes. However, SVGs are not readily interpretable by LMMs in a zero-shot manner. To tackle this, we propose the Visually Descriptive Language Model (VDLM), which introduces a Primal Visual Description (PVD) as an intermediate textual representation. PVD translates SVGs into a text-based abstraction consisting of primitive attributes (e.g., shape, position, measurement) and their corresponding values. PVD can be learned using task-agnostic synthesized data and represents visual primitives that are universal across vector graphics. This abstraction is more structured, allowing for direct interpretation by foundation models for zero-shot generalization. Without human-annotated data, empirical results show that VDLM significantly improves state-of-the-art LMMs like GPT-4o on various multimodal perception and reasoning tasks. Extensive analyses of VDLM show improved interpretability due to its disentangled perception and reasoning. We also demonstrate a positive correlation between PVD quality and task performance. Project page: https://mikewangwzhl.github.io/VDLM/

arXiv information

Authors	Zhenhailong Wang, Joy Hsu, Xingyao Wang, Kuan-Hao Huang, Manling Li, Jiajun Wu, Heng Ji
Published	2025-06-12 17:46:36+00:00
arXiv site	arxiv_id(pdf)

Categories: cs.AI, cs.CL, cs.CV

VINCIE: Unlocking In-context Image Editing from Video

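Posted: June 13, 2025 | Author: jarxiv

Summary

In-context image editing aims to modify images based on a contextual sequence comprising text and previously generated images.
Existing methods typically depend on task-specific pipelines and expert models (e.g., segmentation and inpainting) to curate training data.
In this work, we explore whether an in-context image editing model can be learned directly from videos.
We introduce a scalable approach to annotate videos as interleaved multimodal sequences.
To learn effectively from this data, we design a block-causal diffusion transformer trained on three proxy tasks: next-image prediction, current segmentation prediction, and next-segmentation prediction.
In addition, we propose a novel multi-turn image editing benchmark to advance research in this area.
Extensive experiments demonstrate that our model exhibits strong in-context image editing capabilities and achieves state-of-the-art results on two multi-turn image editing benchmarks.
Despite being trained exclusively on videos, our model also shows promising abilities in multi-concept composition, story generation, and chain-of-editing applications.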
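
One common reading of "block-causal" attention is that tokens attend freely within their own block (for example, one image or one text segment) and causally to all earlier blocks. The abstract does not spell out the exact masking scheme, so the sketch below is an assumption, and `block_causal_mask` is a hypothetical helper name.

```python
import numpy as np

def block_causal_mask(block_sizes: list[int]) -> np.ndarray:
    """Boolean mask where entry [i, j] is True if token i may attend to token j.

    Tokens attend to every token in their own block and in all earlier blocks.
    """
    # Assign each token the index of the block it belongs to.
    block_ids = np.repeat(np.arange(len(block_sizes)), block_sizes)
    return block_ids[:, None] >= block_ids[None, :]

# Three blocks of 2, 3, and 2 tokens, e.g. text, image, text.
mask = block_causal_mask([2, 3, 2])
```

Compared with a token-level causal mask, this lets all tokens of one image see each other while still preventing any block from looking at later blocks.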

Summary (original)

In-context image editing aims to modify images based on a contextual sequence comprising text and previously generated images. Existing methods typically depend on task-specific pipelines and expert models (e.g., segmentation and inpainting) to curate training data. In this work, we explore whether an in-context image editing model can be learned directly from videos. We introduce a scalable approach to annotate videos as interleaved multimodal sequences. To effectively learn from this data, we design a block-causal diffusion transformer trained on three proxy tasks: next-image prediction, current segmentation prediction, and next-segmentation prediction. Additionally, we propose a novel multi-turn image editing benchmark to advance research in this area. Extensive experiments demonstrate that our model exhibits strong in-context image editing capabilities and achieves state-of-the-art results on two multi-turn image editing benchmarks. Despite being trained exclusively on videos, our model also shows promising abilities in multi-concept composition, story generation, and chain-of-editing applications.

arXiv information

Authors	Leigang Qu, Feng Cheng, Ziyan Yang, Qi Zhao, Shanchuan Lin, Yichun Shi, Yicong Li, Wenjie Wang, Tat-Seng Chua, Lu Jiang
Published	2025-06-12 17:46:54+00:00
arXiv site	arxiv_id(pdf)

Categories: cs.AI, cs.CL, cs.CV, cs.LG, cs.MM

CAT: A Conditional Adaptation Tailor for Efficient and Effective Instance-Specific Pansharpening on Real-World Data

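Posted: June 13, 2025 | Author: jarxiv

Summary

Pansharpening is a crucial remote sensing technique that fuses low-resolution multispectral (LRMS) images with high-resolution panchromatic (PAN) images to generate high-resolution multispectral (HRMS) imagery.
Although deep learning techniques have significantly advanced pansharpening, many existing methods suffer from limited cross-sensor generalization and high computational overhead, which restricts their real-time applications.
To address these challenges, we propose an efficient framework that quickly adapts to a specific input instance, completing both training and inference in a short time.
Our framework splits the input image into multiple patches, selects a subset for unsupervised CAT training, then performs inference on all patches and stitches them into the final output.
The CAT module, integrated between the feature extraction and channel transformation stages of a pre-trained network, tailors the fused features and fixes the parameters for efficient inference, generating improved results.
Our approach offers two key advantages. (1) $\textit{Improved Generalization Ability}$: by mitigating cross-sensor degradation, our model, although pre-trained on a specific dataset, achieves superior performance on datasets captured by other sensors.
(2) $\textit{Enhanced Computational Efficiency}$: the CAT-enhanced network can swiftly adapt to the test sample using a single LRMS-PAN input pair, without requiring extensive large-scale data retraining.
Experiments on real-world data from the WorldView-3 and WorldView-2 datasets demonstrate that our method achieves state-of-the-art performance on cross-sensor real-world data, while completing both training and inference for a $512\times512$ image within $\textit{0.4 seconds}$ and for a $4000\times4000$ image within $\textit{3 seconds}$ at the fastest setting on a commonly used RTX 3090 GPU.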
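
The split, subset-selection, and stitch steps of the framework can be illustrated with plain array operations. This sketch makes several assumptions: patches are non-overlapping, the subset is chosen uniformly at random (the abstract does not say how it is selected), and inference is replaced by an identity stand-in so that only the data flow is shown.

```python
import numpy as np

def split_patches(image: np.ndarray, patch: int) -> np.ndarray:
    """Cut an (H, W, C) image into non-overlapping tiles; H and W must be divisible by patch."""
    h, w, c = image.shape
    tiles = image.reshape(h // patch, patch, w // patch, patch, c)
    return tiles.transpose(0, 2, 1, 3, 4)  # (rows, cols, patch, patch, C)

def stitch_patches(tiles: np.ndarray) -> np.ndarray:
    """Inverse of split_patches: reassemble the tile grid into one image."""
    rows, cols, patch, _, c = tiles.shape
    return tiles.transpose(0, 2, 1, 3, 4).reshape(rows * patch, cols * patch, c)

rng = np.random.default_rng(0)
image = rng.random((8, 8, 4))          # toy 4-band image
tiles = split_patches(image, patch=4)  # 2 x 2 grid of 4 x 4 patches

# Pick a subset of patches for the (omitted) instance-specific training step.
flat = tiles.reshape(-1, 4, 4, 4)
subset = flat[rng.choice(len(flat), size=2, replace=False)]

processed = tiles                      # identity stand-in for network inference
restored = stitch_patches(processed)
```

Training on a subset while running inference on every patch is what keeps the per-image adaptation cost small in this design.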

Summary (original)

Pansharpening is a crucial remote sensing technique that fuses low-resolution multispectral (LRMS) images with high-resolution panchromatic (PAN) images to generate high-resolution multispectral (HRMS) imagery. Although deep learning techniques have significantly advanced pansharpening, many existing methods suffer from limited cross-sensor generalization and high computational overhead, restricting their real-time applications. To address these challenges, we propose an efficient framework that quickly adapts to a specific input instance, completing both training and inference in a short time. Our framework splits the input image into multiple patches, selects a subset for unsupervised CAT training, and then performs inference on all patches, stitching them into the final output. The CAT module, integrated between the feature extraction and channel transformation stages of a pre-trained network, tailors the fused features and fixes the parameters for efficient inference, generating improved results. Our approach offers two key advantages: (1) $\textit{Improved Generalization Ability}$: by mitigating cross-sensor degradation, our model–although pre-trained on a specific dataset–achieves superior performance on datasets captured by other sensors; (2) $\textit{Enhanced Computational Efficiency}$: the CAT-enhanced network can swiftly adapt to the test sample using the single LRMS-PAN pair input, without requiring extensive large-scale data retraining. Experiments on the real-world data from WorldView-3 and WorldView-2 datasets demonstrate that our method achieves state-of-the-art performance on cross-sensor real-world data, while achieving both training and inference of $512\times512$ image within $\textit{0.4 seconds}$ and $4000\times4000$ image within $\textit{3 seconds}$ at the fastest setting on a commonly used RTX 3090 GPU.

arXiv information

Authors	Tianyu Xin, Jin-Liang Xiao, Zeyu Xia, Shan Yin, Liang-Jian Deng
Published	2025-06-12 17:48:43+00:00
arXiv site	arxiv_id(pdf)

Categories: cs.CV

ReGuidance: A Simple Diffusion Wrapper for Boosting Sample Quality on Hard Inverse Problems

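Posted: June 13, 2025 | Author: jarxiv

Summary

There has been a flurry of activity around using pretrained diffusion models as informed data priors for solving inverse problems, and more generally around steering these models using reward models.
Training-free methods such as diffusion posterior sampling (DPS) and its many variants offer flexible heuristic algorithms for these tasks, but when the reward is not informative enough, for example in hard inverse problems with a low signal-to-noise ratio, these techniques veer off the data manifold and fail to produce realistic outputs.
In this work, we devise a simple wrapper, ReGuidance, that boosts both the sample realism and the reward achieved by these methods.
Given a candidate solution $\hat{x}$ produced by an algorithm of the user's choice, we propose inverting the solution by running the unconditional probability flow ODE in reverse starting from $\hat{x}$, and then using the resulting latent as an initialization for DPS.
We evaluate our wrapper on hard inverse problems such as large box in-painting and super-resolution with high upscaling.
Whereas state-of-the-art baselines visibly fail, we find that applying our wrapper on top of these baselines significantly boosts sample quality and measurement consistency.
We complement these findings with theory proving that, on certain multimodal data distributions, ReGuidance simultaneously boosts the reward and brings the candidate solution closer to the data manifold.
To our knowledge, this constitutes the first rigorous algorithmic guarantee for DPS.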
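
For reference, the inversion step can be written with the standard probability flow ODE from score-based generative modeling. The abstract does not state which diffusion parameterization is used, so the drift $f$, the diffusion coefficient $g$, and the time horizon $T$ below are generic assumptions: for a forward SDE $\mathrm{d}x = f(x,t)\,\mathrm{d}t + g(t)\,\mathrm{d}w$ with marginals $p_t$, the ODE is integrated from the candidate $\hat{x}$ at $t=0$ up to $t=T$, and the endpoint $z$ is the latent that initializes DPS.

```latex
\frac{\mathrm{d}x_t}{\mathrm{d}t}
  = f(x_t, t) - \frac{1}{2}\, g(t)^2\, \nabla_{x} \log p_t(x_t),
\qquad x_0 = \hat{x}, \qquad z = x_T .
```

In practice the score $\nabla_{x} \log p_t$ would be approximated by the pretrained unconditional diffusion model.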

Summary (original)

There has been a flurry of activity around using pretrained diffusion models as informed data priors for solving inverse problems, and more generally around steering these models using reward models. Training-free methods like diffusion posterior sampling (DPS) and its many variants have offered flexible heuristic algorithms for these tasks, but when the reward is not informative enough, e.g., in hard inverse problems with low signal-to-noise ratio, these techniques veer off the data manifold, failing to produce realistic outputs. In this work, we devise a simple wrapper, ReGuidance, for boosting both the sample realism and reward achieved by these methods. Given a candidate solution $\hat{x}$ produced by an algorithm of the user’s choice, we propose inverting the solution by running the unconditional probability flow ODE in reverse starting from $\hat{x}$, and then using the resulting latent as an initialization for DPS. We evaluate our wrapper on hard inverse problems like large box in-painting and super-resolution with high upscaling. Whereas state-of-the-art baselines visibly fail, we find that applying our wrapper on top of these baselines significantly boosts sample quality and measurement consistency. We complement these findings with theory proving that on certain multimodal data distributions, ReGuidance simultaneously boosts the reward and brings the candidate solution closer to the data manifold. To our knowledge, this constitutes the first rigorous algorithmic guarantee for DPS.

arXiv information

Authors	Aayush Karan, Kulin Shah, Sitan Chen
Published	2025-06-12 17:55:17+00:00
arXiv site	arxiv_id(pdf)

Categories: cs.AI, cs.CV, cs.LG

SpectralAR: Spectral Autoregressive Visual Generation

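Posted: June 13, 2025 | Author: jarxiv

Summary

Autoregressive visual generation has garnered increasing attention because of its scalability and its compatibility with other modalities compared with diffusion models.
Most existing methods construct visual sequences as spatial patches for autoregressive generation.
However, image patches are inherently parallel, which contradicts the causal nature of autoregressive modeling.
To address this, we propose a Spectral AutoRegressive (SpectralAR) visual generation framework, which realizes causality for visual sequences from the spectral perspective.
Specifically, we first transform an image into ordered spectral tokens with Nested Spectral Tokenization, representing lower to higher frequency components.
We then perform autoregressive generation in a coarse-to-fine manner over the sequences of spectral tokens.
By considering different levels of detail in images, SpectralAR achieves both sequence causality and token efficiency without bells and whistles.
We conduct extensive experiments on ImageNet-1K for image reconstruction and autoregressive generation, and SpectralAR achieves 3.02 gFID with only 64 tokens and 310M parameters.
Project page: https://huang-yh.github.io/spectralar/.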
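
The coarse-to-fine ordering from lower to higher frequencies can be illustrated with a plain Fourier transform. This is explicitly not the paper's Nested Spectral Tokenization, which is a learned tokenizer; the sketch only shows, under that simplification, why a low-to-high frequency ordering is naturally nested: each larger band contains the previous one and can only reduce the reconstruction error.

```python
import numpy as np

def nested_reconstructions(image: np.ndarray, radii: list[int]) -> list[np.ndarray]:
    """Reconstruct the image from nested low-pass bands of its 2D spectrum."""
    spectrum = np.fft.fftshift(np.fft.fft2(image))  # move the DC term to the center
    h, w = image.shape
    yy, xx = np.ogrid[:h, :w]
    dist = np.hypot(yy - h // 2, xx - w // 2)       # distance from the DC term
    outputs = []
    for radius in radii:
        kept = np.where(dist <= radius, spectrum, 0)  # keep frequencies up to radius
        outputs.append(np.fft.ifft2(np.fft.ifftshift(kept)).real)
    return outputs

rng = np.random.default_rng(0)
image = rng.random((32, 32))
recons = nested_reconstructions(image, radii=[2, 8, 32])

# Mean squared error shrinks as more high-frequency content is added.
errors = [float(np.mean((image - r) ** 2)) for r in recons]
```

With the largest radius every coefficient is kept, so the last reconstruction matches the input up to floating-point error.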

Summary (original)

Autoregressive visual generation has garnered increasing attention due to its scalability and compatibility with other modalities compared with diffusion models. Most existing methods construct visual sequences as spatial patches for autoregressive generation. However, image patches are inherently parallel, contradicting the causal nature of autoregressive modeling. To address this, we propose a Spectral AutoRegressive (SpectralAR) visual generation framework, which realizes causality for visual sequences from the spectral perspective. Specifically, we first transform an image into ordered spectral tokens with Nested Spectral Tokenization, representing lower to higher frequency components. We then perform autoregressive generation in a coarse-to-fine manner with the sequences of spectral tokens. By considering different levels of detail in images, our SpectralAR achieves both sequence causality and token efficiency without bells and whistles. We conduct extensive experiments on ImageNet-1K for image reconstruction and autoregressive generation, and SpectralAR achieves 3.02 gFID with only 64 tokens and 310M parameters. Project page: https://huang-yh.github.io/spectralar/.

arXiv information

Authors	Yuanhui Huang, Weiliang Chen, Wenzhao Zheng, Yueqi Duan, Jie Zhou, Jiwen Lu
Published	2025-06-12 17:57:44+00:00
arXiv site	arxiv_id(pdf)

Categories: cs.AI, cs.CV, cs.LG

MMMG: A Massive, Multidisciplinary, Multi-Tier Generation Benchmark for Text-to-Image Reasoning

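Posted: June 13, 2025 | Author: jarxiv

Summary

In this paper, we introduce knowledge image generation as a new task, alongside the Massive Multi-Discipline Multi-Tier Knowledge-Image Generation Benchmark (MMMG), to probe the reasoning capability of image generation models.
Knowledge images have been central to human civilization and to the mechanisms of human learning, a fact underscored by dual-coding theory and the picture-superiority effect.
Generating such images is challenging: it demands multimodal reasoning that fuses world knowledge with pixel-level grounding into clear explanatory visuals.
To enable comprehensive evaluation, MMMG offers 4,456 expert-validated (knowledge) image-prompt pairs spanning 10 disciplines, 6 educational levels, and diverse knowledge formats such as charts, diagrams, and mind maps.
To eliminate confounding complexity during evaluation, we adopt a unified Knowledge Graph (KG) representation.
Each KG explicitly delineates a target image's core entities and their dependencies.
We further introduce MMMG-Score to evaluate generated knowledge images.
This metric combines factual fidelity, measured by graph-edit distance between KGs, with an assessment of visual clarity.
Comprehensive evaluations of 16 state-of-the-art text-to-image generation models expose serious reasoning deficits (low entity fidelity, weak relations, and clutter), with GPT-4o achieving an MMMG-Score of only 50.20, underscoring the benchmark's difficulty.
To spur further progress, we release FLUX-Reason (MMMG-Score of 34.45), an effective and open baseline that combines a reasoning LLM with diffusion models and is trained on 16,000 curated knowledge image-prompt pairs.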
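
As a rough illustration of measuring factual fidelity by comparing knowledge graphs, the sketch below counts the insertions and deletions needed to turn one labeled graph into another. It is a simplification and not the MMMG-Score formula, which the abstract does not give: entities are matched by exact name, the normalization into a `fidelity` value is a hypothetical choice, and the visual clarity component is omitted entirely.

```python
def labelled_edit_distance(kg_a, kg_b):
    """Count entity and relation insertions/deletions needed to turn kg_a into kg_b.

    Each KG is a pair (entities, triples). Entities are matched by exact name,
    which is a simplification of general graph-edit distance.
    """
    entities_a, triples_a = kg_a
    entities_b, triples_b = kg_b
    # Symmetric difference = items present in exactly one of the two graphs.
    return len(entities_a ^ entities_b) + len(triples_a ^ triples_b)

# Toy reference KG and a KG extracted from a generated image (both made up).
reference = ({"sun", "earth", "moon"},
             {("earth", "orbits", "sun"), ("moon", "orbits", "earth")})
generated = ({"sun", "earth"},
             {("earth", "orbits", "sun")})

distance = labelled_edit_distance(reference, generated)

# Hypothetical normalization: 1.0 means identical graphs, 0.0 means nothing shared.
max_distance = (len(reference[0]) + len(reference[1])
                + len(generated[0]) + len(generated[1]))
fidelity = 1 - distance / max_distance
```

Here the generated graph is missing one entity and one relation, so the distance is 2 out of a maximum of 8.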

Summary (original)

In this paper, we introduce knowledge image generation as a new task, alongside the Massive Multi-Discipline Multi-Tier Knowledge-Image Generation Benchmark (MMMG) to probe the reasoning capability of image generation models. Knowledge images have been central to human civilization and to the mechanisms of human learning–a fact underscored by dual-coding theory and the picture-superiority effect. Generating such images is challenging, demanding multimodal reasoning that fuses world knowledge with pixel-level grounding into clear explanatory visuals. To enable comprehensive evaluation, MMMG offers 4,456 expert-validated (knowledge) image-prompt pairs spanning 10 disciplines, 6 educational levels, and diverse knowledge formats such as charts, diagrams, and mind maps. To eliminate confounding complexity during evaluation, we adopt a unified Knowledge Graph (KG) representation. Each KG explicitly delineates a target image’s core entities and their dependencies. We further introduce MMMG-Score to evaluate generated knowledge images. This metric combines factual fidelity, measured by graph-edit distance between KGs, with visual clarity assessment. Comprehensive evaluations of 16 state-of-the-art text-to-image generation models expose serious reasoning deficits–low entity fidelity, weak relations, and clutter–with GPT-4o achieving an MMMG-Score of only 50.20, underscoring the benchmark’s difficulty. To spur further progress, we release FLUX-Reason (MMMG-Score of 34.45), an effective and open baseline that combines a reasoning LLM with diffusion models and is trained on 16,000 curated knowledge image-prompt pairs.

arXiv information

Authors	Yuxuan Luo, Yuhui Yuan, Junwen Chen, Haonan Cai, Ziyi Yue, Yuwei Yang, Fatima Zohra Daha, Ji Li, Zhouhui Lian
Published	2025-06-12 17:58:09+00:00
arXiv site	arxiv_id(pdf)

Categories: cs.CL, cs.CV

Beyond Attention or Similarity: Maximizing Conditional Diversity for Token Pruning in MLLMs

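Posted: June 13, 2025 | Author: jarxiv

Summary

In multimodal large language models (MLLMs), the length of the input visual tokens is often significantly greater than that of their textual counterparts, leading to a high inference cost.
Many works aim to address this issue by removing redundant visual tokens.
However, current approaches either rely on attention-based pruning, which retains numerous duplicate tokens, or use similarity-based pruning, which overlooks instruction relevance; both lead to suboptimal performance.
In this paper, we go beyond attention or similarity by proposing a novel visual token pruning method named CDPruner, which maximizes the conditional diversity of the retained tokens.
We first define the conditional similarity between visual tokens conditioned on the instruction, and then reformulate the token pruning problem with a determinantal point process (DPP) to maximize the conditional diversity of the selected subset.
The proposed CDPruner is training-free and model-agnostic, allowing easy application to various MLLMs.
Extensive experiments across diverse MLLMs show that CDPruner establishes a new state of the art on various vision-language benchmarks.
By maximizing conditional diversity through DPP, the selected subset better represents the input images while closely adhering to user instructions, thereby preserving strong performance even at high reduction ratios.
When applied to LLaVA, CDPruner reduces FLOPs by 95% and CUDA latency by 78%, while maintaining 94% of the original accuracy.
Our code is available at https://github.com/Theia-4869/CDPruner.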
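
The idea of selecting tokens with a determinantal point process can be sketched with a naive greedy search. Everything specific here is an assumption for illustration: the kernel uses the usual quality-times-similarity DPP decomposition with made-up relevance scores, the greedy loop recomputes determinants directly rather than using a fast incremental update, and none of it is taken from the CDPruner code.

```python
import numpy as np

def greedy_dpp(kernel: np.ndarray, k: int) -> list[int]:
    """Greedily pick k indices that maximize log det of the kernel submatrix."""
    selected: list[int] = []
    for _ in range(k):
        best, best_score = -1, -np.inf
        for i in range(len(kernel)):
            if i in selected:
                continue
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(kernel[np.ix_(idx, idx)])
            score = logdet if sign > 0 else -np.inf
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return selected

# Toy visual tokens: 0 and 1 are near-duplicates, 2 is distinct, 3 is barely relevant.
tokens = np.array([[1.0, 0.0], [0.99, 0.14], [0.0, 1.0], [0.7, 0.7]])
tokens /= np.linalg.norm(tokens, axis=1, keepdims=True)
relevance = np.array([0.9, 0.85, 0.8, 0.1])  # hypothetical instruction relevance

# Kernel = relevance-weighted cosine similarity between tokens.
similarity = tokens @ tokens.T
kernel = relevance[:, None] * similarity * relevance[None, :]

kept = greedy_dpp(kernel, k=2)
top_by_relevance = sorted(np.argsort(-relevance)[:2].tolist())
```

Ranking by relevance alone keeps the two near-duplicate tokens (0 and 1), whereas the determinant-based selection keeps one of them plus the distinct token 2, which is the behavior the abstract attributes to maximizing conditional diversity.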

Summary (original)

In multimodal large language models (MLLMs), the length of input visual tokens is often significantly greater than that of their textual counterparts, leading to a high inference cost. Many works aim to address this issue by removing redundant visual tokens. However, current approaches either rely on attention-based pruning, which retains numerous duplicate tokens, or use similarity-based pruning, overlooking the instruction relevance, consequently causing suboptimal performance. In this paper, we go beyond attention or similarity by proposing a novel visual token pruning method named CDPruner, which maximizes the conditional diversity of retained tokens. We first define the conditional similarity between visual tokens conditioned on the instruction, and then reformulate the token pruning problem with determinantal point process (DPP) to maximize the conditional diversity of the selected subset. The proposed CDPruner is training-free and model-agnostic, allowing easy application to various MLLMs. Extensive experiments across diverse MLLMs show that CDPruner establishes new state-of-the-art on various vision-language benchmarks. By maximizing conditional diversity through DPP, the selected subset better represents the input images while closely adhering to user instructions, thereby preserving strong performance even with high reduction ratios. When applied to LLaVA, CDPruner reduces FLOPs by 95\% and CUDA latency by 78\%, while maintaining 94\% of the original accuracy. Our code is available at https://github.com/Theia-4869/CDPruner.

arXiv information

Authors	Qizhe Zhang, Mengzhen Liu, Lichen Li, Ming Lu, Yuan Zhang, Junwen Pan, Qi She, Shanghang Zhang
Published	2025-06-12 17:59:09+00:00
arXiv site	arxiv_id(pdf)

Categories: cs.AI, cs.CV

Eye, Robot: Learning to Look to Act with a BC-RL Perception-Action Loop

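Posted: June 13, 2025 | Author: jarxiv

Summary

Humans do not passively observe the visual world: we actively look in order to act.
Motivated by this principle, we introduce EyeRobot, a robotic system with gaze behavior that emerges from the need to complete real-world tasks.
We develop a mechanical eyeball that can freely rotate to observe its surroundings and train a gaze policy to control it using reinforcement learning.
We accomplish this by first collecting teleoperated demonstrations paired with a 360 camera.
This data is imported into a simulation environment that supports rendering arbitrary eyeball viewpoints, allowing episode rollouts of eye gaze on top of robot demonstrations.
We then introduce a BC-RL loop to train the hand and eye jointly: the hand (BC) agent is trained from rendered eye observations, and the eye (RL) agent is rewarded when the hand produces correct action predictions.
In this way, hand-eye coordination emerges as the eye looks towards regions that allow the hand to complete the task.
EyeRobot implements a foveal-inspired policy architecture that allows high resolution with a small compute budget, which we find also leads to more stable fixation as well as an improved ability to track objects and ignore distractors.
We evaluate EyeRobot on five panoramic workspace manipulation tasks requiring manipulation in an arc surrounding the robot arm.
Our experiments suggest that EyeRobot exhibits hand-eye coordination behaviors that effectively facilitate manipulation over large workspaces with a single camera.
See the project site for videos: https://www.eyerobot.net/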

Summary (original)

Humans do not passively observe the visual world — we actively look in order to act. Motivated by this principle, we introduce EyeRobot, a robotic system with gaze behavior that emerges from the need to complete real-world tasks. We develop a mechanical eyeball that can freely rotate to observe its surroundings and train a gaze policy to control it using reinforcement learning. We accomplish this by first collecting teleoperated demonstrations paired with a 360 camera. This data is imported into a simulation environment that supports rendering arbitrary eyeball viewpoints, allowing episode rollouts of eye gaze on top of robot demonstrations. We then introduce a BC-RL loop to train the hand and eye jointly: the hand (BC) agent is trained from rendered eye observations, and the eye (RL) agent is rewarded when the hand produces correct action predictions. In this way, hand-eye coordination emerges as the eye looks towards regions which allow the hand to complete the task. EyeRobot implements a foveal-inspired policy architecture allowing high resolution with a small compute budget, which we find also leads to the emergence of more stable fixation as well as improved ability to track objects and ignore distractors. We evaluate EyeRobot on five panoramic workspace manipulation tasks requiring manipulation in an arc surrounding the robot arm. Our experiments suggest EyeRobot exhibits hand-eye coordination behaviors which effectively facilitate manipulation over large workspaces with a single camera. See project site for videos: https://www.eyerobot.net/

arXiv information

Authors	Justin Kerr, Kush Hari, Ethan Weber, Chung Min Kim, Brent Yi, Tyler Bonnen, Ken Goldberg, Angjoo Kanazawa
Published	2025-06-12 17:59:11+00:00
arXiv site	arxiv_id(pdf)

Categories: cs.CV, cs.RO

GenWorld: Towards Detecting AI-generated Real-world Simulation Videos

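Posted: June 13, 2025 | Author: jarxiv

Summary

The flourishing of video generation technologies has endangered the credibility of real-world information and intensified the demand for AI-generated video detectors.
Despite some progress, the lack of high-quality real-world datasets hinders the development of trustworthy detectors.
In this paper, we propose GenWorld, a large-scale, high-quality, real-world simulation dataset for AI-generated video detection.
GenWorld has the following characteristics. (1) Real-world Simulation: GenWorld focuses on videos that replicate real-world scenarios, which have a significant impact because of their realism and potential influence.
(2) High Quality: GenWorld employs multiple state-of-the-art video generation models to provide realistic, high-quality forged videos.
(3) Cross-prompt Diversity: GenWorld includes videos generated by diverse generators from various prompt modalities (e.g., text, image, video), offering the potential to learn more generalizable forensic features.
We analyze existing methods and find that they fail to detect high-quality videos generated by world models (i.e., Cosmos), revealing the potential drawbacks of ignoring real-world clues.
To address this, we propose a simple yet effective model, SpannDetector, which leverages multi-view consistency as a strong criterion for real-world AI-generated video detection.
Experiments show that our method achieves superior results, highlighting a promising direction for explainable AI-generated video detection based on physical plausibility.
We believe that GenWorld will advance the field of AI-generated video detection.
Project page: https://chen-wl20.github.io/GenWorld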

Summary (original)

The flourishing of video generation technologies has endangered the credibility of real-world information and intensified the demand for AI-generated video detectors. Despite some progress, the lack of high-quality real-world datasets hinders the development of trustworthy detectors. In this paper, we propose GenWorld, a large-scale, high-quality, and real-world simulation dataset for AI-generated video detection. GenWorld features the following characteristics: (1) Real-world Simulation: GenWorld focuses on videos that replicate real-world scenarios, which have a significant impact due to their realism and potential influence; (2) High Quality: GenWorld employs multiple state-of-the-art video generation models to provide realistic and high-quality forged videos; (3) Cross-prompt Diversity: GenWorld includes videos generated from diverse generators and various prompt modalities (e.g., text, image, video), offering the potential to learn more generalizable forensic features. We analyze existing methods and find they fail to detect high-quality videos generated by world models (i.e., Cosmos), revealing potential drawbacks of ignoring real-world clues. To address this, we propose a simple yet effective model, SpannDetector, to leverage multi-view consistency as a strong criterion for real-world AI-generated video detection. Experiments show that our method achieves superior results, highlighting a promising direction for explainable AI-generated video detection based on physical plausibility. We believe that GenWorld will advance the field of AI-generated video detection. Project Page: https://chen-wl20.github.io/GenWorld

arXiv information

Authors	Weiliang Chen, Wenzhao Zheng, Yu Zheng, Lei Chen, Jie Zhou, Jiwen Lu, Yueqi Duan
Published	2025-06-12 17:59:33+00:00
arXiv site	arxiv_id(pdf)

Categories: cs.CV
