jarxiv | Japanese arxiv | Page 823

Onboard Satellite Image Classification for Earth Observation: A Comparative Study of ViT Models

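Posted: April 23, 2025 | Author: jarxiv

Summary

This study focuses on identifying the most effective pre-trained model for land use classification in onboard satellite processing, with emphasis on high accuracy, computational efficiency, and robustness to the noisy data commonly encountered during satellite-based inference.
Through extensive experiments, we compare traditional CNN-based models, ResNet-based models, and a range of pre-trained Vision Transformer models.
Our findings show that pre-trained Vision Transformer (ViT) models, particularly MobileViTV2 and EfficientViT-M2, outperform models trained from scratch in both accuracy and efficiency.
These models reach high performance with lower computational requirements and are more resilient during inference under noisy conditions.
MobileViTV2 performs best on clean validation data, but EfficientViT-M2 proves more robust to noise, which makes it the most suitable model for onboard satellite Earth observation (EO) tasks.
Our experimental results indicate that EfficientViT-M2 is the best choice for reliable and efficient RS-IC in satellite operations, achieving 98.76% accuracy, precision, and recall.
Specifically, EfficientViT-M2 delivers the highest performance across all metrics, excels in training efficiency (1,000 s) and inference time (10 s), and shows greater robustness (overall robustness score of 0.79).
EfficientViT-M2 also consumes 63.93% less power than MobileViTV2 (79.23 W) and 73.26% less power than SwinTransformer (108.90 W).
This highlights its significant advantage in energy efficiency.

Abstract (original)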

This study focuses on identifying the most effective pre-trained model for land use classification in onboard satellite processing, emphasizing achieving high accuracy, computational efficiency, and robustness against noisy data conditions commonly encountered during satellite-based inference. Through extensive experimentation, we compare the performance of traditional CNN-based, ResNet-based, and various pre-trained vision Transformer models. Our findings demonstrate that pre-trained Vision Transformer (ViT) models, particularly MobileViTV2 and EfficientViT-M2, outperform models trained from scratch in terms of accuracy and efficiency. These models achieve high performance with reduced computational requirements and exhibit greater resilience during inference under noisy conditions. While MobileViTV2 has excelled on clean validation data, EfficientViT-M2 has proved more robust when handling noise, making it the most suitable model for onboard satellite EO tasks. Our experimental results demonstrate that EfficientViT-M2 is the optimal choice for reliable and efficient RS-IC in satellite operations, achieving 98.76 % of accuracy, precision, and recall. Precisely, EfficientViT-M2 delivers the highest performance across all metrics, excels in training efficiency (1,000s) and inference time (10s), and demonstrates greater robustness (overall robustness score of 0.79). Consequently, EfficientViT-M2 consumes 63.93 % less power than MobileViTV2 (79.23 W) and 73.26 % less power than SwinTransformer (108.90 W). This highlights its significant advantage in energy efficiency.
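The abstract gives the baseline power draws and the relative savings, but not EfficientViT-M2's own power figure. As a rough sanity check, the Python sketch here back-computes it from each pair of quoted numbers. The function name `implied_power` is ours, and the result is an inference from the abstract, not a value reported by the authors.

```python
# Baseline power draws (watts) and the fractional savings quoted in the abstract.
baselines = {
    "MobileViTV2": (79.23, 0.6393),
    "SwinTransformer": (108.90, 0.7326),
}


def implied_power(baseline_watts: float, saving: float) -> float:
    """Power that remains after a fractional saving relative to a baseline."""
    return baseline_watts * (1.0 - saving)


# One estimate of EfficientViT-M2's draw per baseline (an inference, not reported data).
implied = {name: implied_power(w, s) for name, (w, s) in baselines.items()}

for name, watts in implied.items():
    print(f"implied EfficientViT-M2 power vs {name}: about {watts:.2f} W")
```

The two estimates land near 28.6 W and 29.1 W, so they agree only approximately. The small gap presumably comes from rounding or from measurement conditions that the abstract does not describe.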

arXiv information

Authors	Thanh-Dung Le, Vu Nguyen Ha, Ti Ti Nguyen, Geoffrey Eappen, Prabhu Thiruvasagam, Hong-fu Chou, Duc-Dung Tran, Hung Nguyen-Kha, Luis M. Garces-Socarras, Jorge L. Gonzalez-Rios, Juan Carlos Merlano-Duncan, Symeon Chatzinotas
Published	2025-04-22 14:51:17+00:00
arXiv site	arxiv_id (pdf)

Source, services used

arxiv.jp, Google

Categories: cs.CV, eess.SP

FreeGraftor: Training-Free Cross-Image Feature Grafting for Subject-Driven Text-to-Image Generation

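Posted: April 23, 2025 | Author: jarxiv

Summary

Subject-driven image generation aims to synthesize novel scenes that faithfully preserve the subject's identity from reference images while following textual guidance, but existing methods face a critical trade-off between fidelity and efficiency.
Tuning-based approaches depend on time-consuming, resource-intensive subject-specific optimization, while zero-shot methods fail to maintain adequate subject consistency.
In this work, we propose FreeGraftor, a training-free framework that addresses these limitations through cross-image feature grafting.
Specifically, FreeGraftor uses semantic matching and position-constrained attention fusion to transfer visual details from the reference subjects to the generated image.
Our framework also incorporates a novel noise initialization strategy that preserves the geometry priors of the reference subjects for robust feature matching.
Extensive qualitative and quantitative experiments show that our method enables precise subject identity transfer while maintaining text-aligned scene synthesis.
Without model fine-tuning or additional training, FreeGraftor significantly outperforms existing zero-shot and training-free approaches in both subject fidelity and text alignment.
Our framework also extends seamlessly to multi-subject generation, which makes it practical for real-world deployment.
Our code is available at https://github.com/Nihukat/FreeGraftor.

Abstract (original)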

Subject-driven image generation aims to synthesize novel scenes that faithfully preserve subject identity from reference images while adhering to textual guidance, yet existing methods struggle with a critical trade-off between fidelity and efficiency. Tuning-based approaches rely on time-consuming and resource-intensive subject-specific optimization, while zero-shot methods fail to maintain adequate subject consistency. In this work, we propose FreeGraftor, a training-free framework that addresses these limitations through cross-image feature grafting. Specifically, FreeGraftor employs semantic matching and position-constrained attention fusion to transfer visual details from reference subjects to the generated image. Additionally, our framework incorporates a novel noise initialization strategy to preserve geometry priors of reference subjects for robust feature matching. Extensive qualitative and quantitative experiments demonstrate that our method enables precise subject identity transfer while maintaining text-aligned scene synthesis. Without requiring model fine-tuning or additional training, FreeGraftor significantly outperforms existing zero-shot and training-free approaches in both subject fidelity and text alignment. Furthermore, our framework can seamlessly extend to multi-subject generation, making it practical for real-world deployment. Our code is available at https://github.com/Nihukat/FreeGraftor.
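To make the idea of semantic matching under a position constraint more concrete, here is a toy Python sketch. It is not FreeGraftor's attention-fusion mechanism, which the abstract does not specify. It only assumes that each patch has a feature vector, that matching uses cosine similarity, and that a boolean list `allowed` marks where grafting is permitted. The function names and the `threshold` value are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def graft(gen_feats, ref_feats, allowed, threshold=0.8):
    """Copy the best-matching reference feature into each permitted position.

    A generated feature is replaced only if grafting is allowed at that
    position and the best match is at least `threshold` similar.
    """
    out = []
    for i, g in enumerate(gen_feats):
        best = max(ref_feats, key=lambda r: cosine(g, r))
        if allowed[i] and cosine(g, best) >= threshold:
            out.append(list(best))
        else:
            out.append(list(g))
    return out


gen = [[1.0, 0.1], [0.0, 1.0], [1.0, 0.0]]   # generated-image patch features
ref = [[0.9, 0.0], [0.1, 0.9]]               # reference-subject patch features
result = graft(gen, ref, allowed=[True, True, False])
print(result)
```

In this toy run the first two positions take their nearest reference feature and the third is left alone because the constraint forbids grafting there.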

arXiv information

Authors	Zebin Yao, Lei Ren, Huixing Jiang, Chen Wei, Xiaojie Wang, Ruifan Li, Fangxiang Feng
Published	2025-04-22 14:55:23+00:00
arXiv site	arxiv_id (pdf)

Source, services used

arxiv.jp, Google

Categories: cs.CV

Is Large-Scale Pretraining the Secret to Good Domain Generalization?

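Posted: April 23, 2025 | Author: jarxiv

Summary

Multi-source domain generalization (DG) is the task of training on multiple source domains and achieving high classification performance on unseen target domains.
Recent methods combine robust features from web-scale pretrained backbones with new features learned from source data, which has dramatically improved benchmark results.
However, it remains unclear whether DG fine-tuning methods are actually improving over time or whether the better benchmark performance is simply an artifact of stronger pre-training.
Prior studies have shown that perceptual similarity to the pre-training data correlates with zero-shot performance, but we find this effect to be limited in the DG setting.
Instead, we posit that having perceptually similar data in pre-training is not enough.
What determines performance is how well those data were learned.
This leads us to introduce the Alignment Hypothesis, which states that final DG performance will be high if and only if the alignment between image embeddings and class-label text embeddings is high.
Our experiments confirm the Alignment Hypothesis, and we use it as an analysis tool for existing DG methods evaluated on DomainBed datasets by splitting the evaluation data into In-pretraining (IP) and Out-of-pretraining (OOP) subsets.
We show that all evaluated DG methods struggle on DomainBed-OOP, while recent methods excel on DomainBed-IP.
Taken together, our findings highlight the need for DG methods that can generalize beyond pretraining alignment.

Abstract (original)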

Multi-Source Domain Generalization (DG) is the task of training on multiple source domains and achieving high classification performance on unseen target domains. Recent methods combine robust features from web-scale pretrained backbones with new features learned from source data, and this has dramatically improved benchmark results. However, it remains unclear if DG finetuning methods are becoming better over time, or if improved benchmark performance is simply an artifact of stronger pre-training. Prior studies have shown that perceptual similarity to pre-training data correlates with zero-shot performance, but we find the effect limited in the DG setting. Instead, we posit that having perceptually similar data in pretraining is not enough; and that it is how well these data were learned that determines performance. This leads us to introduce the Alignment Hypothesis, which states that the final DG performance will be high if and only if alignment of image and class label text embeddings is high. Our experiments confirm the Alignment Hypothesis is true, and we use it as an analysis tool of existing DG methods evaluated on DomainBed datasets by splitting evaluation data into In-pretraining (IP) and Out-of-pretraining (OOP). We show that all evaluated DG methods struggle on DomainBed-OOP, while recent methods excel on DomainBed-IP. Put together, our findings highlight the need for DG methods which can generalize beyond pretraining alignment.
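One way to picture an alignment-based split is sketched in Python here. The abstract does not say how alignment is scored or how the IP/OOP boundary is chosen, so this sketch assumes cosine similarity between an image embedding and the text embedding of its class label, with a fixed cutoff. The function names, the sample data, and the `threshold` value are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def split_by_alignment(samples, threshold):
    """Split samples into (in_pretraining, out_of_pretraining) name lists.

    Each sample is (name, image_embedding, label_text_embedding).
    A sample counts as well aligned when its image/text similarity
    reaches the threshold.
    """
    ip, oop = [], []
    for name, img_emb, txt_emb in samples:
        target = ip if cosine(img_emb, txt_emb) >= threshold else oop
        target.append(name)
    return ip, oop


samples = [
    ("photo_dog", [0.9, 0.1], [1.0, 0.0]),   # image close to its label text
    ("sketch_dog", [0.2, 0.9], [1.0, 0.0]),  # image far from its label text
]
ip, oop = split_by_alignment(samples, threshold=0.5)
print("IP:", ip, "OOP:", oop)
```

Evaluating a DG method separately on the two lists is then what distinguishes gains that come from pre-training alignment from gains that come from the method itself.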

arXiv information

Authors	Piotr Teterwak, Kuniaki Saito, Theodoros Tsiligkaridis, Bryan A. Plummer, Kate Saenko
Published	2025-04-22 15:04:37+00:00
arXiv site	arxiv_id (pdf)

Source, services used

arxiv.jp, Google

Categories: cs.CV, cs.LG

Recent Advances and Future Directions in Extended Reality (XR): Exploring AI-Powered Spatial Intelligence

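Posted: April 23, 2025 | Author: jarxiv

Summary

Extended Reality (XR), which encompasses Augmented Reality (AR), Virtual Reality (VR), and Mixed Reality (MR), is a transformative technology that bridges the physical and virtual worlds, with diverse potential that is expected to become ubiquitous.
This review examines the evolution of XR through its foundational framework: hardware ranging from monitors to sensors, and software ranging from visual tasks to user interfaces.
It highlights state-of-the-art (SOTA) XR products, comparing and analyzing their performance on the basis of that framework.
It discusses how commercial XR devices can meet the demand for high-quality performance, with a focus on spatial intelligence.
As future directions, attention should be given to integrating multi-modal AI and IoT-driven digital twins to enable adaptive XR systems.
Guided by the concept of spatial intelligence, future XR should establish a new digital space with realistic experiences that benefit humanity.
This review underscores the pivotal role of AI in unlocking XR as the next frontier in human-computer interaction.

Abstract (original)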

Extended Reality (XR), encompassing Augmented Reality (AR), Virtual Reality (VR) and Mixed Reality (MR), is a transformative technology bridging the physical and virtual world and it has diverse potential which will be ubiquitous in the future. This review examines XR’s evolution through foundational framework – hardware ranging from monitors to sensors and software ranging from visual tasks to user interface; highlights state of the art (SOTA) XR products with the comparison and analysis of performance based on their foundational framework; discusses how commercial XR devices can support the demand of high-quality performance focusing on spatial intelligence. For future directions, attention should be given to the integration of multi-modal AI and IoT-driven digital twins to enable adaptive XR systems. With the concept of spatial intelligence, future XR should establish a new digital space with realistic experience that benefits humanity. This review underscores the pivotal role of AI in unlocking XR as the next frontier in human-computer interaction.

arXiv information

Authors	Baichuan Zeng
Published	2025-04-22 15:11:55+00:00
arXiv site	arxiv_id (pdf)

Source, services used

arxiv.jp, Google

Categories: cs.CV, cs.HC, cs.MA

A New Graph Grammar Formalism for Robust Syntactic Pattern Recognition

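Posted: April 23, 2025 | Author: jarxiv

Summary

I introduce a formalism for representing the syntax of recursively structured graph-like patterns.
Rather than using production rules as a conventional graph grammar does, it represents syntactic structure in a more direct and declarative way.
The grammar and the pattern are both represented as networks, and parsing is viewed as the construction of a homomorphism from the pattern to the grammar.
The grammars can represent iterative, hierarchical, and nested recursive structure in more than one dimension.
This supports a highly parallel style of parsing in which all aspects of pattern recognition (feature detection, segmentation, parsing, filling in missing symbols, and top-down and bottom-up inference) are integrated into a single process to exploit the synergy between them.
The emphasis of the paper is on the underlying theoretical issues, but I also give some example runs that illustrate error-tolerant parsing of complex recursively structured patterns of 50-1000 symbols, involving variability in geometric relationships, blurry and indistinct symbols, overlapping symbols, cluttered images, and erased patches.

Abstract (original)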

I introduce a formalism for representing the syntax of recursively structured graph-like patterns. It does not use production rules, like a conventional graph grammar, but represents the syntactic structure in a more direct and declarative way. The grammar and the pattern are both represented as networks, and parsing is seen as the construction of a homomorphism from the pattern to the grammar. The grammars can represent iterative, hierarchical and nested recursive structure in more than one dimension. This supports a highly parallel style of parsing, in which all aspects of pattern recognition (feature detection, segmentation, parsing, filling in missing symbols, top-down and bottom-up inference) are integrated into a single process, to exploit the synergy between them. The emphasis of this paper is on underlying theoretical issues, but I also give some example runs to illustrate the error-tolerant parsing of complex recursively structured patterns of 50-1000 symbols, involving variability in geometric relationships, blurry and indistinct symbols, overlapping symbols, cluttered images, and erased patches.
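The phrase "parsing as a homomorphism from the pattern to the grammar" can be illustrated with ordinary directed graphs. The Python sketch here checks only the textbook edge-preservation condition and does not reproduce the paper's network formalism. The graphs, node names, and the function `is_homomorphism` are invented for the example.

```python
def is_homomorphism(pattern_edges, grammar_edges, mapping):
    """True if every pattern edge (u, v) lands on a grammar edge (mapping[u], mapping[v])."""
    grammar = set(grammar_edges)
    return all((mapping[u], mapping[v]) in grammar for u, v in pattern_edges)


# Grammar network: "item" may follow itself, which encodes iteration
# without a production rule.
grammar_edges = [("start", "item"), ("item", "item"), ("item", "end")]

# Pattern network: one concrete chain containing three items.
pattern_edges = [("s", "a"), ("a", "b"), ("b", "c"), ("c", "e")]

# A candidate parse assigns each pattern node to a grammar node.
mapping = {"s": "start", "a": "item", "b": "item", "c": "item", "e": "end"}

ok = is_homomorphism(pattern_edges, grammar_edges, mapping)
bad = is_homomorphism(pattern_edges, grammar_edges, {**mapping, "b": "end"})
print(ok, bad)
```

Several pattern nodes map onto the single `item` node, which is how a finite grammar network can accept patterns of arbitrary length. The second mapping fails because it would need an edge from `end` back to `item`.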

arXiv information

Authors	Peter Fletcher
Published	2025-04-22 15:23:37+00:00
arXiv site	arxiv_id (pdf)

Source, services used

arxiv.jp, Google

Categories: cs.CV, cs.FL, F.4.2

Efficient Adaptation of Deep Neural Networks for Semantic Segmentation in Space Applications

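Posted: April 23, 2025 | Author: jarxiv

Summary

In recent years, deep learning techniques have shown remarkable success across computer vision tasks, paving the way for their deployment in extraterrestrial exploration.
Transfer learning has emerged as a powerful strategy for addressing the scarcity of labeled data in these novel environments.
This paper is one of the first efforts to evaluate the feasibility of using adapters for efficient transfer learning in rock segmentation of extraterrestrial landscapes, focusing mainly on lunar and Martian terrain.
Our work suggests that adapters, strategically integrated into a pre-trained backbone model, can reduce both the bandwidth and the memory requirements of the target extraterrestrial device.
In this study, we consider two memory-saving strategies: layer fusion (which reduces the inference overhead to zero) and an "adapter ranking" (which also reduces the transmission cost).
Finally, we evaluate the results in terms of task performance, memory, and computation on embedded devices, and show trade-offs that open the way to further research in the field.

Abstract (original)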

In recent years, the application of Deep Learning techniques has shown remarkable success in various computer vision tasks, paving the way for their deployment in extraterrestrial exploration. Transfer learning has emerged as a powerful strategy for addressing the scarcity of labeled data in these novel environments. This paper represents one of the first efforts in evaluating the feasibility of employing adapters toward efficient transfer learning for rock segmentation in extraterrestrial landscapes, mainly focusing on lunar and martian terrains. Our work suggests that the use of adapters, strategically integrated into a pre-trained backbone model, can be successful in reducing both bandwidth and memory requirements for the target extraterrestrial device. In this study, we considered two memory-saving strategies: layer fusion (to reduce to zero the inference overhead) and an “adapter ranking” (to also reduce the transmission cost). Finally, we evaluate these results in terms of task performance, memory, and computation on embedded devices, evidencing trade-offs that open the road to more research in the field.
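Why layer fusion can bring the inference overhead to zero is easiest to see in the simplest case. The Python sketch here assumes a purely linear adapter running in parallel with a frozen linear layer, so that `y = W x + up (down x)`. The abstract does not describe the adapter architecture, so this shape, the matrix names, and the helper functions are assumptions made for illustration.

```python
def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def matvec(m, x):
    """Product of a matrix (list of rows) and a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in m]


def fuse(w, down, up):
    """Fold the adapter product up @ down into the backbone weight w."""
    delta = matmul(up, down)
    return [[wij + dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


w = [[1.0, 2.0], [3.0, 4.0]]   # frozen backbone weight (2x2)
down = [[1.0, -1.0]]           # adapter down-projection (1x2)
up = [[0.5], [2.0]]            # adapter up-projection (2x1)
x = [1.0, 2.0]

# Unfused: backbone and adapter are evaluated separately, then summed.
separate = [a + b for a, b in zip(matvec(w, x), matvec(up, matvec(down, x)))]
# Fused: a single matrix of the original size gives the same output.
fused = matvec(fuse(w, down, up), x)
print(separate, fused)
```

After fusion the device stores and multiplies one matrix of the original shape, while only the small `down` and `up` factors would need to be transmitted. Adapters with a nonlinearity between the projections cannot be folded this way.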

arXiv information

Authors	Leonardo Olivi, Edoardo Santero Mormile, Enzo Tartaglione
Published	2025-04-22 15:53:59+00:00
arXiv site	arxiv_id (pdf)

Source, services used

arxiv.jp, Google

Categories: cs.CV, cs.LG

MVQA: Mamba with Unified Sampling for Efficient Video Quality Assessment

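Posted: April 23, 2025 | Author: jarxiv

Summary

The rapid growth of long-duration, high-definition video has made efficient video quality assessment (VQA) a critical challenge.
Existing research typically tackles the problem through two main strategies: reducing model parameters and resampling inputs.
However, lightweight convolutional neural networks (CNNs) and Transformers often struggle to balance efficiency with high performance because the task requires long-range modeling.
Recently, state-space models, particularly Mamba, have emerged as a promising alternative that offers linear complexity with respect to sequence length.
Meanwhile, efficient VQA depends heavily on resampling long sequences to minimize computational cost, yet current resampling methods are often weak at preserving essential semantic information.
In this work, we present MVQA, a Mamba-based model designed for efficient VQA, together with a novel Unified Semantic and Distortion Sampling (USDS) approach.
USDS combines semantic patch sampling from low-resolution videos with distortion patch sampling from original-resolution videos.
The former captures semantically dense regions, while the latter retains critical distortion details.
To keep the dual inputs from increasing computation, we propose a fusion mechanism based on pre-defined masks, which yields a unified sampling strategy that captures both semantic and quality information without additional computational burden.
Experiments show that MVQA equipped with USDS achieves performance comparable to state-of-the-art methods while being $2\times$ as fast and requiring only $1/5$ of the GPU memory.

Abstract (original)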

The rapid growth of long-duration, high-definition videos has made efficient video quality assessment (VQA) a critical challenge. Existing research typically tackles this problem through two main strategies: reducing model parameters and resampling inputs. However, light-weight Convolution Neural Networks (CNN) and Transformers often struggle to balance efficiency with high performance due to the requirement of long-range modeling capabilities. Recently, the state-space model, particularly Mamba, has emerged as a promising alternative, offering linear complexity with respect to sequence length. Meanwhile, efficient VQA heavily depends on resampling long sequences to minimize computational costs, yet current resampling methods are often weak in preserving essential semantic information. In this work, we present MVQA, a Mamba-based model designed for efficient VQA along with a novel Unified Semantic and Distortion Sampling (USDS) approach. USDS combines semantic patch sampling from low-resolution videos and distortion patch sampling from original-resolution videos. The former captures semantically dense regions, while the latter retains critical distortion details. To prevent computation increase from dual inputs, we propose a fusion mechanism using pre-defined masks, enabling a unified sampling strategy that captures both semantic and quality information without additional computational burden. Experiments show that the proposed MVQA, equipped with USDS, achieve comparable performance to state-of-the-art methods while being $2\times$ as fast and requiring only $1/5$ GPU memory.
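The mask-based fusion can be pictured as choosing, for every slot of a patch grid, either the semantic patch or the distortion patch, so that the fused input is no longer than a single input. The Python sketch here is a deliberately small illustration of that idea. The abstract gives no details of the mask layout or patch sizes, so the checkerboard mask, the string placeholders for patches, and the function `fuse_patches` are hypothetical.

```python
def fuse_patches(semantic, distortion, mask):
    """Pick one patch per grid slot: semantic where mask is 1, distortion where it is 0."""
    if not (len(semantic) == len(distortion) == len(mask)):
        raise ValueError("all inputs must cover the same patch grid")
    return [s if m else d for s, d, m in zip(semantic, distortion, mask)]


# Placeholders: S* would be patches cut from a downscaled frame,
# D* patches cropped from the original-resolution frame.
semantic = ["S0", "S1", "S2", "S3"]
distortion = ["D0", "D1", "D2", "D3"]
mask = [1, 0, 0, 1]  # fixed checkerboard over a 2x2 grid

fused = fuse_patches(semantic, distortion, mask)
print(fused)
```

Because the mask is fixed in advance, the fused sequence has the same length as either input alone, which is why a scheme of this kind adds no tokens for the downstream model to process.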

arXiv information

Authors	Yachun Mi, Yu Li, Weicheng Meng, Chaofeng Chen, Chen Hui, Shaohui Liu
Published	2025-04-22 16:08:23+00:00
arXiv site	arxiv_id (pdf)

Source, services used

arxiv.jp, Google

Categories: cs.CV

Efficient Temporal Consistency in Diffusion-Based Video Editing with Adaptor Modules: A Theoretical Framework

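Posted: April 23, 2025 | Author: jarxiv

Summary

Adapter-based methods are commonly used to enhance model performance with minimal additional complexity, especially in video editing tasks that require frame-to-frame consistency.
By inserting small, learnable modules into pretrained diffusion models, these adapters can maintain temporal coherence without extensive retraining.
Approaches that incorporate prompt learning with both shared and frame-specific tokens are particularly effective at preserving continuity across frames at low training cost.
In this work, we aim to provide a general theoretical framework for adapters that maintain frame consistency in DDIM-based models under a temporal consistency loss.
First, we prove that the temporal consistency objective is differentiable under bounded feature norms, and we establish a Lipschitz bound on its gradient.
Second, we show that gradient descent on this objective decreases the loss monotonically and converges to a local minimum when the learning rate lies within an appropriate range.
Finally, we analyze the stability of the modules in the DDIM inversion procedure and show that the associated error remains controlled.
These theoretical findings reinforce the reliability of diffusion-based video editing methods that rely on adapter strategies and provide theoretical insight for video generation tasks.

Abstract (original)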

Adapter-based methods are commonly used to enhance model performance with minimal additional complexity, especially in video editing tasks that require frame-to-frame consistency. By inserting small, learnable modules into pretrained diffusion models, these adapters can maintain temporal coherence without extensive retraining. Approaches that incorporate prompt learning with both shared and frame-specific tokens are particularly effective in preserving continuity across frames at low training cost. In this work, we want to provide a general theoretical framework for adapters that maintain frame consistency in DDIM-based models under a temporal consistency loss. First, we prove that the temporal consistency objective is differentiable under bounded feature norms, and we establish a Lipschitz bound on its gradient. Second, we show that gradient descent on this objective decreases the loss monotonically and converges to a local minimum if the learning rate is within an appropriate range. Finally, we analyze the stability of modules in the DDIM inversion procedure, showing that the associated error remains controlled. These theoretical findings will reinforce the reliability of diffusion-based video editing methods that rely on adapter strategies and provide theoretical insights in video generation tasks.
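The monotone-decrease claim can be checked numerically on a toy objective. The abstract does not give the exact loss, so the Python sketch here assumes the simplest form, the sum of squared differences between consecutive scalar frame features, and applies plain gradient descent directly to those features. For this quadratic the gradient is Lipschitz with constant at most 8, so any learning rate below 2/8 = 0.25 should never increase the loss. The function names and the sample data are ours.

```python
def loss(f):
    """Assumed temporal consistency loss: sum of squared consecutive differences."""
    return sum((b - a) ** 2 for a, b in zip(f, f[1:]))


def grad(f):
    """Gradient of the loss with respect to every frame feature."""
    g = [0.0] * len(f)
    for t in range(len(f) - 1):
        d = 2.0 * (f[t + 1] - f[t])
        g[t] -= d
        g[t + 1] += d
    return g


def descend(f, lr, steps):
    """Run gradient descent and record the loss after every step."""
    history = [loss(f)]
    for _ in range(steps):
        f = [x - lr * gx for x, gx in zip(f, grad(f))]
        history.append(loss(f))
    return f, history


features = [0.0, 1.0, -0.5, 2.0]  # one scalar feature per frame
final, history = descend(features, lr=0.1, steps=50)
print(f"loss before: {history[0]:.3f}, after: {history[-1]:.6f}")
```

In this toy setting the minimum is reached when all frames share the same feature. A real objective would balance consistency against an editing or reconstruction term, which the sketch leaves out.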

arXiv information

Authors	Xinyuan Song, Yangfan He, Sida Li, Jianhui Wang, Hongyang He, Xinhang Yuan, Ruoyu Wang, Jiaqi Chen, Keqin Li, Kuan Lu, Menghao Huo, Binxu Li, Pei Liu
Published	2025-04-22 16:28:35+00:00
arXiv site	arxiv_id (pdf)

Source, services used

arxiv.jp, Google

Categories: cs.CV

PointLoRA: Low-Rank Adaptation with Token Selection for Point Cloud Learning

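Posted: April 23, 2025 | Author: jarxiv

Summary

Self-supervised representation learning for point clouds has proven effective at improving pre-trained model performance across diverse tasks.
However, as pre-trained models grow in complexity, fully fine-tuning them for downstream applications demands substantial computational and storage resources.
Parameter-efficient fine-tuning (PEFT) methods offer a promising way to reduce these resource requirements, yet most current approaches rely on complex adapter and prompt mechanisms that increase the number of tunable parameters.
In this paper, we propose PointLoRA, a simple yet effective method that combines low-rank adaptation (LoRA) with multi-scale token selection to fine-tune point cloud models efficiently.
Our approach embeds LoRA layers in the most parameter-intensive components of point cloud transformers, reducing the need for tunable parameters while enhancing global feature capture.
In addition, multi-scale token selection extracts critical local information to serve as prompts for downstream fine-tuning, effectively complementing the global context captured by LoRA.
Experimental results across various pre-trained models and three challenging public datasets show that our approach achieves competitive performance with only 3.43% of the trainable parameters, making it highly effective for resource-constrained applications.
Source code is available at https://github.com/songw-zju/PointLoRA.

Abstract (original)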

Self-supervised representation learning for point cloud has demonstrated effectiveness in improving pre-trained model performance across diverse tasks. However, as pre-trained models grow in complexity, fully fine-tuning them for downstream applications demands substantial computational and storage resources. Parameter-efficient fine-tuning (PEFT) methods offer a promising solution to mitigate these resource requirements, yet most current approaches rely on complex adapter and prompt mechanisms that increase tunable parameters. In this paper, we propose PointLoRA, a simple yet effective method that combines low-rank adaptation (LoRA) with multi-scale token selection to efficiently fine-tune point cloud models. Our approach embeds LoRA layers within the most parameter-intensive components of point cloud transformers, reducing the need for tunable parameters while enhancing global feature capture. Additionally, multi-scale token selection extracts critical local information to serve as prompts for downstream fine-tuning, effectively complementing the global context captured by LoRA. The experimental results across various pre-trained models and three challenging public datasets demonstrate that our approach achieves competitive performance with only 3.43% of the trainable parameters, making it highly effective for resource-constrained applications. Source code is available at: https://github.com/songw-zju/PointLoRA.
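The parameter saving from LoRA follows from simple counting: a rank-r update to a `d_out x d_in` weight trains `r * (d_in + d_out)` values instead of `d_in * d_out`. The Python sketch here works through one layer. The layer size 768 and rank 8 are illustrative assumptions, not settings taken from PointLoRA, and the 3.43% figure in the abstract is a whole-model number that this single-layer calculation does not attempt to reproduce.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable values in a rank-`rank` update (one d_out x rank and one rank x d_in factor)."""
    return rank * (d_in + d_out)


def trainable_fraction(d_in: int, d_out: int, rank: int) -> float:
    """Share of the full weight matrix size that the low-rank update trains."""
    return lora_params(d_in, d_out, rank) / (d_in * d_out)


# Hypothetical transformer projection: 768 x 768 weight, rank-8 update.
full = 768 * 768
low_rank = lora_params(768, 768, 8)
frac = trainable_fraction(768, 768, 8)
print(f"full: {full}, LoRA: {low_rank}, fraction: {frac:.2%}")
```

The fraction shrinks as the layer gets wider, which is consistent with the abstract's choice to place LoRA layers in the most parameter-intensive components.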

arXiv information

Authors	Song Wang, Xiaolu Liu, Lingdong Kong, Jianyun Xu, Chunyong Hu, Gongfan Fang, Wentong Li, Jianke Zhu, Xinchao Wang
Published	2025-04-22 16:41:21+00:00
arXiv site	arxiv_id (pdf)

Source, services used

arxiv.jp, Google

Categories: cs.CV

LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale

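Posted: April 23, 2025 | Author: jarxiv

Summary

Recent video large language models (Video LLMs) often depend on costly human annotations or proprietary model APIs (e.g., GPT-4o) to produce training data, which limits their training at scale.
In this paper, we explore large-scale training of Video LLMs with cheap automatic speech recognition (ASR) transcripts.
Specifically, we propose a novel streaming training approach that densely interleaves ASR words and video frames according to their timestamps.
Compared with previous studies of vision-language representation with ASR, our method naturally fits the streaming nature of ASR, enabling the model to learn temporally aligned, fine-grained vision-language modeling.
To support the training algorithm, we introduce a data production pipeline that processes YouTube videos and their closed captions (CC, same as ASR), resulting in the Live-CC-5M dataset for pre-training and the Live-WhisperX-526K dataset for high-quality supervised fine-tuning (SFT).
Remarkably, even without SFT, the ASR-only pre-trained LiveCC-7B-Base model shows competitive general video QA performance and exhibits a new capability in real-time video commentary.
To evaluate this, we carefully design a new LiveSports-3K benchmark that uses LLM-as-a-judge to measure free-form commentary.
Experiments show that our final LiveCC-7B-Instruct model can surpass advanced 72B models (Qwen2.5-VL-72B-Instruct, LLaVA-Video-72B) in commentary quality even when running in real-time mode.
It also achieves state-of-the-art results at the 7B/8B scale on popular video QA benchmarks such as VideoMME and OVOBench, demonstrating the broad generalizability of our approach.
All resources of this paper have been released at https://showlab.github.io/livecc.

Abstract (original)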

Recent video large language models (Video LLMs) often depend on costly human annotations or proprietary model APIs (e.g., GPT-4o) to produce training data, which limits their training at scale. In this paper, we explore large-scale training for Video LLM with cheap automatic speech recognition (ASR) transcripts. Specifically, we propose a novel streaming training approach that densely interleaves the ASR words and video frames according to their timestamps. Compared to previous studies in vision-language representation with ASR, our method naturally fits the streaming characteristics of ASR, thus enabling the model to learn temporally-aligned, fine-grained vision-language modeling. To support the training algorithm, we introduce a data production pipeline to process YouTube videos and their closed captions (CC, same as ASR), resulting in Live-CC-5M dataset for pre-training and Live-WhisperX-526K dataset for high-quality supervised fine-tuning (SFT). Remarkably, even without SFT, the ASR-only pre-trained LiveCC-7B-Base model demonstrates competitive general video QA performance and exhibits a new capability in real-time video commentary. To evaluate this, we carefully design a new LiveSports-3K benchmark, using LLM-as-a-judge to measure the free-form commentary. Experiments show our final LiveCC-7B-Instruct model can surpass advanced 72B models (Qwen2.5-VL-72B-Instruct, LLaVA-Video-72B) in commentary quality even working in a real-time mode. Meanwhile, it achieves state-of-the-art results at the 7B/8B scale on popular video QA benchmarks such as VideoMME and OVOBench, demonstrating the broad generalizability of our approach. All resources of this paper have been released at https://showlab.github.io/livecc.
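Interleaving ASR words with video frames by timestamp amounts to merging two time-ordered streams. The Python sketch here shows one way to do that with the standard library. The tuple layout, the rule that a frame precedes a word with the same timestamp, and the function `interleave` are assumptions for illustration and are not taken from the LiveCC pipeline.

```python
import heapq


def interleave(frames, words):
    """Merge (timestamp, payload) streams into one time-ordered token list.

    The integer in second position breaks timestamp ties so that a frame
    comes before a word spoken at the same instant.
    """
    tagged_frames = sorted((t, 0, "frame", f) for t, f in frames)
    tagged_words = sorted((t, 1, "word", w) for t, w in words)
    merged = heapq.merge(tagged_frames, tagged_words)
    return [(kind, payload) for _, _, kind, payload in merged]


frames = [(0.0, "f0"), (0.5, "f1"), (1.0, "f2")]          # sampled video frames
words = [(0.2, "the"), (0.5, "ball"), (1.3, "goes")]      # ASR words with start times

sequence = interleave(frames, words)
for kind, payload in sequence:
    print(kind, payload)
```

A model trained on such a sequence sees each word right after the frames that precede it in time, which matches the streaming setting the abstract describes.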

arXiv information

Authors	Joya Chen, Ziyun Zeng, Yiqi Lin, Wei Li, Zejun Ma, Mike Zheng Shou
Published	2025-04-22 16:52:09+00:00
arXiv site	arxiv_id (pdf)

Source, services used

arxiv.jp, Google

Categories: cs.CV
