Archive of posts by "jarxiv"

Multimodal Long Video Modeling Based on Temporal Dynamic Context

Posted: April 15, 2025 | Author: jarxiv

Abstract: Recent advances in large language models (LLMs) have led to major break …

Categories: cs.AI, cs.CL, cs.CV, cs.LG, cs.MM

Learning Free Token Reduction for Multi-Modal Large Language Models

Posted: April 15, 2025 | Author: jarxiv

Abstract: Vision-language models (VLMs) have had remarkable success on a variety of multimodal tasks …

Categories: cs.AI, cs.CL, cs.CV

RealWebAssist: A Benchmark for Long-Horizon Web Assistance with Real-World Users

Posted: April 15, 2025 | Author: jarxiv

Abstract: To achieve successful assistance with long-horizon web-based tasks, AI agents …

Categories: cs.AI, cs.CL, cs.CV, cs.LG

Integrating Vision and Location with Transformers: A Multimodal Deep Learning Framework for Medical Wound Analysis

Posted: April 15, 2025 | Author: jarxiv

Abstract: Effective recognition of acute and difficult-to-heal wounds is a necessary step in wound diagnosis. …

Categories: cs.CV

GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents

Posted: April 15, 2025 | Author: jarxiv

Abstract: In building graphical user interface (GUI) agents, existing …

Categories: cs.CL, cs.CV, cs.HC

MB-TaylorFormer V2: Improved Multi-branch Linear Transformer Expanded by Taylor Formula for Image Restoration

Posted: April 15, 2025 | Author: jarxiv

Abstract: Recently, Transformer networks, with their global receptive field and adaptability to the input, …

Categories: cs.CV

The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer

Posted: April 15, 2025 | Author: jarxiv

Abstract: In this paper, within a single architecture, raw pixel encoding and language deco …

Categories: cs.CV

Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding

Posted: April 15, 2025 | Author: jarxiv

Abstract: Multimodal large language models (MLLMs), in fine-grained pixel-level understanding …

Categories: cs.CV

SplatMesh: Interactive 3D Segmentation and Editing Using Mesh-Based Gaussian Splatting

Posted: April 15, 2025 | Author: jarxiv

Abstract: A key challenge in fine-grained 3D-based interactive editing involves specific memory constraints …

Categories: cs.CV, cs.GR

Art3D: Training-Free 3D Generation from Flat-Colored Illustration

Posted: April 15, 2025 | Author: jarxiv

Abstract: Large-scale pre-trained image-to-3D generative models show, in generating diverse shapes, remarkable …

Categories: cs.CV
