"cs.AI" Category Archive

3D-Aware Vision-Language Models Fine-Tuning with Geometric Distillation

Posted: June 12, 2025 | Author: jarxiv

Abstract: On diverse visual and linguistic tasks, vision-language models (VLMs) have shown remarkable … Continue reading →

Categories: cs.AI, cs.CV | Comments are closed

HadaNorm: Diffusion Transformer Quantization through Mean-Centered Transformations

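Posted: June 12, 2025 | Author: jarxiv

Abstract: Diffusion models represent the state of the art in image generation, but their high memory and computational … Continue reading →

Categories: cs.AI, cs.CV | Comments are closed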

CausalVQA: A Physically Grounded Causal Reasoning Benchmark for Video Models

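Posted: June 12, 2025 | Author: jarxiv

Abstract: Consisting of question-answer pairs that probe models' understanding of causality in the physical world, … Continue reading →

Categories: cs.AI, cs.CV, I.2.10 | Comments are closed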

UniPre3D: Unified Pre-training of 3D Point Cloud Models with Cross-Modal Gaussian Splatting

Posted: June 12, 2025 | Author: jarxiv

Abstract: The scale diversity of point cloud data, for unified 3D vision … Continue reading →

Categories: cs.AI, cs.CV | Comments are closed

Outside Knowledge Conversational Video (OKCV) Dataset — Dialoguing over Videos

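Posted: June 12, 2025 | Author: jarxiv

Abstract: In outside-knowledge visual question answering (OK-VQA), the model, for the relevant visual information in an image, … Continue reading →

Categories: cs.AI, cs.CL, cs.CV | Comments are closed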

Vision Generalist Model: A Survey

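Posted: June 12, 2025 | Author: jarxiv

Abstract: Recently, we have witnessed the great success of generalist models in natural language processing … Continue reading →

Categories: cs.AI, cs.CV | Comments are closed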

Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing

Posted: June 12, 2025 | Author: jarxiv

Abstract: Textual reasoning with large language models (LLMs) has advanced significantly … Continue reading →

Categories: cs.AI, cs.CV, I.2 | Comments are closed

TerraMind: Large-Scale Generative Multimodality for Earth Observation

Posted: June 12, 2025 | Author: jarxiv

Abstract: The first generative multimodal foundation model for Earth observation (EO), Ter … Continue reading →

Categories: cs.AI, cs.CV | Comments are closed

V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

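Posted: June 12, 2025 | Author: jarxiv

Abstract: A major challenge for modern AI is to learn to understand the world and to act largely by observation … Continue reading →

Categories: cs.AI, cs.CV, cs.LG, cs.RO | Comments are closed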

InterActHuman: Multi-Concept Human Animation with Layout-Aligned Audio Conditions

Posted: June 12, 2025 | Author: jarxiv

Abstract: End-to-end human animation with rich multimodal conditions, e.g. … Continue reading →

Categories: cs.AI, cs.CV, cs.SD | Comments are closed
