Archive of posts by "jarxiv"

Understanding Long Videos with Multimodal Language Models

Posted on June 12, 2025 by jarxiv

Abstract: Large language models (LLMs) have enabled recent LLM-based approaches …

Categories: cs.CV

Efficient Part-level 3D Object Generation via Dual Volume Packing

Posted on June 12, 2025 by jarxiv

Abstract: Recent progress in 3D object generation has greatly improved both quality and efficiency …

Categories: cs.CV

ReSim: Reliable World Simulation for Autonomous Driving

Posted on June 12, 2025 by jarxiv

Abstract: How can future driving scenarios be reliably simulated under a wide range of ego driving behaviors …

Categories: cs.CV, cs.RO

AnimateAnyMesh: A Feed-Forward 4D Foundation Model for Text-Driven Universal Mesh Animation

Posted on June 12, 2025 by jarxiv

Abstract: Recent advances in 4D content generation have attracted increasing attention, yet high-quality ani…

Categories: cs.CV

V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

Posted on June 12, 2025 by jarxiv

Abstract: A major challenge for modern AI is learning to understand the world and to act largely by observation …

Categories: cs.AI, cs.CV, cs.LG, cs.RO

InterActHuman: Multi-Concept Human Animation with Layout-Aligned Audio Conditions

Posted on June 12, 2025 by jarxiv

Abstract: End-to-end human animation with rich multimodal conditions, e.g. …

Categories: cs.AI, cs.CV, cs.SD

A Shortcut-aware Video-QA Benchmark for Physical Understanding via Minimal Video Pairs

Posted on June 12, 2025 by jarxiv

Abstract: Existing benchmarks for evaluating the spatio-temporal understanding and reasoning abilities of video language models …

Categories: cs.CV, cs.LG

EditInspector: A Benchmark for Evaluation of Text-Guided Image Edits

Posted on June 12, 2025 by jarxiv

Abstract: Text-guided image editing, fueled by recent advances in generative AI, is becoming increasingly widespread …

Categories: cs.AI, cs.CV, cs.LG

Hearing Hands: Generating Sounds from Physical Interactions in 3D Scenes

Posted on June 12, 2025 by jarxiv

Abstract: Making 3D scene reconstructions interactive by asking the following question …

Categories: cs.CV

Spectral Image Tokenizer

Posted on June 12, 2025 by jarxiv

Abstract: Image tokenizers map images to sequences of discrete tokens, and autoregressive …

Categories: cs.CV, cs.LG
