Category archive: cs.CV

Towards Real-Time Open-Vocabulary Video Instance Segmentation

Posted: December 6, 2024 | Author: jarxiv

Abstract: In this paper, open-vocabulary video instance segmentation ( …

Categories: cs.CV

Map It Anywhere (MIA): Empowering Bird’s Eye View Mapping using Large-scale Public Data

Posted: December 6, 2024 | Author: jarxiv

Abstract: Top-down bird's-eye view (BEV) maps, owing to their richness and flexibility for downstream tasks …

Categories: cs.CV

GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration

Posted: December 6, 2024 | Author: jarxiv

Abstract: Text-to-video generation models have shown significant progress in recent years. However …

Categories: cs.CV

Learning Artistic Signatures: Symmetry Discovery and Style Transfer

Posted: December 6, 2024 | Author: jarxiv

Abstract: Although the literature on style transfer has existed for nearly a decade, …

Categories: cs.CV

Moto: Latent Motion Token as the Bridging Language for Robot Manipulation

Posted: December 6, 2024 | Author: jarxiv

Abstract: Recent developments in large language models pre-trained on extensive corpora …

Categories: cs.AI, cs.CL, cs.CV, cs.LG, cs.RO

DiCoDe: Diffusion-Compressed Deep Tokens for Autoregressive Video Generation with Language Models

Posted: December 6, 2024 | Author: jarxiv

Abstract: Videos are, by their nature, inherently temporal sequences. In this work, …

Categories: cs.CV

EgoPlan-Bench2: A Benchmark for Multimodal Large Language Model Planning in Real-World Scenarios

Posted: December 6, 2024 | Author: jarxiv

Abstract: Recently, with the emergence of multimodal large language models that leverage the power of large language models …

Categories: cs.AI, cs.CV

MEMO: Memory-Guided Diffusion for Expressive Talking Video Generation

Posted: December 6, 2024 | Author: jarxiv

Abstract: With recent advances in video diffusion models, realistic audio-driven talking video …

Categories: cs.CV

p-MoD: Building Mixture-of-Depths MLLMs via Progressive Ratio Decay

Posted: December 6, 2024 | Author: jarxiv

Abstract: Multimodal large language models (MLLMs) across diverse tasks, and their excellent …

Categories: cs.CL, cs.CV

NaVILA: Legged Robot Vision-Language-Action Model for Navigation

Posted: December 6, 2024 | Author: jarxiv

Abstract: This paper sets out to solve the problem of vision-and-language navigation with legged robots …

Categories: cs.CV, cs.RO

