"cs.MM" Category Archive

MoRAG — Multi-Fusion Retrieval Augmented Generation for Human Motion

Posted: December 11, 2024 | Author: jarxiv

Abstract: For text-based human motion generation, a novel multi-part fusion-based … Continue reading →

Categories: cs.CV, cs.MM

STIV: Scalable Text and Image Conditioned Video Generation

Posted: December 11, 2024 | Author: jarxiv

Abstract: The field of video generation has made remarkable progress, yet robust and scalable … Continue reading →

Categories: cs.AI, cs.CV, cs.LG, cs.MM

AI TrackMate: Finally, Someone Who Will Give Your Music More Than Just ‘Sounds Great!’

Posted: December 10, 2024 | Author: jarxiv

Abstract: While the rise of "bedroom producers" has democratized music production, … Continue reading →

Categories: cs.HC, cs.LG, cs.MM, cs.SD, eess.AS

Towards Controllable Speech Synthesis in the Era of Large Language Models: A Survey

Posted: December 10, 2024 | Author: jarxiv

Abstract: Text-to-Speech (TTS), also known as speech synthesis, … Continue reading →

Categories: cs.AI, cs.CL, cs.LG, cs.MM, cs.SD, eess.AS

OmniEvalKit: A Modular, Lightweight Toolbox for Evaluating Large Language Model and its Omni-Extensions

Posted: December 10, 2024 | Author: jarxiv

Abstract: With the rapid advancement of large language models (LLMs), from multilingual support to … Continue reading →

Categories: cs.AI, cs.CL, cs.CV, cs.LG, cs.MM

LinVT: Empower Your Image-level Large Language Model to Understand Videos

Posted: December 9, 2024 | Author: jarxiv

Abstract: Large language models (LLMs) are widely used across a variety of tasks, and … Continue reading →

Categories: cs.CV, cs.LG, cs.MM

Copy-Move Forgery Detection and Question Answering for Remote Sensing Image

Posted: December 4, 2024 | Author: jarxiv

Abstract: This paper introduces the task of Remote Sensing Copy-Move Question Answering (RSCMQA) … Continue reading →

Categories: cs.CV, cs.MM

AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information?

Posted: December 4, 2024 | Author: jarxiv

Abstract: Recently, GPT-4o, Gemini 1.5 Pro, Reka Core, and other … Continue reading →

Categories: cs.AI, cs.CL, cs.CV, cs.MM, cs.SD, eess.AS

DAE-Talker: High Fidelity Speech-Driven Talking Face Generation with Diffusion Autoencoder

Posted: December 3, 2024 | Author: jarxiv

Abstract: Recent research has significantly advanced speech-driven talking face generation, but the generated … Continue reading →

Categories: cs.CV, cs.MM

Accelerating Multimodal Large Language Models via Dynamic Visual-Token Exit and the Empirical Findings

Posted: December 2, 2024 | Author: jarxiv

Abstract: Existing Multimodal Large Language Model (M … Continue reading →

Categories: cs.CL, cs.CV, cs.LG, cs.MM
