Time-VLM: Exploring Multimodal Vision-Language Models for Augmented Time Series Forecasting

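Summary

Recent advances in time series forecasting have augmented models with text or vision modalities to improve accuracy.
Text provides contextual understanding but often lacks fine-grained temporal detail.
Conversely, vision captures intricate temporal patterns but lacks semantic context, which limits the complementary potential of these modalities.
To address this, we propose Time-VLM, a novel multimodal framework that leverages pre-trained Vision-Language Models (VLMs) to bridge temporal, visual, and textual modalities for enhanced forecasting.
Our framework consists of three key components: (1) a Retrieval-Augmented Learner, which extracts enriched temporal features through memory bank interactions;
(2) a Vision-Augmented Learner, which encodes time series as informative images; and
(3) a Text-Augmented Learner, which generates contextual textual descriptions.
These components work with frozen pre-trained VLMs to produce multimodal embeddings, which are then fused with temporal features for the final prediction.
Extensive experiments show that Time-VLM achieves superior performance, particularly in few-shot and zero-shot scenarios, thereby establishing a new direction for multimodal time series forecasting.
Code is available at https://github.com/CityMind-Lab/ICML25-TimeVLM.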

Abstract (original)

Recent advancements in time series forecasting have explored augmenting models with text or vision modalities to improve accuracy. While text provides contextual understanding, it often lacks fine-grained temporal details. Conversely, vision captures intricate temporal patterns but lacks semantic context, limiting the complementary potential of these modalities. To address this, we propose Time-VLM, a novel multimodal framework that leverages pre-trained Vision-Language Models (VLMs) to bridge temporal, visual, and textual modalities for enhanced forecasting. Our framework comprises three key components: (1) a Retrieval-Augmented Learner, which extracts enriched temporal features through memory bank interactions; (2) a Vision-Augmented Learner, which encodes time series as informative images; and (3) a Text-Augmented Learner, which generates contextual textual descriptions. These components collaborate with frozen pre-trained VLMs to produce multimodal embeddings, which are then fused with temporal features for final prediction. Extensive experiments demonstrate that Time-VLM achieves superior performance, particularly in few-shot and zero-shot scenarios, thereby establishing a new direction for multimodal time series forecasting. Code is available at https://github.com/CityMind-Lab/ICML25-TimeVLM.
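
To make the three learner inputs more concrete, here is a minimal NumPy sketch of what each one might produce from a single univariate series. It is an illustration under stated assumptions, not the authors' implementation: the function names, the period-folding image encoding, the statistics-based text template, and the softmax read from a random memory bank are all hypothetical choices, and the frozen VLM and the final fusion step are left out entirely.

```python
import numpy as np


def series_to_image(x: np.ndarray, period: int) -> np.ndarray:
    """Fold a 1-D series into a 2-D grayscale image, one period per row.

    Assumption: period folding is just one plausible image encoding.
    """
    n_rows = len(x) // period
    if n_rows == 0:
        raise ValueError("series is shorter than one period")
    # Keep the most recent n_rows * period points and stack them row by row.
    folded = x[-n_rows * period:].reshape(n_rows, period)
    lo, hi = folded.min(), folded.max()
    scale = hi - lo if hi > lo else 1.0
    # Min-max scale to the 0..255 pixel range.
    return np.round((folded - lo) / scale * 255).astype(np.uint8)


def describe_series(x: np.ndarray) -> str:
    """Build a short textual description from basic statistics.

    Assumption: a hypothetical template, not the paper's prompt format.
    """
    slope = np.polyfit(np.arange(len(x)), x, 1)[0]
    trend = "upward" if slope > 0 else "downward" if slope < 0 else "flat"
    return (
        f"min {x.min():.2f}, max {x.max():.2f}, "
        f"mean {x.mean():.2f}, overall trend {trend}"
    )


def retrieve_from_memory(query: np.ndarray, memory: np.ndarray) -> np.ndarray:
    """Softmax-weighted read from a memory bank.

    query has shape (d,), memory has shape (m, d); the result has shape (d,).
    Assumption: scaled dot-product scoring stands in for the paper's
    memory bank interaction.
    """
    scores = memory @ query / np.sqrt(query.shape[0])
    weights = np.exp(scores - scores.max())  # subtract max for stability
    weights /= weights.sum()
    return weights @ memory


# Toy series: four daily cycles of 24 steps with a linear upward drift.
x = np.sin(np.linspace(0, 8 * np.pi, 96)) + 0.05 * np.arange(96)

image = series_to_image(x, period=24)           # array of shape (4, 24)
text = describe_series(x)                       # plain-text summary
memory = np.random.default_rng(0).normal(size=(8, 96))
enriched = retrieve_from_memory(x, memory)      # array of shape (96,)
```

In a full pipeline along the lines the abstract describes, `image` and `text` would be passed to a frozen pre-trained VLM, and the resulting multimodal embeddings would be fused with temporal features such as `enriched` before a prediction head.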

arXiv information

Authors	Siru Zhong, Weilin Ruan, Ming Jin, Huan Li, Qingsong Wen, Yuxuan Liang
Published	2025-05-26 14:45:18+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
