Temporally-Grounded Language Generation: A Benchmark for Real-Time Vision-Language Models

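Abstract

Vision-language models (VLMs) have shown remarkable progress in offline tasks such as image captioning and video question answering.
However, real-time interactive environments impose new demands on VLMs, requiring them to generate utterances that are not only semantically accurate but also precisely timed.
We identify two core capabilities necessary for such settings, $\textit{perceptual updating}$ and $\textit{contingency awareness}$, and propose a new benchmark task, $\textbf{Temporally-Grounded Language Generation (TGLG)}$, to evaluate them.
TGLG requires models to generate utterances in response to streaming video such that both content and timing align with the dynamic visual input.
To support this benchmark, we curate evaluation datasets from sports broadcasting and egocentric human interaction domains, and introduce a new metric, $\textbf{TRACE}$, which evaluates TGLG by jointly measuring semantic similarity and temporal alignment.
Finally, we present $\textbf{Vision-Language Model with Time-Synchronized Interleaving (VLM-TSI)}$, a model that interleaves visual and linguistic tokens in a time-synchronized manner, enabling real-time language generation without relying on turn-based assumptions.
Experimental results show that VLM-TSI significantly outperforms a strong baseline, yet overall performance remains modest, which highlights the difficulty of TGLG and motivates further research in real-time VLMs.
Code and data are available $\href{https://github.com/yukw777/tglg}{here}$.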

Abstract (original)

Vision-language models (VLMs) have shown remarkable progress in offline tasks such as image captioning and video question answering. However, real-time interactive environments impose new demands on VLMs, requiring them to generate utterances that are not only semantically accurate but also precisely timed. We identify two core capabilities necessary for such settings — $\textit{perceptual updating}$ and $\textit{contingency awareness}$ — and propose a new benchmark task, $\textbf{Temporally-Grounded Language Generation (TGLG)}$, to evaluate them. TGLG requires models to generate utterances in response to streaming video such that both content and timing align with dynamic visual input. To support this benchmark, we curate evaluation datasets from sports broadcasting and egocentric human interaction domains, and introduce a new metric, $\textbf{TRACE}$, to evaluate TGLG by jointly measuring semantic similarity and temporal alignment. Finally, we present $\textbf{Vision-Language Model with Time-Synchronized Interleaving (VLM-TSI)}$, a model that interleaves visual and linguistic tokens in a time-synchronized manner, enabling real-time language generation without relying on turn-based assumptions. Experimental results show that VLM-TSI significantly outperforms a strong baseline, yet overall performance remains modest — highlighting the difficulty of TGLG and motivating further research in real-time VLMs. Code and data available $\href{https://github.com/yukw777/tglg}{here}$.
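As a rough illustration of what interleaving visual and linguistic tokens "in a time-synchronized manner" could look like, the sketch below merges two timestamped streams into one sequence ordered by time. It is not the authors' VLM-TSI implementation: the `Event` type, the `interleave` function, the placeholder frame strings, the sample timestamps, and the tie-breaking rule (a frame precedes text with the same timestamp) are all assumptions made for this example.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical timestamped item in a streaming input."""
    time: float   # seconds since the start of the stream
    kind: str     # "frame" or "text"
    payload: str  # placeholder for frame features or a text token


def interleave(frames, utterances):
    """Merge two time-sorted event lists into one time-ordered sequence.

    Assumption: when a frame and a text token share a timestamp, the
    frame goes first, so the text follows the frame it could react to.
    """
    order = {"frame": 0, "text": 1}
    return list(
        heapq.merge(frames, utterances, key=lambda e: (e.time, order[e.kind]))
    )


# Illustrative data only: one frame every 0.5 s and three timed tokens.
frames = [Event(i * 0.5, "frame", f"<frame_{i}>") for i in range(4)]
utterances = [
    Event(0.5, "text", "He"),
    Event(0.7, "text", "shoots"),
    Event(1.5, "text", "Goal!"),
]

sequence = [e.payload for e in interleave(frames, utterances)]
print(sequence)
```

Because ordering depends only on timestamps, nothing in the merged sequence assumes alternating speaker turns; text simply appears wherever its timestamp places it relative to the frames.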

arXiv information

Authors	Keunwoo Peter Yu, Joyce Chai
Published	2025-05-16 14:48:30+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
