OLViT: Multi-Modal State Tracking via Attention-Based Embeddings for Video-Grounded Dialog

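Summary

We present the Object Language Video Transformer (OLViT), a new model for video dialog that operates over a multi-modal, attention-based dialog state tracker.
Existing video dialog models struggle with questions that require spatial and temporal localization within videos, long-term temporal reasoning, and accurate object tracking across multiple dialog turns.
OLViT addresses these challenges by maintaining a global dialog state built from the outputs of an Object State Tracker (OST) and a Language State Tracker (LST). The OST attends to the most important objects in the video, while the LST keeps track of the most important linguistic co-references to previous dialog turns.
In contrast to previous work, the approach is generic by nature and can therefore learn continuous multi-modal dialog state representations of the most relevant objects and rounds.
As a result, these representations can be seamlessly integrated into Large Language Models (LLMs) and offer high flexibility across different datasets and tasks.
Evaluations on the challenging DVD (response classification) and SIMMC 2.1 (response generation) datasets show that OLViT achieves new state-of-the-art performance on both.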

Summary (original)

We present the Object Language Video Transformer (OLViT) – a novel model for video dialog operating over a multi-modal attention-based dialog state tracker. Existing video dialog models struggle with questions requiring both spatial and temporal localization within videos, long-term temporal reasoning, and accurate object tracking across multiple dialog turns. OLViT addresses these challenges by maintaining a global dialog state based on the output of an Object State Tracker (OST) and a Language State Tracker (LST): while the OST attends to the most important objects within the video, the LST keeps track of the most important linguistic co-references to previous dialog turns. In stark contrast to previous works, our approach is generic by nature and is therefore capable of learning continuous multi-modal dialog state representations of the most relevant objects and rounds. As a result, they can be seamlessly integrated into Large Language Models (LLMs) and offer high flexibility in dealing with different datasets and tasks. Evaluations on the challenging DVD (response classification) and SIMMC 2.1 (response generation) datasets show that OLViT achieves new state-of-the-art performance across both datasets.
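The abstract describes two attention-based trackers whose outputs are combined into one global dialog state. The sketch below illustrates that general idea only: a question vector attends separately over object embeddings and over embeddings of previous dialog turns, and the two pooled vectors are concatenated. It is not the authors' implementation. The function names (`attend`, `update_dialog_state`), the single-query scaled dot-product attention, the concatenation step, and the toy vectors are all assumptions made for illustration, since this page does not describe the actual architecture.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, items):
    """Scaled dot-product attention of one query vector over item vectors.

    Returns the attention weights and the weighted sum of the items.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * x for q, x in zip(query, item)) / scale for item in items]
    weights = softmax(scores)
    pooled = [
        sum(w * item[d] for w, item in zip(weights, items))
        for d in range(len(query))
    ]
    return weights, pooled


def update_dialog_state(question, object_embeddings, turn_embeddings):
    """Hypothetical state update: one attention pass per modality.

    The object pass stands in for an object state tracker, the turn pass
    for a language state tracker (both are assumptions, not the paper's code).
    """
    obj_weights, obj_state = attend(question, object_embeddings)
    turn_weights, lang_state = attend(question, turn_embeddings)
    return {
        "object_weights": obj_weights,
        "turn_weights": turn_weights,
        # Concatenation is an assumed way to form the global state.
        "state": obj_state + lang_state,
    }


# Toy 2-dimensional embeddings, invented for this example.
question = [1.0, 0.0]
objects = [[4.0, 0.0], [0.0, 4.0]]
turns = [[0.0, 1.0], [3.0, 0.0], [0.0, 2.0]]
result = update_dialog_state(question, objects, turns)
```

With these toy inputs, the first object and the second dialog turn receive the largest attention weights, because they align most closely with the question vector.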

arXiv information

Authors	Adnen Abdessaied, Manuel von Hochmeister, Andreas Bulling
Published	2024-02-20 17:00:59+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
