Ask2Loc: Learning to Locate Instructional Visual Answers by Asking Questions

Abstract

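Locating specific segments within an instructional video is an efficient way to acquire guiding knowledge.
In general, the task of retrieving video segments that contain both verbal explanations and visual demonstrations is known as visual answer localization (VAL).
However, users often need multiple interactions with a system before they obtain answers that match their expectations.
During these interactions, humans deepen their understanding of the video content by asking themselves questions, and thereby identify the location accurately.
We therefore propose a new task, named In-VAL, to simulate the multiple interactions between humans and videos in the process of obtaining visual answers.
The In-VAL task requires interactively addressing several semantic gap issues, including 1) the ambiguity of user intent in input questions, 2) the incompleteness of language in video subtitles, and 3) the fragmentation of content in video segments.
To address these issues, we propose Ask2Loc, a framework for resolving In-VAL by asking questions.
It includes three key modules: 1) a chatting module that refines the initial question and uncovers clear intent, 2) a rewriting module that generates fluent language and creates complete descriptions, and 3) a searching module that broadens local context and provides integrated content.
We conduct extensive experiments on three reconstructed In-VAL datasets.
Compared to traditional end-to-end and two-stage methods, the proposed Ask2Loc improves performance by up to 14.91 (mIoU) on the In-VAL task.
The code and datasets are available at https://github.com/changzong/Ask2Loc.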

Abstract (original)

Locating specific segments within an instructional video is an efficient way to acquire guiding knowledge. Generally, the task of obtaining video segments for both verbal explanations and visual demonstrations is known as visual answer localization (VAL). However, users often need multiple interactions to obtain answers that align with their expectations when using the system. During these interactions, humans deepen their understanding of the video content by asking themselves questions, thereby accurately identifying the location. Therefore, we propose a new task, named In-VAL, to simulate the multiple interactions between humans and videos in the procedure of obtaining visual answers. The In-VAL task requires interactively addressing several semantic gap issues, including 1) the ambiguity of user intent in the input questions, 2) the incompleteness of language in video subtitles, and 3) the fragmentation of content in video segments. To address these issues, we propose Ask2Loc, a framework for resolving In-VAL by asking questions. It includes three key modules: 1) a chatting module to refine initial questions and uncover clear intentions, 2) a rewriting module to generate fluent language and create complete descriptions, and 3) a searching module to broaden local context and provide integrated content. We conduct extensive experiments on three reconstructed In-VAL datasets. Compared to traditional end-to-end and two-stage methods, our proposed Ask2Loc can improve performance by up to 14.91 (mIoU) on the In-VAL task. Our code and datasets can be accessed at https://github.com/changzong/Ask2Loc.
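
The abstract reports its gain in mIoU, the mean intersection-over-union between predicted and ground-truth video segments. The sketch below shows one common way to compute this metric for temporal segments. It is not taken from the Ask2Loc repository: the names `Segment`, `temporal_iou`, and `mean_iou` are hypothetical, segments are assumed to be (start, end) pairs in seconds, and the result is a fraction in [0, 1], whereas the 14.91 figure is presumably stated on a 0 to 100 scale.

```python
from typing import List, Tuple

# Hypothetical representation: (start_seconds, end_seconds) of a video segment.
Segment = Tuple[float, float]


def temporal_iou(pred: Segment, gold: Segment) -> float:
    """Intersection-over-union of two time intervals, in [0, 1]."""
    # Overlap is the distance between the later start and the earlier end,
    # clamped at zero when the intervals do not intersect.
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds: List[Segment], golds: List[Segment]) -> float:
    """Average temporal IoU over paired predictions and ground truths."""
    if len(preds) != len(golds):
        raise ValueError("preds and golds must have the same length")
    if not preds:
        return 0.0
    return sum(temporal_iou(p, g) for p, g in zip(preds, golds)) / len(preds)
```

For instance, a prediction of (10, 20) against a ground truth of (15, 25) overlaps for 5 seconds out of a 15-second union, so its IoU is one third.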

arXiv information

Authors	Chang Zong, Bin Li, Shoujun Zhou, Jian Wan, Lei Zhang
Published	2025-04-23 03:01:06+00:00

Source and services used

arxiv.jp, Google
