EchoTrack: Auditory Referring Multi-Object Tracking for Autonomous Driving

Summary

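This paper introduces the task of Auditory Referring Multi-Object Tracking (AR-MOT), which dynamically tracks specific objects in a video sequence based on audio expressions and poses a challenging problem in autonomous driving.
Because audio and video lack semantic modeling capacity, existing work has focused mainly on text-based multi-object tracking, often at the cost of tracking quality, interaction efficiency, and even the safety of assistance systems, which limits the application of such methods in autonomous driving.
In this paper, we examine the AR-MOT problem from the perspectives of audio-video fusion and audio-video tracking.
We propose EchoTrack, an end-to-end AR-MOT framework with dual-stream vision transformers.
The two streams are intertwined through our Bidirectional Frequency-domain Cross-attention Fusion Module (Bi-FCFM), which bidirectionally fuses audio and video features from both the frequency and spatiotemporal domains.
We further propose the Audio-visual Contrastive Tracking Learning (ACTL) regime, which extracts homogeneous semantic features between expressions and visual objects by effectively learning homogeneous features across different audio and video objects.
Beyond the architectural design, we establish the first set of large-scale AR-MOT benchmarks: Echo-KITTI, Echo-KITTI+, and Echo-BDD.
Extensive experiments on these benchmarks demonstrate the effectiveness of the proposed EchoTrack model and its components.
The source code and datasets will be made publicly available at https://github.com/lab206/EchoTrack.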
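
The summary describes Bi-FCFM only at a high level, so the following is a minimal sketch of the general idea of bidirectional cross-attention between two feature sequences, not the authors' module. It assumes plain scaled dot-product attention without learned projections and a residual sum, and it leaves out the frequency-domain transform entirely because the abstract does not say how it is applied. The names `cross_attend` and `bidirectional_fuse` are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Replace each query vector with a softmax-weighted average of keys_values.

    Assumption: no learned Q/K/V projections; both inputs share dimension d.
    """
    d = len(queries[0])
    out = []
    for q in queries:
        # Scaled dot-product score between this query and every key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in keys_values]
        weights = softmax(scores)
        out.append([sum(w * kv[j] for w, kv in zip(weights, keys_values))
                    for j in range(d)])
    return out


def bidirectional_fuse(audio, video):
    """Fuse in both directions: audio attends to video, and video to audio.

    Each stream keeps its own length and adds the attended context as a
    residual (an illustrative choice, not taken from the paper).
    """
    audio_ctx = cross_attend(audio, video)
    video_ctx = cross_attend(video, audio)
    fused_audio = [[x + y for x, y in zip(a, c)]
                   for a, c in zip(audio, audio_ctx)]
    fused_video = [[x + y for x, y in zip(v, c)]
                   for v, c in zip(video, video_ctx)]
    return fused_audio, fused_video
```

Calling `bidirectional_fuse(audio, video)` with two lists of equal-width feature vectors returns one fused sequence per stream, each the same length as its input, so both streams can continue through their own later layers.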

Abstract (original)

This paper introduces the task of Auditory Referring Multi-Object Tracking (AR-MOT), which dynamically tracks specific objects in a video sequence based on audio expressions and appears as a challenging problem in autonomous driving. Due to the lack of semantic modeling capacity in audio and video, existing works have mainly focused on text-based multi-object tracking, which often comes at the cost of tracking quality, interaction efficiency, and even the safety of assistance systems, limiting the application of such methods in autonomous driving. In this paper, we delve into the problem of AR-MOT from the perspective of audio-video fusion and audio-video tracking. We put forward EchoTrack, an end-to-end AR-MOT framework with dual-stream vision transformers. The dual streams are intertwined with our Bidirectional Frequency-domain Cross-attention Fusion Module (Bi-FCFM), which bidirectionally fuses audio and video features from both frequency- and spatiotemporal domains. Moreover, we propose the Audio-visual Contrastive Tracking Learning (ACTL) regime to extract homogeneous semantic features between expressions and visual objects by learning homogeneous features between different audio and video objects effectively. Aside from the architectural design, we establish the first set of large-scale AR-MOT benchmarks, including Echo-KITTI, Echo-KITTI+, and Echo-BDD. Extensive experiments on the established benchmarks demonstrate the effectiveness of the proposed EchoTrack model and its components. The source code and datasets will be made publicly available at https://github.com/lab206/EchoTrack.
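
The abstract does not state the ACTL objective, so the sketch below shows only a generic symmetric InfoNCE-style contrastive loss, a common way to align embeddings from two modalities. Treat it as an assumption rather than the paper's loss. The function name `symmetric_info_nce`, the pairing convention (row i of each input is a matching pair), and the temperature value are all illustrative.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def log_softmax_at(logits, idx):
    """Log-probability of entry idx under a softmax over logits."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return logits[idx] - log_sum


def symmetric_info_nce(audio_emb, visual_emb, temperature=0.07):
    """Contrastive loss where audio_emb[i] and visual_emb[i] are positives.

    Every other pairing in the batch acts as a negative. The loss is the
    average of the audio-to-visual and visual-to-audio directions.
    """
    a = [l2_normalize(v) for v in audio_emb]
    v = [l2_normalize(x) for x in visual_emb]
    n = len(a)
    # Cosine similarity matrix scaled by the temperature.
    sim = [[sum(p * q for p, q in zip(a[i], v[j])) / temperature
            for j in range(n)] for i in range(n)]
    loss_a2v = -sum(log_softmax_at(sim[i], i) for i in range(n)) / n
    loss_v2a = -sum(log_softmax_at([sim[i][j] for i in range(n)], j)
                    for j in range(n)) / n
    return 0.5 * (loss_a2v + loss_v2a)
```

The loss is small when each audio embedding is closest to its own visual embedding and grows when the pairs are mismatched, which is the behavior a contrastive alignment objective is meant to reward.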

arXiv information

Authors	Jiacheng Lin, Jiajun Chen, Kunyu Peng, Xuan He, Zhiyong Li, Rainer Stiefelhagen, Kailun Yang
Published	2024-02-28 12:50:16+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
