Unifying Top-down and Bottom-up Scanpath Prediction Using Transformers

Abstract

Most models of visual attention aim at predicting either top-down or bottom-up control, as studied using different visual search and free-viewing tasks. In this paper we propose the Human Attention Transformer (HAT), a single model that predicts both forms of attention control. HAT uses a novel transformer-based architecture and a simplified foveated retina that collectively create a spatio-temporal awareness akin to the dynamic visual working memory of humans. HAT not only establishes a new state-of-the-art in predicting the scanpath of fixations made during target-present and target-absent visual search and “taskless” free viewing, but also makes human gaze behavior interpretable. Unlike previous methods that rely on a coarse grid of fixation cells and experience information loss due to fixation discretization, HAT features a sequential dense prediction architecture and outputs a dense heatmap for each fixation, thus avoiding discretizing fixations. HAT sets a new standard in computational attention, which emphasizes effectiveness, generality, and interpretability. HAT’s demonstrated scope and applicability will likely inspire the development of new attention models that can better predict human behavior in various attention-demanding scenarios. Code is available at https://github.com/cvlab-stonybrook/HAT.
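The information loss that the abstract attributes to fixation discretization can be illustrated with a small sketch. It is not taken from the HAT code base: the image size (512x320), the coarse grid (32x20 cells), and the helper names `snap_to_grid` and `discretization_error` are all hypothetical, chosen only to show how far a fixation can move when it is replaced by the centre of its grid cell, compared with a prediction made at pixel resolution.

```python
import math


def snap_to_grid(x, y, width, height, cols, rows):
    """Return the centre of the grid cell that contains pixel (x, y).

    Hypothetical helper: mimics a model that can only predict one of
    cols * rows fixation cells instead of a dense per-pixel map.
    """
    cell_w = width / cols
    cell_h = height / rows
    # Clamp so that points on the right/bottom border fall in the last cell.
    col = min(int(x // cell_w), cols - 1)
    row = min(int(y // cell_h), rows - 1)
    return ((col + 0.5) * cell_w, (row + 0.5) * cell_h)


def discretization_error(fixation, width, height, cols, rows):
    """Euclidean distance (in pixels) between a fixation and its cell centre."""
    gx, gy = snap_to_grid(fixation[0], fixation[1], width, height, cols, rows)
    return math.hypot(fixation[0] - gx, fixation[1] - gy)


if __name__ == "__main__":
    # Assumed values for illustration only (not from the paper).
    width, height = 512, 320
    fixation = (100, 50)

    coarse = discretization_error(fixation, width, height, cols=32, rows=20)
    dense = discretization_error(fixation, width, height, cols=width, rows=height)

    print(f"coarse 32x20 grid error: {coarse:.2f} px")
    print(f"pixel-resolution error:  {dense:.2f} px")
```

With a coarse grid, the worst-case displacement is half the cell diagonal, whereas a map at pixel resolution keeps the error below one pixel. This only illustrates the general trade-off and makes no claim about how HAT computes its heatmaps.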

arXiv information

Authors	Zhibo Yang, Sounak Mondal, Seoyoung Ahn, Ruoyu Xue, Gregory Zelinsky, Minh Hoai, Dimitris Samaras
Published	2024-03-30 18:22:34+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
