Contextual Encoder-Decoder Network for Visual Saliency Prediction

Abstract

Predicting salient regions in natural images requires the detection of objects that are present in a scene. To develop robust representations for this challenging task, high-level visual features at multiple spatial scales must be extracted and augmented with contextual information. However, existing models aimed at explaining human fixation maps do not incorporate such a mechanism explicitly. Here we propose an approach based on a convolutional neural network pre-trained on a large-scale image classification task. The architecture forms an encoder-decoder structure and includes a module with multiple convolutional layers at different dilation rates to capture multi-scale features in parallel. Moreover, we combine the resulting representations with global scene information to predict visual saliency accurately. Our model achieves competitive and consistent results across multiple evaluation metrics on two public saliency benchmarks, and we demonstrate the effectiveness of the suggested approach on five datasets and selected examples. Compared to state-of-the-art approaches, the network is based on a lightweight image classification backbone and is therefore a suitable choice for estimating human fixations across complex natural scenes in applications with limited computational resources, such as (virtual) robotic systems.
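The abstract's idea of capturing multi-scale features in parallel and combining them with global scene information can be illustrated with a minimal sketch. The code below is not the authors' implementation: the function names, the 3x3 box kernel, the dilation rates (1, 2, 4), and the use of a global mean as the scene-level signal are all assumptions chosen for illustration. It is written in plain Python so the sampling pattern of a dilated convolution is easy to follow.

```python
# Illustrative sketch only. All names, the kernel, and the dilation rates are
# assumptions for demonstration; they are not taken from the paper.

def dilated_conv2d(image, kernel, rate):
    """Same-size 2D cross-correlation with zero padding and a dilation rate."""
    height, width = len(image), len(image[0])
    k = len(kernel)          # square kernel with odd side length
    half = k // 2
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total = 0.0
            for i in range(k):
                for j in range(k):
                    # Kernel taps are spaced `rate` pixels apart.
                    yy = y + (i - half) * rate
                    xx = x + (j - half) * rate
                    if 0 <= yy < height and 0 <= xx < width:
                        total += kernel[i][j] * image[yy][xx]
            out[y][x] = total
    return out


def global_context(image):
    """Broadcast the global mean of the feature map to every position."""
    height, width = len(image), len(image[0])
    mean = sum(sum(row) for row in image) / (height * width)
    return [[mean] * width for _ in range(height)]


def multi_scale_features(image, kernel, rates=(1, 2, 4)):
    """Run parallel dilated branches plus a global branch; one map per branch."""
    branches = [dilated_conv2d(image, kernel, r) for r in rates]
    branches.append(global_context(image))
    return branches


# A 5x5 feature map with a single active unit in the centre.
feature_map = [[0.0] * 5 for _ in range(5)]
feature_map[2][2] = 1.0
box_kernel = [[1.0] * 3 for _ in range(3)]

features = multi_scale_features(feature_map, box_kernel)
```

Each branch sees the same input with a different receptive field: the branch with rate 1 responds only in the immediate neighbourhood of the active unit, while the branches with larger rates respond at positions further away. In a trained network the branch outputs would be concatenated along the channel axis and passed to the decoder; here they are simply returned as a list.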

arXiv information

Authors	Alexander Kroner, Mario Senden, Kurt Driessens, Rainer Goebel
Publication date	2024-04-05 13:03:08+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
