Dilated Convolution with Learnable Spacings makes visual models more aligned with humans: a Grad-CAM study

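Summary

Dilated Convolution with Learnable Spacings (DCLS) is a recent convolution method that, like dilated convolution, enlarges the receptive field (RF) without increasing the number of parameters, but does so without imposing a regular grid.
DCLS has been shown to outperform standard and dilated convolutions on several computer vision benchmarks.
Here, we further show that DCLS increases model interpretability, defined as alignment with human visual strategies.
To quantify this, we use the Spearman correlation between the models' Grad-CAM heatmaps and the ClickMe dataset heatmaps, which reflect human visual attention.
We took eight reference models, ResNet50, ConvNeXt (T, S, and B), CAFormer, ConvFormer, and FastViT (sa 24 and 36), and drop-in replaced their standard convolution layers with DCLS layers.
This improved the interpretability score in seven of the eight.
We also observed that Grad-CAM generated random heatmaps for two of the models, CAFormer and ConvFormer, which led to low interpretability scores.
We addressed this issue by introducing Threshold-Grad-CAM, a modification built on top of Grad-CAM that enhanced interpretability across nearly all models.
The code and checkpoints to reproduce this study are available at https://github.com/rabihchamas/DCLS-GradCAM-Eval.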

Summary (original)

Dilated Convolution with Learnable Spacing (DCLS) is a recent advanced convolution method that allows enlarging the receptive fields (RF) without increasing the number of parameters, like the dilated convolution, yet without imposing a regular grid. DCLS has been shown to outperform the standard and dilated convolutions on several computer vision benchmarks. Here, we show that, in addition, DCLS increases the models’ interpretability, defined as the alignment with human visual strategies. To quantify it, we use the Spearman correlation between the models’ GradCAM heatmaps and the ClickMe dataset heatmaps, which reflect human visual attention. We took eight reference models – ResNet50, ConvNeXt (T, S and B), CAFormer, ConvFormer, and FastViT (sa 24 and 36) – and drop-in replaced the standard convolution layers with DCLS ones. This improved the interpretability score in seven of them. Moreover, we observed that Grad-CAM generated random heatmaps for two models in our study: CAFormer and ConvFormer models, leading to low interpretability scores. We addressed this issue by introducing Threshold-Grad-CAM, a modification built on top of Grad-CAM that enhanced interpretability across nearly all models. The code and checkpoints to reproduce this study are available at: https://github.com/rabihchamas/DCLS-GradCAM-Eval.
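The interpretability score described above is a Spearman (rank) correlation between a model heatmap and a human attention heatmap. The following is a minimal sketch of that metric in plain Python, not the authors' evaluation code. It assumes both heatmaps are already the same size and are given as nested lists; the function names (`average_ranks`, `pearson`, `heatmap_spearman`) are hypothetical, and any resizing or preprocessing used in the study is not covered here. In practice, `scipy.stats.spearmanr` on the flattened arrays would be the usual choice.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values equal to values[order[i]].
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (NaN if constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / sqrt(var_x * var_y)


def heatmap_spearman(model_map, human_map):
    """Spearman correlation between two same-sized 2D heatmaps."""
    flat_model = [v for row in model_map for v in row]
    flat_human = [v for row in human_map for v in row]
    if len(flat_model) != len(flat_human):
        raise ValueError("heatmaps must contain the same number of pixels")
    # Spearman correlation is the Pearson correlation of the ranks.
    return pearson(average_ranks(flat_model), average_ranks(flat_human))


# Toy 2x2 heatmaps (made-up values) whose pixels have the same rank order.
model_map = [[0.1, 0.4], [0.2, 0.9]]
human_map = [[0.0, 0.5], [0.3, 1.0]]
print(heatmap_spearman(model_map, human_map))  # 1.0
```

Because the metric depends only on rank order, it ignores the absolute scale of the two heatmaps, so a Grad-CAM map and a ClickMe map can be compared without first normalizing them to a common range.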

arXiv information

Authors	Rabih Chamas, Ismail Khalfaoui-Hassani, Timothee Masquelier
Published	2024-08-06 13:05:32+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
