Scaling Vision Transformers to Gigapixel Images via Hierarchical Self-Supervised Learning

Summary

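Vision Transformers (ViTs) and their multi-scale and hierarchical variants have been successful at capturing image representations, but their use has generally been studied on low-resolution images (e.g., 256×256, 384×384). In gigapixel whole-slide imaging (WSI) for computational pathology, a WSI can reach 150,000×150,000 pixels at 20× magnification and exhibits a hierarchical structure of visual tokens across resolutions, from 16×16 images that capture spatial patterns among cells to 4096×4096 images that characterize interactions within the tissue microenvironment. HIPT (Hierarchical Image Pyramid Transformer) leverages the natural hierarchical structure inherent in WSIs and uses two levels of self-supervised learning to learn high-resolution image representations. HIPT is pretrained across 33 cancer types using 10,678 gigapixel WSIs, 408,218 4096×4096 images, and 104M 256×256 images. HIPT representations are benchmarked on 9 slide-level tasks, demonstrating that: 1) HIPT with hierarchical pretraining outperforms current state-of-the-art methods for cancer subtyping and survival prediction; 2) self-supervised ViTs can model important inductive biases about the hierarchical structure of phenotypes in the tumor microenvironment.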

Summary (original)

Vision Transformers (ViTs) and their multi-scale and hierarchical variations have been successful at capturing image representations, but their use has been generally studied for low-resolution images (e.g., 256×256, 384×384). For gigapixel whole-slide imaging (WSI) in computational pathology, WSIs can be as large as 150000×150000 pixels at 20X magnification and exhibit a hierarchical structure of visual tokens across varying resolutions: from 16×16 images capturing spatial patterns among cells, to 4096×4096 images characterizing interactions within the tissue microenvironment. We introduce a new ViT architecture called the Hierarchical Image Pyramid Transformer (HIPT), which leverages the natural hierarchical structure inherent in WSIs using two levels of self-supervised learning to learn high-resolution image representations. HIPT is pretrained across 33 cancer types using 10,678 gigapixel WSIs, 408,218 4096×4096 images, and 104M 256×256 images. We benchmark HIPT representations on 9 slide-level tasks, and demonstrate that: 1) HIPT with hierarchical pretraining outperforms current state-of-the-art methods for cancer subtyping and survival prediction, 2) self-supervised ViTs are able to model important inductive biases about the hierarchical structure of phenotypes in the tumor microenvironment.
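The image sizes quoted in the abstract imply a simple token pyramid, and working through the arithmetic shows why a hierarchy helps. The sketch below is only an illustration of that arithmetic, not the authors' code: it assumes non-overlapping square tiling at every level (the abstract does not state this), and the helper `tokens_per_side` and all variable names are hypothetical.

```python
import math


def tokens_per_side(image_size: int, token_size: int) -> int:
    """Number of non-overlapping tokens along one side of a square image."""
    if image_size % token_size != 0:
        raise ValueError("image_size must be a multiple of token_size")
    return image_size // token_size


# Sizes quoted in the abstract (pixels per side).
CELL_TOKEN = 16      # 16x16 images: spatial patterns among cells
PATCH = 256          # 256x256 images
REGION = 4096        # 4096x4096 images: tissue microenvironment
WSI_SIDE = 150_000   # upper-end whole-slide image size at 20X

# Assumption: each level is tiled without overlap by the level below it.
cells_per_patch = tokens_per_side(PATCH, CELL_TOKEN) ** 2
patches_per_region = tokens_per_side(REGION, PATCH) ** 2

# A slide side is generally not a multiple of 4096, so round up.
regions_per_wsi = math.ceil(WSI_SIDE / REGION) ** 2

# For comparison: one flat sequence of 16x16 tokens over the whole slide.
flat_tokens = (WSI_SIDE // CELL_TOKEN) ** 2

print(f"16x16 tokens per 256x256 patch:     {cells_per_patch}")
print(f"256x256 patches per 4096x4096 area: {patches_per_region}")
print(f"4096x4096 regions per slide (max):  {regions_per_wsi}")
print(f"flat 16x16 tokens per slide:        {flat_tokens:,}")
```

Under these assumptions each level of the pyramid sees a sequence of at most a few hundred to about a thousand tokens, whereas a single flat sequence over the largest slide would contain tens of millions. As a rough consistency check, 256 patches per region times the 408,218 regions reported in the abstract gives about 104.5M patches, in line with the quoted 104M figure.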

arXiv information

Authors	Richard J. Chen, Chengkuan Chen, Yicong Li, Tiffany Y. Chen, Andrew D. Trister, Rahul G. Krishnan, Faisal Mahmood
Published	2022-06-06 14:35:14+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
