Token Turing Machines are Efficient Vision Models

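Abstract

We propose Vision Token Turing Machines (ViTTM), an efficient, low-latency, memory-augmented Vision Transformer (ViT).
Our approach builds on Neural Turing Machines and Token Turing Machines, which were applied to NLP and sequential visual understanding tasks.
ViTTM is designed for non-sequential computer vision tasks such as image classification and segmentation.
Our model creates two sets of tokens: process tokens and memory tokens.
Process tokens pass through the encoder blocks and can read from and write to the memory tokens at each encoder block in the network, which lets them store information in memory and retrieve it.
By ensuring that there are fewer process tokens than memory tokens, we can reduce the network's inference time while maintaining its accuracy.
On ImageNet-1K, the state-of-the-art ViT-B has a median latency of 529.5 ms and 81.0% accuracy, whereas our ViTTM-B is 56% faster (234.1 ms), uses 2.4 times fewer FLOPs, and reaches an accuracy of 82.9%.
On ADE20K semantic segmentation, ViT-B achieves 45.65 mIoU at 13.8 frames per second (FPS), whereas our ViTTM-B model achieves 45.17 mIoU at 26.8 FPS (+94%).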
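The read/write pattern described above can be illustrated with a small, dependency-free sketch. The abstract does not say how reading and writing are implemented, so everything here is an assumption made for illustration: plain softmax cross-attention without learned projections, residual addition, and a read, encode, write order within each block. The names `cross_attend`, `vittm_like_block`, and `encoder` are hypothetical and are not taken from the paper or its code.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Each query token becomes a softmax-weighted average of keys_values.

    Assumption: no learned projections; keys and values are the same tokens.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys_values]
        weights = softmax(scores)
        out.append([sum(w * kv[d] for w, kv in zip(weights, keys_values))
                    for d in range(dim)])
    return out


def add(xs, ys):
    """Element-wise sum of two equally shaped token lists."""
    return [[a + b for a, b in zip(x, y)] for x, y in zip(xs, ys)]


def vittm_like_block(process, memory, encoder):
    """One block: read from memory, encode the process tokens, write back.

    Assumption: the read -> encode -> write order and the residual updates
    are illustrative choices, not details confirmed by the abstract.
    """
    process = add(process, cross_attend(process, memory))  # read
    process = encoder(process)                             # heavy compute on few tokens
    memory = add(memory, cross_attend(memory, process))    # write
    return process, memory


# Usage: 4 process tokens, 16 memory tokens, 8 channels (all hypothetical sizes).
random.seed(0)
process_tokens = [[random.gauss(0, 1) for _ in range(8)] for _ in range(4)]
memory_tokens = [[random.gauss(0, 1) for _ in range(8)] for _ in range(16)]
identity_encoder = lambda tokens: tokens  # stand-in for a transformer encoder block
process_tokens, memory_tokens = vittm_like_block(
    process_tokens, memory_tokens, identity_encoder)
print(len(process_tokens), len(memory_tokens))
```

In this sketch the encoder only ever sees the smaller process-token set, which is where the abstract locates the latency savings; the memory tokens are touched only by the cheaper read and write steps.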

Abstract (original)

We propose Vision Token Turing Machines (ViTTM), an efficient, low-latency, memory-augmented Vision Transformer (ViT). Our approach builds on Neural Turing Machines and Token Turing Machines, which were applied to NLP and sequential visual understanding tasks. ViTTMs are designed for non-sequential computer vision tasks such as image classification and segmentation. Our model creates two sets of tokens: process tokens and memory tokens; process tokens pass through encoder blocks and read from and write to memory tokens at each encoder block in the network, allowing them to store and retrieve information from memory. By ensuring that there are fewer process tokens than memory tokens, we are able to reduce the inference time of the network while maintaining its accuracy. On ImageNet-1K, the state-of-the-art ViT-B has a median latency of 529.5 ms and 81.0% accuracy, while our ViTTM-B is 56% faster (234.1 ms), with 2.4 times fewer FLOPs and an accuracy of 82.9%. On ADE20K semantic segmentation, ViT-B achieves 45.65 mIoU at 13.8 frames per second (FPS), whereas our ViTTM-B model achieves 45.17 mIoU at 26.8 FPS (+94%).
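The headline percentages can be rechecked from the raw figures quoted in the abstract. The short script below is only an arithmetic check; the variable names are illustrative, and the interpretation of "56% faster" as a reduction in latency is inferred from the numbers rather than stated by the authors.

```python
# Figures quoted in the abstract.
vit_latency_ms, vittm_latency_ms = 529.5, 234.1   # ImageNet-1K median latency
vit_fps, vittm_fps = 13.8, 26.8                   # ADE20K throughput
vit_miou, vittm_miou = 45.65, 45.17               # ADE20K mIoU

latency_reduction = (vit_latency_ms - vittm_latency_ms) / vit_latency_ms
speedup = vit_latency_ms / vittm_latency_ms
fps_gain = vittm_fps / vit_fps - 1
miou_drop = vit_miou - vittm_miou

print(f"latency reduction: {latency_reduction:.1%}")  # 55.8%
print(f"speedup factor:    {speedup:.2f}x")           # 2.26x
print(f"FPS gain:          {fps_gain:.1%}")           # 94.2%
print(f"mIoU drop:         {miou_drop:.2f}")          # 0.48
```

The check suggests that "56% faster" refers to a roughly 56% lower latency (about a 2.26x speedup), while "+94%" is a relative gain in frames per second, obtained at a cost of about 0.48 mIoU.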

arXiv information

Authors	Purvish Jajal, Nick John Eliopoulos, Benjamin Shiue-Hal Chou, George K. Thiruvathukal, James C. Davis, Yung-Hsiang Lu
Published	2025-01-24 17:06:30+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
