Towards Vision Mixture of Experts for Wildlife Monitoring on the Edge

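Summary

The explosive growth of IoT sensors in industrial, consumer, and remote sensing use cases has created unprecedented demand for computing infrastructure to transmit and analyze petabytes of data.
At the same time, the world is gradually shifting its focus toward more sustainable computing.
For these reasons, there have been recent efforts to reduce the footprint of the computing infrastructure involved, particularly that of the deep learning algorithms used to generate advanced insights.
The "TinyML" community is actively proposing methods that save communication bandwidth and avoid excessive cloud storage costs while reducing inference latency and promoting data privacy.
Ideally, such approaches should process multiple types of data near the network edge, including time series, audio, satellite imagery, and video, because multiple data streams have been shown to improve the discriminative ability of learning algorithms, especially for fine-grained results.
Meanwhile, recent work on data-driven conditional computation of subnetworks has shown real progress in using a single model that shares parameters across very different input types such as images and text, reducing the computation requirements of multi-tower multimodal networks.
Inspired by this line of work, we explore, for the first time, similar per-patch conditional computation for mobile vision transformers (the vision-only case), with the eventual goal of using it in single-tower multimodal edge models.
We evaluate the model on Cornell Sap Sucker Woods 60 (SSW60), a fine-grained bird species discrimination dataset.
In our initial experiments, the model uses $4X$ fewer parameters than MobileViTV2-1.0, with a $1$% drop in accuracy on the iNaturalist '21 birds test data provided as part of the SSW60 dataset.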

Abstract (original)

The explosion of IoT sensors in industrial, consumer and remote sensing use cases has come with unprecedented demand for computing infrastructure to transmit and to analyze petabytes of data. Concurrently, the world is slowly shifting its focus towards more sustainable computing. For these reasons, there has been a recent effort to reduce the footprint of related computing infrastructure, especially by deep learning algorithms, for advanced insight generation. The 'TinyML' community is actively proposing methods to save communication bandwidth and excessive cloud storage costs while reducing algorithm inference latency and promoting data privacy. Such proposed approaches should ideally process multiple types of data, including time series, audio, satellite images, and video, near the network edge, as multiple data streams have been shown to improve the discriminative ability of learning algorithms, especially for generating fine-grained results. Incidentally, there has been recent work on data-driven conditional computation of subnetworks that has shown real progress in using a single model to share parameters among very different types of inputs such as images and text, reducing the computation requirement of multi-tower multimodal networks. Inspired by this line of work, we explore similar per-patch conditional computation for the first time for mobile vision transformers (vision-only case), which will eventually be used for single-tower multimodal edge models. We evaluate the model on Cornell Sap Sucker Woods 60, a fine-grained bird species discrimination dataset. Our initial experiments use $4X$ fewer parameters compared to MobileViTV2-1.0 with a $1$% accuracy drop on the iNaturalist '21 birds test data provided as part of the SSW60 dataset.
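
To make "per-patch conditional computation" concrete, here is a minimal sketch of top-1 mixture-of-experts routing over image patches. It is an illustration only and not the authors' architecture: the class name `PatchMoE`, the linear router, the linear experts, the dimensions, and the random weights are all assumptions introduced here, and the abstract does not specify how routing is done in the paper.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng):
    """Small random weights; a stand-in for trained parameters."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


class PatchMoE:
    """Hypothetical top-1 mixture-of-experts layer applied per patch."""

    def __init__(self, dim, num_experts, seed=0):
        rng = random.Random(seed)
        # Router: one score per expert for each patch embedding.
        self.router = random_matrix(num_experts, dim, rng)
        # Experts: here each is a single dim x dim linear map (an assumption).
        self.experts = [random_matrix(dim, dim, rng) for _ in range(num_experts)]

    def forward(self, patches):
        outputs, assignments = [], []
        for patch in patches:
            gates = softmax(matvec(self.router, patch))
            # Top-1 routing: only the highest-scoring expert runs for this patch.
            best = max(range(len(gates)), key=gates.__getitem__)
            expert_out = matvec(self.experts[best], patch)
            # Scale by the gate value so the router's choice affects the output.
            outputs.append([gates[best] * v for v in expert_out])
            assignments.append(best)
        return outputs, assignments


# Usage: 16 patch embeddings of dimension 8, routed among 4 experts.
rng = random.Random(1)
patches = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(16)]
layer = PatchMoE(dim=8, num_experts=4)
outputs, assignments = layer.forward(patches)
```

In this sketch each patch passes through the router and exactly one expert, so the computation applied depends on the patch content. Real mixture-of-experts layers typically add load-balancing terms and batch the routing, both of which are omitted here.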

arXiv information

Authors	Emmanuel Azuh Mensah, Anderson Lee, Haoran Zhang, Yitong Shan, Kurtis Heimerl
Published	2024-11-12 14:36:06+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
