Text-to-Events: Synthetic Event Camera Streams from Conditional Text Input

Summary

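Event cameras are advantageous for tasks that require vision sensors with low latency and sparse output responses.
However, the development of deep network algorithms for event cameras has been slow because large labelled event camera datasets for network training are lacking.
This paper reports a method for creating new labelled event datasets with a text-to-X model, where X is one or more output modalities (events, in this work).
The proposed text-to-events model generates synthetic event frames directly from text prompts.
It uses an autoencoder trained to produce sparse event frames that represent event camera outputs.
By combining the pretrained autoencoder with a diffusion model architecture, the new text-to-events model can generate smooth synthetic event streams of moving objects.
The autoencoder was first trained on an event camera dataset of diverse scenes.
The combined training with the diffusion model used the DVS gesture dataset.
We show that the model can generate realistic event sequences of human gestures prompted by different text statements.
The classification accuracy of the generated sequences, measured with a classifier trained on the real dataset, ranges from 42% to 92% depending on the gesture group.
The results demonstrate the capability of this method for synthesizing event datasets.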
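
The summary does not say how raw events are binned into frames. As a minimal sketch, assuming events arrive as rows of (x, y, t, polarity) and that a frame is a two-channel per-pixel event count over an equal-length time window, a sparse event frame stack could be built as below. The function name `events_to_frames` and its parameters are illustrative and do not come from the paper.

```python
import numpy as np

def events_to_frames(events, height, width, num_frames):
    """Accumulate (x, y, t, polarity) events into per-window count frames.

    Assumed layout (not from the paper): output shape is
    (num_frames, 2, height, width), channel 0 = OFF events, channel 1 = ON events.
    """
    frames = np.zeros((num_frames, 2, height, width), dtype=np.int32)
    if len(events) == 0:
        return frames

    x = events[:, 0].astype(np.int64)
    y = events[:, 1].astype(np.int64)
    t = events[:, 2].astype(np.float64)
    p = (events[:, 3] > 0).astype(np.int64)  # 1 for ON, 0 for OFF

    # Map each timestamp to one of num_frames equal-length time windows.
    t0 = t.min()
    span = max(t.max() - t0, 1e-9)  # avoid division by zero for a single timestamp
    idx = np.minimum(((t - t0) / span * num_frames).astype(np.int64), num_frames - 1)

    # Unbuffered add, so repeated events at the same pixel are all counted.
    np.add.at(frames, (idx, p, y, x), 1)
    return frames

# Usage: four events on a 4x4 sensor, split into two time windows.
events = np.array([
    [0, 0, 0.0, 1],
    [1, 2, 0.5, 0],
    [1, 2, 0.9, 0],
    [3, 1, 1.0, 1],
])
frames = events_to_frames(events, height=4, width=4, num_frames=2)
print(frames.shape, np.count_nonzero(frames))
```

Most entries of such a stack are zero, which is the sparsity the autoencoder is described as reproducing.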

Abstract (original)

Event cameras are advantageous for tasks that require vision sensors with low-latency and sparse output responses. However, the development of deep network algorithms using event cameras has been slow because of the lack of large labelled event camera datasets for network training. This paper reports a method for creating new labelled event datasets by using a text-to-X model, where X is one or multiple output modalities, in the case of this work, events. Our proposed text-to-events model produces synthetic event frames directly from text prompts. It uses an autoencoder which is trained to produce sparse event frames representing event camera outputs. By combining the pretrained autoencoder with a diffusion model architecture, the new text-to-events model is able to generate smooth synthetic event streams of moving objects. The autoencoder was first trained on an event camera dataset of diverse scenes. In the combined training with the diffusion model, the DVS gesture dataset was used. We demonstrate that the model can generate realistic event sequences of human gestures prompted by different text statements. The classification accuracy of the generated sequences, using a classifier trained on the real dataset, ranges between 42% to 92%, depending on the gesture group. The results demonstrate the capability of this method in synthesizing event datasets.
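
The abstract reports accuracy per gesture group rather than as a single number. As a rough sketch of how such a breakdown could be computed, assuming each generated sequence has an intended label (from its text prompt) and a label predicted by a classifier trained on real data, the helper below averages correctness within each group. The function `per_group_accuracy`, the label names, and the grouping are hypothetical and are not taken from the paper.

```python
from collections import defaultdict

def per_group_accuracy(true_labels, predicted_labels, group_of):
    """Return {group: accuracy}, where group_of maps a true label to its group."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(true_labels, predicted_labels):
        group = group_of[truth]
        total[group] += 1
        correct[group] += int(truth == pred)
    return {group: correct[group] / total[group] for group in total}

# Usage with made-up labels and groups.
true_labels = ["wave_left", "wave_right", "clap", "clap"]
predicted_labels = ["wave_left", "clap", "clap", "clap"]
group_of = {"wave_left": "wave", "wave_right": "wave", "clap": "clap"}
print(per_group_accuracy(true_labels, predicted_labels, group_of))
```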

arXiv information

Authors	Joachim Ott, Zuowen Wang, Shih-Chii Liu
Published	2024-06-05 16:34:12+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
