A Touch, Vision, and Language Dataset for Multimodal Alignment

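Summary

Touch is an important sensing modality for humans, but it has not yet been incorporated into multimodal generative language models.
This is partly because natural-language labels for tactile data are hard to obtain, and partly because aligning tactile readings with both visual observations and language descriptions is complex.
As a step toward closing this gap, the work introduces a new dataset of 44K in-the-wild vision-touch pairs, with English labels annotated by humans (10%) and textual pseudo-labels from GPT-4V (90%).
The dataset is used to train a vision-language-aligned tactile encoder for open-vocabulary classification, and a touch-vision-language (TVL) model that uses the trained encoder for text generation.
The results suggest that, by incorporating touch, the TVL model improves touch-vision-language alignment (+29% classification accuracy) over existing models trained on any pair of these modalities.
Although only a small fraction of the dataset is human-labeled, the TVL model shows better visual-tactile understanding than GPT-4V (+12%) and open-source vision-language models (+32%) on a new touch-vision understanding benchmark.
Code and data: https://tactile-vlm.github.io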

Abstract (original)

Touch is an important sensing modality for humans, but it has not yet been incorporated into a multimodal generative language model. This is partially due to the difficulty of obtaining natural language labels for tactile data and the complexity of aligning tactile readings with both visual observations and language descriptions. As a step towards bridging that gap, this work introduces a new dataset of 44K in-the-wild vision-touch pairs, with English language labels annotated by humans (10%) and textual pseudo-labels from GPT-4V (90%). We use this dataset to train a vision-language-aligned tactile encoder for open-vocabulary classification and a touch-vision-language (TVL) model for text generation using the trained encoder. Results suggest that by incorporating touch, the TVL model improves (+29% classification accuracy) touch-vision-language alignment over existing models trained on any pair of those modalities. Although only a small fraction of the dataset is human-labeled, the TVL model demonstrates improved visual-tactile understanding over GPT-4V (+12%) and open-source vision-language models (+32%) on a new touch-vision understanding benchmark. Code and data: https://tactile-vlm.github.io.
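The abstract mentions open-vocabulary classification with a tactile encoder aligned to a vision-language embedding space. The following is a minimal sketch of that general idea, not the authors' implementation: it assumes the tactile reading and each candidate text label have already been embedded into a shared space, and picks the label with the highest cosine similarity. The function names, the three-dimensional toy vectors, and the label set are all hypothetical and chosen only for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify_open_vocab(touch_embedding, label_embeddings):
    """Return the best-matching label and the similarity score for every label.

    touch_embedding: vector assumed to come from a tactile encoder (hypothetical).
    label_embeddings: dict mapping a free-form text label to its embedding,
        assumed to come from a text encoder in the same space (hypothetical).
    """
    scores = {
        label: cosine(touch_embedding, emb)
        for label, emb in label_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy, hand-written embeddings; real ones would come from trained encoders.
label_embeddings = {
    "smooth": [1.0, 0.0, 0.0],
    "rough": [0.0, 1.0, 0.0],
    "soft": [0.0, 0.0, 1.0],
}
touch_embedding = [0.1, 0.9, 0.2]

best_label, scores = classify_open_vocab(touch_embedding, label_embeddings)
print(best_label)  # prints "rough"
```

Because the label set is just a dictionary of text embeddings, new labels can be added at inference time without retraining, which is what makes this style of classification "open-vocabulary".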

arXiv information

Authors	Letian Fu, Gaurav Datta, Huang Huang, William Chung-Ho Panitch, Jaimyn Drake, Joseph Ortiz, Mustafa Mukadam, Mike Lambeta, Roberto Calandra, Ken Goldberg
Published	2024-02-20 18:47:56+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
