A Vision-free Baseline for Multimodal Grammar Induction

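Summary

Prior work has shown that paired vision-language signals substantially improve grammar induction on multimodal datasets such as MSCOCO.
We investigate whether advances in large language models (LLMs) trained only on text can provide strong support for grammar induction in multimodal settings.
We find that our text-only approach, an LLM-based C-PCFG (LC-PCFG), outperforms previous multimodal methods and achieves state-of-the-art grammar induction performance on a range of multimodal datasets.
Compared with image-aided grammar induction, LC-PCFG outperforms the prior state of the art by 7.9 Corpus-F1 points, with 85% fewer parameters and 1.7x faster training.
Across three video-assisted grammar induction benchmarks, LC-PCFG outperforms the prior state of the art by up to 7.7 Corpus-F1 points, with 8.8x faster training.
These results shed light on the notion that text-only language models may contain visually grounded cues that aid grammar induction in multimodal contexts.
Moreover, our results underscore the importance of establishing a robust vision-free baseline when evaluating the benefits of multimodal approaches.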

Abstract (original)

Past work has shown that paired vision-language signals substantially improve grammar induction in multimodal datasets such as MSCOCO. We investigate whether advancements in large language models (LLMs) that are only trained with text could provide strong assistance for grammar induction in multimodal settings. We find that our text-only approach, an LLM-based C-PCFG (LC-PCFG), outperforms previous multi-modal methods, and achieves state-of-the-art grammar induction performance for various multimodal datasets. Compared to image-aided grammar induction, LC-PCFG outperforms the prior state-of-the-art by 7.9 Corpus-F1 points, with an 85% reduction in parameter count and 1.7x faster training speed. Across three video-assisted grammar induction benchmarks, LC-PCFG outperforms prior state-of-the-art by up to 7.7 Corpus-F1, with 8.8x faster training. These results shed light on the notion that text-only language models might include visually grounded cues that aid in grammar induction in multimodal contexts. Moreover, our results emphasize the importance of establishing a robust vision-free baseline when evaluating the benefit of multimodal approaches.

arXiv information

Authors	Boyi Li, Rodolfo Corona, Karttikeya Mangalam, Catherine Chen, Daniel Flaherty, Serge Belongie, Kilian Q. Weinberger, Jitendra Malik, Trevor Darrell, Dan Klein
Published	2023-10-31 17:22:17+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
