Improving Commonsense in Vision-Language Models via Knowledge Graph Riddles

Summary

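This paper analyzes and improves the commonsense ability of recent popular vision-language (VL) models.
Despite their great success, existing VL models still lack commonsense knowledge and reasoning ability (e.g., "Lemons are sour"), a vital component of artificial general intelligence.
Our analysis finds that one important reason is that existing large-scale VL datasets contain little commonsense knowledge, which motivates us to improve the commonsense of VL models from the data perspective.
Rather than collecting a new VL training dataset, we propose a more scalable strategy: "Data Augmentation with kNowledge graph linearization for CommonsensE capability" (DANCE).
It can be viewed as a data augmentation technique that injects commonsense knowledge into existing VL datasets on the fly during training.
More specifically, we leverage a commonsense knowledge graph (e.g., ConceptNet) and create variants of the text descriptions in VL datasets via bidirectional sub-graph sequentialization.
For better commonsense evaluation, we further propose the first retrieval-based commonsense diagnostic benchmark.
Extensive experiments on several representative VL models show that DANCE significantly improves commonsense ability while maintaining performance on vanilla retrieval tasks.
Code and data: https://github.com/pleaseconnectwifi/DANCE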
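
The summary describes injecting knowledge-graph facts into captions by turning sub-graphs into text in both directions. The following is a minimal sketch of that general idea, not the authors' implementation: the triples, the template wording, and the function names `linearize` and `augment` are all hypothetical, and a real pipeline would draw triples from ConceptNet rather than a hard-coded list.

```python
import random

# Hypothetical toy triples in (head, relation, tail) form; real ConceptNet
# data would be loaded from its dump or API instead.
TRIPLES = [
    ("lemon", "HasProperty", "sour"),
    ("lemon", "IsA", "fruit"),
    ("knife", "UsedFor", "cutting"),
]

# Assumed templates that turn a relation into text in both directions.
FORWARD = {   # describe the head entity through its tail
    "HasProperty": "item that is {t}",
    "IsA": "kind of {t}",
    "UsedFor": "item used for {t}",
}
BACKWARD = {  # describe the tail entity through its head
    "HasProperty": "property of {h}",
    "IsA": "category that includes {h}",
    "UsedFor": "purpose of {h}",
}


def linearize(entity, triples):
    """Return riddle-like phrases for `entity` from both triple directions."""
    phrases = []
    for h, r, t in triples:
        if h == entity and r in FORWARD:
            phrases.append(FORWARD[r].format(t=t))
        if t == entity and r in BACKWARD:
            phrases.append(BACKWARD[r].format(h=h))
    return phrases


def augment(caption, triples, rng):
    """Replace one known entity in the caption with a knowledge-graph phrase."""
    words = caption.split()
    candidates = [i for i, w in enumerate(words) if linearize(w, triples)]
    if not candidates:
        return caption  # no known entity; keep the original text
    i = rng.choice(candidates)
    words[i] = rng.choice(linearize(words[i], triples))
    return " ".join(words)


if __name__ == "__main__":
    rng = random.Random(0)
    print(augment("the lemon on the table", TRIPLES, rng))
```

Because the replacement is sampled at call time, a data loader could apply `augment` to each caption as it is read, which is one way to realize the "on the fly during training" behavior without storing a new dataset.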

Abstract (original)

This paper focuses on analyzing and improving the commonsense ability of recent popular vision-language (VL) models. Despite the great success, we observe that existing VL-models still lack commonsense knowledge/reasoning ability (e.g., ‘Lemons are sour’), which is a vital component towards artificial general intelligence. Through our analysis, we find one important reason is that existing large-scale VL datasets do not contain much commonsense knowledge, which motivates us to improve the commonsense of VL-models from the data perspective. Rather than collecting a new VL training dataset, we propose a more scalable strategy, i.e., ‘Data Augmentation with kNowledge graph linearization for CommonsensE capability’ (DANCE). It can be viewed as one type of data augmentation technique, which can inject commonsense knowledge into existing VL datasets on the fly during training. More specifically, we leverage the commonsense knowledge graph (e.g., ConceptNet) and create variants of text description in VL datasets via bidirectional sub-graph sequentialization. For better commonsense evaluation, we further propose the first retrieval-based commonsense diagnostic benchmark. By conducting extensive experiments on some representative VL-models, we demonstrate that our DANCE technique is able to significantly improve the commonsense ability while maintaining the performance on vanilla retrieval tasks. The code and data are available at https://github.com/pleaseconnectwifi/DANCE
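
The abstract mentions a retrieval-based diagnostic benchmark but does not say how it is scored. As a generic illustration only, retrieval tasks are often scored by ranking candidates by embedding similarity and counting how often the correct one ranks first. The sketch below assumes that setup; the vectors, the `(query, candidates, correct_index)` layout, and the function names are hypothetical and are not taken from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query, candidates):
    """Return the index of the candidate most similar to the query."""
    scores = [cosine(query, c) for c in candidates]
    return max(range(len(scores)), key=scores.__getitem__)


def accuracy(examples):
    """Fraction of examples whose top-ranked candidate is the correct one.

    `examples` is a list of (query_vec, candidate_vecs, correct_index).
    """
    hits = sum(retrieve(q, cands) == gold for q, cands, gold in examples)
    return hits / len(examples)
```

In a real evaluation the vectors would come from the VL model's image and text encoders, and the candidate texts would be built so that picking the right one requires commonsense knowledge.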

arXiv information

Authors	Shuquan Ye, Yujia Xie, Dongdong Chen, Yichong Xu, Lu Yuan, Chenguang Zhu, Jing Liao
Published	2022-11-29 18:59:59+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
