CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations

Summary

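Vision-Language Models (VLMs) have proven widely viable thanks to extensive training that aligns visual instructions with answers.
However, this conclusive alignment leads models to ignore critical visual reasoning, which in turn causes failures on meticulous visual problems and produces unfaithful responses.
This paper proposes Chain of Manipulations, a mechanism that lets VLMs solve problems through a series of manipulations. Each manipulation is an operation on the visual input, drawn either from intrinsic abilities acquired through prior training (e.g., grounding) or from imitating human-like behaviors (e.g., zoom in).
This mechanism encourages VLMs to generate faithful responses grounded in visual evidence, and lets users trace the causes of errors along interpretable paths.
The authors accordingly train CogCoM, a general 17B VLM with a memory-based compatible architecture endowed with this reasoning mechanism.
Experiments show that the model achieves state-of-the-art performance across 8 benchmarks from 3 categories, and that a limited number of training steps with the data quickly yields competitive performance.
The code and data are publicly available at https://github.com/THUDM/CogCoM.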
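
To make the idea of a chain of manipulations more concrete, the following is a minimal toy sketch, not CogCoM's actual implementation or API. It assumes an image is a small grid of integers and uses two hypothetical helpers, `grounding` and `crop_and_zoom`, named after the two examples given in the abstract. Each step is recorded in a hypothetical `Chain` object so that the path from input to answer can be inspected afterwards.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Assumption: a toy grayscale image is a list of rows of integer pixel values.
Image = List[List[int]]
# Assumption: a box is (top, left, bottom, right), with end-exclusive bounds.
Box = Tuple[int, int, int, int]


def grounding(image: Image, target: int) -> Box:
    """Toy stand-in for grounding: bounding box of all pixels equal to `target`."""
    hits = [(r, c) for r, row in enumerate(image)
            for c, value in enumerate(row) if value == target]
    if not hits:
        raise ValueError(f"target {target} not found in image")
    rows = [r for r, _ in hits]
    cols = [c for _, c in hits]
    return (min(rows), min(cols), max(rows) + 1, max(cols) + 1)


def crop_and_zoom(image: Image, box: Box, factor: int) -> Image:
    """Toy stand-in for zooming in: crop to `box`, then repeat each pixel `factor` times."""
    top, left, bottom, right = box
    cropped = [row[left:right] for row in image[top:bottom]]
    zoomed: Image = []
    for row in cropped:
        wide = [value for value in row for _ in range(factor)]
        zoomed.extend(list(wide) for _ in range(factor))
    return zoomed


@dataclass
class Chain:
    """Hypothetical trace of manipulations, kept so each step can be inspected."""
    steps: List[str] = field(default_factory=list)

    def record(self, name: str, detail: object) -> None:
        self.steps.append(f"{name}: {detail}")


def solve(image: Image, target: int, factor: int = 2) -> Tuple[Image, Chain]:
    """Run grounding, then crop-and-zoom, recording both steps in the chain."""
    chain = Chain()
    box = grounding(image, target)
    chain.record("grounding", box)
    detail_view = crop_and_zoom(image, box, factor)
    chain.record("crop_and_zoom", f"factor={factor}")
    return detail_view, chain


if __name__ == "__main__":
    toy_image = [
        [0, 0, 0, 0],
        [0, 7, 7, 0],
        [0, 7, 7, 0],
        [0, 0, 0, 0],
    ]
    view, chain = solve(toy_image, target=7)
    for step in chain.steps:
        print(step)
```

In this sketch the recorded steps play the role of the interpretable path mentioned above: if the final answer is wrong, the trace shows whether the located region or the zoomed view was at fault.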

Summary (original)

Vision-Language Models (VLMs) have demonstrated their widespread viability thanks to extensive training in aligning visual instructions to answers. However, this conclusive alignment leads models to ignore critical visual reasoning, and further result in failures on meticulous visual problems and unfaithful responses. In this paper, we propose Chain of Manipulations, a mechanism that enables VLMs to solve problems with a series of manipulations, where each manipulation refers to an operation on the visual input, either from intrinsic abilities (e.g., grounding) acquired through prior training or from imitating human-like behaviors (e.g., zoom in). This mechanism encourages VLMs to generate faithful responses with evidential visual reasoning, and permits users to trace error causes in the interpretable paths. We thus train CogCoM, a general 17B VLM with a memory-based compatible architecture endowed this reasoning mechanism. Experiments show that our model achieves the state-of-the-art performance across 8 benchmarks from 3 categories, and a limited number of training steps with the data swiftly gains a competitive performance. The code and data are publicly available at https://github.com/THUDM/CogCoM.

arXiv information

Authors	Ji Qi, Ming Ding, Weihan Wang, Yushi Bai, Qingsong Lv, Wenyi Hong, Bin Xu, Lei Hou, Juanzi Li, Yuxiao Dong, Jie Tang
Published	2024-02-06 18:43:48+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
