WeThink: Toward General-purpose Vision-Language Reasoning via Reinforcement Learning

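Summary

Building on the success of text-based reasoning models such as DeepSeek-R1, extending these capabilities to multimodal reasoning holds great promise.
Recent work has tried to adapt DeepSeek-R1-style reinforcement learning (RL) training to multimodal large language models (MLLMs), but it has focused on domain-specific tasks such as math and visual perception, leaving a key question open: how can general-purpose vision-language reasoning be achieved through RL?
To address this question, we make three key contributions. (1) A new, scalable multimodal QA synthesis pipeline that autonomously generates context-aware, reasoning-centric question-answer (QA) pairs directly from the given images.
(2) The open-source WeThink dataset, which contains more than 120K multimodal QA pairs with annotated reasoning paths, curated from 18 diverse dataset sources and covering a range of question domains.
(3) A comprehensive exploration of RL on this dataset, using a hybrid reward mechanism that combines rule-based verification with model-based assessment to optimize RL training efficiency across task domains.
Across 14 diverse MLLM benchmarks, we show that the WeThink dataset significantly improves performance, from mathematical reasoning to a wide range of general multimodal tasks.
We also show that the automated data pipeline can keep increasing data diversity, which further improves model performance.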
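
The hybrid reward in item (3) can be pictured with a minimal Python sketch. The abstract does not say which rules or which judge model are used, so everything here is an assumption made for illustration: the function names, the choice to treat numeric and single-letter references as rule-checkable, and the `judge` callable that stands in for a model-based scorer.

```python
import re
from typing import Callable, Optional

# Assumption: references that are a plain number or a single option
# letter (a-e) are treated as verifiable by rule; everything else is not.
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")
_CHOICE = re.compile(r"[a-e]")


def normalize(text: str) -> str:
    """Lowercase, trim, and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text.strip().lower())


def rule_based_reward(prediction: str, reference: str) -> Optional[float]:
    """Return 1.0 or 0.0 if the reference is rule-checkable, else None."""
    pred, ref = normalize(prediction), normalize(reference)
    if _NUMBER.fullmatch(ref):
        # Numeric answers: compare by value, so "3.50" matches "3.5".
        if _NUMBER.fullmatch(pred) is None:
            return 0.0
        return 1.0 if abs(float(pred) - float(ref)) < 1e-6 else 0.0
    if _CHOICE.fullmatch(ref):
        # Multiple-choice answers: exact match on the option letter.
        return 1.0 if pred == ref else 0.0
    return None  # free-form answer, no rule applies


def hybrid_reward(
    prediction: str,
    reference: str,
    judge: Callable[[str, str], float],
) -> float:
    """Use the rule-based check when it applies, otherwise ask the judge.

    `judge` is a hypothetical stand-in for a model-based assessor that
    returns a score; the score is clamped to [0, 1].
    """
    rule_score = rule_based_reward(prediction, reference)
    if rule_score is not None:
        return rule_score
    return min(1.0, max(0.0, judge(prediction, reference)))
```

In this sketch the judge is only consulted for free-form answers, which keeps the cheap, deterministic check on the path where it is reliable. Whether the paper routes samples this way is not stated in the abstract.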

Summary (original)

Building on the success of text-based reasoning models like DeepSeek-R1, extending these capabilities to multimodal reasoning holds great promise. While recent works have attempted to adapt DeepSeek-R1-style reinforcement learning (RL) training paradigms to multimodal large language models (MLLM), focusing on domain-specific tasks like math and visual perception, a critical question remains: How can we achieve the general-purpose visual-language reasoning through RL? To address this challenge, we make three key efforts: (1) A novel Scalable Multimodal QA Synthesis pipeline that autonomously generates context-aware, reasoning-centric question-answer (QA) pairs directly from the given images. (2) The open-source WeThink dataset containing over 120K multimodal QA pairs with annotated reasoning paths, curated from 18 diverse dataset sources and covering various question domains. (3) A comprehensive exploration of RL on our dataset, incorporating a hybrid reward mechanism that combines rule-based verification with model-based assessment to optimize RL training efficiency across various task domains. Across 14 diverse MLLM benchmarks, we demonstrate that our WeThink dataset significantly enhances performance, from mathematical reasoning to diverse general multimodal tasks. Moreover, we show that our automated data pipeline can continuously increase data diversity to further improve model performance.

arXiv information

Authors	Jie Yang, Feipeng Ma, Zitian Wang, Dacheng Yin, Kang Rong, Fengyun Rao, Ruimao Zhang
Published	2025-06-09 16:20:54+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
