From 128K to 4M: Efficient Training of Ultra-Long Context Large Language Models

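Summary

Long-context capabilities are essential for a wide range of applications, including document and video understanding, in-context learning, and inference-time scaling. All of these require models to process and reason over long sequences of text and multimodal data.
In this work, we introduce an efficient training recipe for building ultra-long context LLMs from aligned instruct models, pushing the context length boundary from 128K to 1M, 2M, and 4M tokens.
Our approach uses an efficient continued pretraining strategy to extend the context window, followed by effective instruction tuning to preserve instruction-following and reasoning abilities.
UltraLong-8B, built on Llama3.1-Instruct with this recipe, achieves state-of-the-art performance across a diverse set of long-context benchmarks.
Importantly, models trained with our approach remain competitive on standard benchmarks, showing balanced improvements on both long-context and short-context tasks.
We also provide an in-depth analysis of key design choices, highlighting the impact of scaling strategies and data composition.
Our findings establish a robust framework for efficiently scaling context length while preserving general model capabilities.
We release all model weights at https://ultralong.github.io/.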

Summary (original)

Long-context capabilities are essential for a wide range of applications, including document and video understanding, in-context learning, and inference-time scaling, all of which require models to process and reason over long sequences of text and multimodal data. In this work, we introduce an efficient training recipe for building ultra-long context LLMs from aligned instruct models, pushing the boundaries of context lengths from 128K to 1M, 2M, and 4M tokens. Our approach leverages efficient continued pretraining strategies to extend the context window and employs effective instruction tuning to maintain the instruction-following and reasoning abilities. Our UltraLong-8B, built on Llama3.1-Instruct with our recipe, achieves state-of-the-art performance across a diverse set of long-context benchmarks. Importantly, models trained with our approach maintain competitive performance on standard benchmarks, demonstrating balanced improvements for both long and short context tasks. We further provide an in-depth analysis of key design choices, highlighting the impacts of scaling strategies and data composition. Our findings establish a robust framework for efficiently scaling context lengths while preserving general model capabilities. We release all model weights at: https://ultralong.github.io/.

arXiv information

Authors	Chejian Xu, Wei Ping, Peng Xu, Zihan Liu, Boxin Wang, Mohammad Shoeybi, Bo Li, Bryan Catanzaro
Published	2025-04-08 16:58:58+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
