Vision-centric Token Compression in Large Language Model

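Summary

Large language models (LLMs) have revolutionized natural language processing and excel at handling longer sequences. However, inefficiency and redundancy in processing extended in-context tokens remain a challenge. Many attempts to address this rely on compressing tokens with smaller text encoders, but we question whether text encoders are truly indispensable. Our investigation leads to an unexpected finding: a much smaller vision encoder, applied directly to sequences of text tokens, can rival text encoders on text tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small text understanding benchmarks, VIST achieves comparable results with 16% fewer FLOPs and 50% less memory usage. We further uncover significant token redundancy and devise a frequency-based masking strategy that guides the vision encoder's focus toward the most critical tokens. Interestingly, the trained vision encoder behaves like a summarizer, selectively ignoring less important words such as prepositions and conjunctions. This approach outperforms traditional text-encoder-based methods by 5.7% on average on benchmarks such as TriviaQA, NQ, PopQA, TREF, SST2, and SST5, setting a new standard for token efficiency in LLMs.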
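
The abstract does not spell out the frequency-based masking rule, so the following is only a minimal sketch of one plausible reading: tokens that occur most often in a reference corpus (typically function words such as prepositions and conjunctions) are masked first, leaving the rarer tokens visible. The function name `frequency_mask`, the `mask_ratio` parameter, the `[MASK]` placeholder, and the toy corpus are all assumptions made for illustration and are not taken from the paper.

```python
from collections import Counter


def frequency_mask(tokens, corpus_counts, mask_ratio=0.5, mask_token="[MASK]"):
    """Mask the most frequent tokens and keep the rarer ones.

    Hypothetical helper: the masking rule and parameters are assumptions,
    not the authors' published method.
    """
    n_mask = int(len(tokens) * mask_ratio)
    # Rank positions by corpus frequency (most frequent first);
    # ties are broken by position so the result is deterministic.
    order = sorted(
        range(len(tokens)),
        key=lambda i: (-corpus_counts[tokens[i]], i),
    )
    masked = set(order[:n_mask])
    return [mask_token if i in masked else t for i, t in enumerate(tokens)]


# Toy reference corpus used only to estimate token frequencies.
corpus = "the cat sat on the mat and the dog lay on the rug".split()
counts = Counter(corpus)

sentence = "the dog sat on the mat".split()
print(frequency_mask(sentence, counts, mask_ratio=0.5))
```

With this toy corpus, the two occurrences of "the" and the single "on" are masked, while "dog", "sat", and "mat" remain. A real system would estimate frequencies from a large corpus and would likely operate on subword tokens rather than whitespace-split words.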

Summary (original)

Large Language Models (LLMs) have revolutionized natural language processing, excelling in handling longer sequences. However, the inefficiency and redundancy in processing extended in-context tokens remain a challenge. Many attempts to address this rely on compressing tokens with smaller text encoders, yet we question whether text encoders are truly indispensable. Our journey leads to an unexpected discovery: a much smaller vision encoder, applied directly to sequences of text tokens, can rival text encoders on text tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small text understanding benchmarks, VIST leads to comparable results with 16% fewer FLOPs and 50% less memory usage. We further uncover significant token redundancy and devise a frequency-based masking strategy to guide the focus of the visual encoder toward the most critical tokens. Interestingly, we observe the trained visual encoder performs like a summarizer, selectively ignoring less important words such as prepositions and conjunctions. This approach delivers remarkable results, outperforming traditional text encoder-based methods by 5.7% on average over benchmarks like TriviaQA, NQ, PopQA, TREF, SST2, and SST5, setting a new standard for token efficiency in LLMs.

arXiv information

Authors	Ling Xing, Alex Jinpeng Wang, Rui Yan, Jinhui Tang
Published	2025-02-04 11:45:52+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
