jarxiv | Japanese arxiv | ページ 516

PRIMER: Perception-Aware Robust Learning-based Multiagent Trajectory Planner

投稿日: 2025年5月14日作成者: jarxiv

要約

分散型マルチエージェントの軌跡計画者では、エージェントは衝突のない軌跡を生成するために、自分の位置を通信して交換する必要があります。
ただし、ローカリゼーションエラー/不確実性により、エージェント間で軌跡が完全に共有されている場合でも、軌跡の派生声が失敗する可能性があります。
この問題に対処するために、最初にPARMとPARM*を提示します。知覚を認識し、分散化された、非同期軌跡プランナーを紹介します。これにより、エージェントのチームが不確実な環境をナビゲートできるようになり、知覚情報を使用して軌跡を排出し、障害物を避けます。
PARM*は、PARMが保守的ではないため、より多くの計算を使用して最適なソリューションを見つけるため、PARMとは異なります。
これらの方法は最先端のパフォーマンスを実現しますが、船内での大きな最適化の問題を解決する必要があるため、高い計算コストに悩まされているため、エージェントが高レートで再生することが困難です。
この課題を克服するために、PARM*を専門家のデモンストレーターとして使用して模倣学習（IL）で訓練された学習ベースのプランナーである2番目の重要な貢献、Primerを提示します。
プライマーは、ニューラルネットワークの展開時に低い計算要件を活用し、最適化ベースのアプローチよりも最大5500倍高速な計算速度を達成します。

要約(オリジナル)

In decentralized multiagent trajectory planners, agents need to communicate and exchange their positions to generate collision-free trajectories. However, due to localization errors/uncertainties, trajectory deconfliction can fail even if trajectories are perfectly shared between agents. To address this issue, we first present PARM and PARM*, perception-aware, decentralized, asynchronous multiagent trajectory planners that enable a team of agents to navigate uncertain environments while deconflicting trajectories and avoiding obstacles using perception information. PARM* differs from PARM as it is less conservative, using more computation to find closer-to-optimal solutions. While these methods achieve state-of-the-art performance, they suffer from high computational costs as they need to solve large optimization problems onboard, making it difficult for agents to replan at high rates. To overcome this challenge, we present our second key contribution, PRIMER, a learning-based planner trained with imitation learning (IL) using PARM* as the expert demonstrator. PRIMER leverages the low computational requirements at deployment of neural networks and achieves a computation speed up to 5500 times faster than optimization-based approaches.

arxiv情報

著者	Kota Kondo,Claudius T. Tewari,Andrea Tagliabue,Jesus Tordesillas,Parker C. Lusk,Mason B. Peterson,Jonathan P. How
発行日	2025-05-13 17:18:07+00:00
arxivサイト	arxiv_id(pdf)

提供元, 利用サービス

arxiv.jp, Google

カテゴリー: cs.LG, cs.RO | コメントを受け付けていません

SPAT: Sensitivity-based Multihead-attention Pruning on Time Series Forecasting Models

投稿日: 2025年5月14日作成者: jarxiv

要約

注意ベースのアーキテクチャは、多変量時系列予測で優れた性能を達成していますが、計算上高価です。
パッチングや適応マスキングなどの技術は、サイズとレイテンシを減らすために開発されています。
この作業では、構造化されたプルーニングメソッド（$ \ textbf {s} $ ensitivity $ \ textbf {p} $ runer）を提案します。
以前のアプローチとは異なり、SPATは注意モジュール全体を削除することを目的としています。これにより、特殊なハードウェアを要求することなく、過剰適合のリスクを軽減し、スピードアップを可能にします。
動的感度メトリック、$ \ textbf {s} $ ensitivity $ \ textbf {e} $ nhanced $ \ textbf {n} $ ormalized $ \ textbf {d} $ ispersion（send）を提案します。
多変量データセットでの実験は、SPATが使用するモデルがMSEで2.842％、MAEで1.996％、フロップで35.274％の削減を達成することを示しています。
さらに、Spat-Prunedモデルは、標準およびゼロショット推論の両方で、既存の軽量、Mambaベース、LLMベースのSOTAメソッドよりも優れており、最も効果的な注意メカニズムのみを保持することの重要性を強調しています。
コードを公開されているhttps://anonymous.4open.science/r/spat-6042を公開しました。

要約(オリジナル)

Attention-based architectures have achieved superior performance in multivariate time series forecasting but are computationally expensive. Techniques such as patching and adaptive masking have been developed to reduce their sizes and latencies. In this work, we propose a structured pruning method, SPAT ($\textbf{S}$ensitivity $\textbf{P}$runer for $\textbf{At}$tention), which selectively removes redundant attention mechanisms and yields highly effective models. Different from previous approaches, SPAT aims to remove the entire attention module, which reduces the risk of overfitting and enables speed-up without demanding specialized hardware. We propose a dynamic sensitivity metric, $\textbf{S}$ensitivity $\textbf{E}$nhanced $\textbf{N}$ormalized $\textbf{D}$ispersion (SEND) that measures the importance of each attention module during the pre-training phase. Experiments on multivariate datasets demonstrate that SPAT-pruned models achieve reductions of 2.842% in MSE, 1.996% in MAE, and 35.274% in FLOPs. Furthermore, SPAT-pruned models outperform existing lightweight, Mamba-based and LLM-based SOTA methods in both standard and zero-shot inference, highlighting the importance of retaining only the most effective attention mechanisms. We have made our code publicly available https://anonymous.4open.science/r/SPAT-6042.

arxiv情報

著者	Suhan Guo,Jiahong Deng,Mengjun Yi,Furao Shen,Jian Zhao
発行日	2025-05-13 17:39:31+00:00
arxivサイト	arxiv_id(pdf)

提供元, 利用サービス

arxiv.jp, Google

カテゴリー: cs.LG | コメントを受け付けていません

Generative Molecular Design with Steerable and Granular Synthesizability Control

投稿日: 2025年5月14日作成者: jarxiv

要約

小分子の生成設計における合成化可能性は、依然としてボトルネックのままです。
合成化可能性を考慮する既存の作業は、生成された分子の合成ルートを出力することができます。
ただし、合成の容易さに対処し、望ましい反応制約を組み込む柔軟性を可能にする際には、最小限の注意が払われています。
この作業では、操縦可能で粒状合成能力制御を可能にする小分子生成設計フレームワークを提案します。
生成された分子は、事前に定義された許可反応を含む予測された合成ルートを使用して、オプションで他のものを避けながら、任意のマルチパラメーター最適化目標を満たします。
また、すべての反応が事前に定義されたセットに属していることを強制することもできます。
最も一般的な医療化学変換全体で、これらの反応制約を混合して一致させる能力を示します。
次に、私たちのフレームワークを使用して、産業副産物をde novo最適化分子に向けて評価する方法を示します。
さらに進むと、合成可能性の制約に対する粒状制御が、超大型メイクオンデマンドライブラリの仮想スクリーニングをゆるく模倣する方法を示します。
1つのGPUのみを使用して、15K分子を生成およびドッキングして、142Bメイクオンデマンド分子を構成するFreedom 4.0の有望な候補を特定します（ライブラリの0.00001％のみを評価します）。
反応制約を満たす生成された分子には、正確な一致率が90％以上です。
最後に、最近の合成可能性に制約された生成モデルに対してフレームワークをベンチマークし、すべての分子が単一の反応型から合成可能でなければならないという追加の制約を課した場合でも、最高のサンプル効率を実証します。
主なテーマは、事前に訓練されたジェネラリストの分子生成モデルを、強化学習を通じて挑戦的な合成化可能性の制約の下で特性を最適化した小分子を生成するためにインセンティブ化できることを実証することです。

要約(オリジナル)

Synthesizability in small molecule generative design remains a bottleneck. Existing works that do consider synthesizability can output predicted synthesis routes for generated molecules. However, there has been minimal attention in addressing the ease of synthesis and enabling flexibility to incorporate desired reaction constraints. In this work, we propose a small molecule generative design framework that enables steerable and granular synthesizability control. Generated molecules satisfy arbitrary multi-parameter optimization objectives with predicted synthesis routes containing pre-defined allowed reactions, while optionally avoiding others. One can also enforce that all reactions belong to a pre-defined set. We show the capability to mix-and-match these reaction constraints across the most common medicinal chemistry transformations. Next, we show how our framework can be used to valorize industrial byproducts towards de novo optimized molecules. Going further, we demonstrate how granular control over synthesizability constraints can loosely mimic virtual screening of ultra-large make-on-demand libraries. Using only a single GPU, we generate and dock 15k molecules to identify promising candidates in Freedom 4.0 constituting 142B make-on-demand molecules (assessing only 0.00001% of the library). Generated molecules satisfying the reaction constraints have > 90% exact match rate. Lastly, we benchmark our framework against recent synthesizability-constrained generative models and demonstrate the highest sample efficiency even when imposing the additional constraint that all molecules must be synthesizable from a single reaction type. The main theme is demonstrating that a pre-trained generalist molecular generative model can be incentivized to generate property-optimized small molecules under challenging synthesizability constraints through reinforcement learning.

arxiv情報

著者	Jeff Guo,Víctor Sabanza-Gil,Zlatko Jončev,Jeremy S. Luterbacher,Philippe Schwaller
発行日	2025-05-13 17:53:54+00:00
arxivサイト	arxiv_id(pdf)

提供元, 利用サービス

arxiv.jp, Google

カテゴリー: cs.LG, q-bio.BM | コメントを受け付けていません

Addressing the Current Challenges of Quantum Machine Learning through Multi-Chip Ensembles

投稿日: 2025年5月14日作成者: jarxiv

要約

Quantum Machine Learning（QML）は、多様なドメイン全体で計算上の課題を解決するための大きな約束を抱いています。
ただし、その実用的な展開は、騒音、限られたスケーラビリティ、変分量子回路（VQCS）の訓練可能性の問題を含む、ノイズの多い中間スケール量子（NISQ）デバイスの制限によって制約されます。
マルチチップアンサンブルVQCフレームワークを紹介します。これは、スケーラビリティ、トレーニング性、ノイズの回復力を高めるために、より小さな量子チップ全体で高次元の計算を分割します。
このアプローチは、不毛のプラトーを軽減し、量子誤差バイアスと分散を減らし、制御されたエンタングルメントを通じて堅牢な一般化を維持することを示します。
現在および新たな量子ハードウェアに合わせて設計されたフレームワークは、標準のベンチマークデータセット（MNIST、FashionMnist、CIFAR-10）およびReal World Dataset（Physionet EEG）の実験によって検証されているように、短期デバイスでスケーラブルなQMLを有効にするための強力な可能性を示しています。

要約(オリジナル)

Quantum Machine Learning (QML) holds significant promise for solving computational challenges across diverse domains. However, its practical deployment is constrained by the limitations of noisy intermediate-scale quantum (NISQ) devices, including noise, limited scalability, and trainability issues in variational quantum circuits (VQCs). We introduce the multi-chip ensemble VQC framework, which partitions high-dimensional computations across smaller quantum chips to enhance scalability, trainability, and noise resilience. We show that this approach mitigates barren plateaus, reduces quantum error bias and variance, and maintains robust generalization through controlled entanglement. Designed to align with current and emerging quantum hardware, the framework demonstrates strong potential for enabling scalable QML on near-term devices, as validated by experiments on standard benchmark datasets (MNIST, FashionMNIST, CIFAR-10) and real world dataset (PhysioNet EEG).

arxiv情報

著者	Junghoon Justin Park,Jiook Cha,Samuel Yen-Chi Chen,Huan-Hsin Tseng,Shinjae Yoo
発行日	2025-05-13 17:57:53+00:00
arxivサイト	arxiv_id(pdf)

提供元, 利用サービス

arxiv.jp, Google

カテゴリー: cs.CE, cs.LG | コメントを受け付けていません

PCS-UQ: Uncertainty Quantification via the Predictability-Computability-Stability Framework

投稿日: 2025年5月14日作成者: jarxiv

要約

機械学習（ML）モデルがハイステークスドメインでますます展開されているため、これらのモデルの安全性と信頼性を確保するためには、信頼できる不確実性の定量化（UQ）が重要です。
従来のUQメソッドは、真の生成モデルの指定に依存しており、誤解に堅牢ではありません。
一方、コンフォーマル推論は任意のMLモデルを許可しますが、モデル選択を考慮していないため、大きな間隔サイズにつながります。
YuとKumbierが提案した真正なデータサイエンスの予測可能性、計算可能性、および安定性（PCS）フレームワークに基づいてUQメソッドを提案することにより、これらの欠点に取り組みます。
具体的には、PCS-UQは予測チェックを使用して不適切なモデルをスクリーニングすることにより、モデルの選択に対処します。
PCS-UQは、これらのスクリーニングされたアルゴリズムに複数のブートストラップに適合し、サンプル間の変動性とアルゴリズムの不安定性を評価し、より信頼性の高い不確実性の推定値を可能にします。
さらに、予測セットの局所適応性を向上させる新しいキャリブレーションスキームを提案します。
17ドルの$ $回帰と6ドルの分類データセットにまたがる実験は、PCS-UQが望ましいカバレッジを達成し、コンフォーマルアプローチよりも幅を約20 \％$削減することを示しています。
さらに、当社のローカル分析によると、PCS-UQはサブグループ全体でターゲットカバレッジを達成することがよくありますが、コンフォーマルメソッドはそうしていません。
大規模な学習モデルの場合、PCS-UQの高価な複数のブートストラップトレーニングを回避する計算効率の高い近似スキームを提案します。
3つのコンピュータービジョンベンチマークで、PCS-UQは、コンフォーマルメソッドの予測セットサイズを20ドル\％$削減します。
理論的には、修正されたPCS-UQアルゴリズムは、分割コンフォーマル推論の形式であり、交換可能なデータで望ましいカバレッジを達成します。

要約(オリジナル)

As machine learning (ML) models are increasingly deployed in high-stakes domains, trustworthy uncertainty quantification (UQ) is critical for ensuring the safety and reliability of these models. Traditional UQ methods rely on specifying a true generative model and are not robust to misspecification. On the other hand, conformal inference allows for arbitrary ML models but does not consider model selection, which leads to large interval sizes. We tackle these drawbacks by proposing a UQ method based on the predictability, computability, and stability (PCS) framework for veridical data science proposed by Yu and Kumbier. Specifically, PCS-UQ addresses model selection by using a prediction check to screen out unsuitable models. PCS-UQ then fits these screened algorithms across multiple bootstraps to assess inter-sample variability and algorithmic instability, enabling more reliable uncertainty estimates. Further, we propose a novel calibration scheme that improves local adaptivity of our prediction sets. Experiments across $17$ regression and $6$ classification datasets show that PCS-UQ achieves the desired coverage and reduces width over conformal approaches by $\approx 20\%$. Further, our local analysis shows PCS-UQ often achieves target coverage across subgroups while conformal methods fail to do so. For large deep-learning models, we propose computationally efficient approximation schemes that avoid the expensive multiple bootstrap trainings of PCS-UQ. Across three computer vision benchmarks, PCS-UQ reduces prediction set size over conformal methods by $20\%$. Theoretically, we show a modified PCS-UQ algorithm is a form of split conformal inference and achieves the desired coverage with exchangeable data.

arxiv情報

著者	Abhineet Agarwal,Michael Xiao,Rebecca Barter,Omer Ronen,Boyu Fan,Bin Yu
発行日	2025-05-13 17:58:16+00:00
arxivサイト	arxiv_id(pdf)

提供元, 利用サービス

arxiv.jp, Google

カテゴリー: cs.LG, math.ST, stat.ME, stat.ML, stat.TH | コメントを受け付けていません

SeriesBench: A Benchmark for Narrative-Driven Drama Series Understanding

投稿日: 2025年5月14日作成者: jarxiv

要約

マルチモーダル大手言語モデル（MLLMS）の急速な発展により、これらのモデルのビデオ理解機能を評価するために、より多くのベンチマークが確立されています。
ただし、これらのベンチマークはスタンドアロンビデオに焦点を当てており、主に人間の行動やオブジェクト状態などの「視覚要素」を評価します。
現実には、現代のビデオは、通常、シリーズとして提示される複雑で継続的な物語を網羅することがよくあります。
この課題に対処するために、105の慎重にキュレーションされた物語主導のシリーズで構成されるベンチマークであるシリーズベンチを提案します。
具体的には、最初にさまざまなジャンルにまたがる多様なドラマシリーズのセットを選択します。
次に、新しい長期の物語注釈法を紹介し、フルインフォメーション変換アプローチと組み合わせて、手動注釈を多様なタスク形式に変換します。
シリーズ内のプロット構造とキャラクター関係の詳細な分析のためのモデル容量をさらに強化するために、新しい物語の推論フレームワークであるPC-DCOTを提案します。
シリーズベンチの広範な結果は、既存のMLLMが依然として物語主導のシリーズを理解する上で重要な課題に直面していることを示していますが、PC-DCOTにより、これらのMLLMがパフォーマンスの改善を実現することができます。
全体として、シリーズベンチとPC-DCOTは、MLLMSの将来の発展を導くために、モデル能力を高めるためのモデル機能を進めることの重要な必要性を強調しています。
シリーズベンチは、https://github.com/zackhxn/seriesbench-cvpr2025で公開されています。

要約(オリジナル)

With the rapid development of Multi-modal Large Language Models (MLLMs), an increasing number of benchmarks have been established to evaluate the video understanding capabilities of these models. However, these benchmarks focus on standalone videos and mainly assess ‘visual elements’ like human actions and object states. In reality, contemporary videos often encompass complex and continuous narratives, typically presented as a series. To address this challenge, we propose SeriesBench, a benchmark consisting of 105 carefully curated narrative-driven series, covering 28 specialized tasks that require deep narrative understanding. Specifically, we first select a diverse set of drama series spanning various genres. Then, we introduce a novel long-span narrative annotation method, combined with a full-information transformation approach to convert manual annotations into diverse task formats. To further enhance model capacity for detailed analysis of plot structures and character relationships within series, we propose a novel narrative reasoning framework, PC-DCoT. Extensive results on SeriesBench indicate that existing MLLMs still face significant challenges in understanding narrative-driven series, while PC-DCoT enables these MLLMs to achieve performance improvements. Overall, our SeriesBench and PC-DCoT highlight the critical necessity of advancing model capabilities to understand narrative-driven series, guiding the future development of MLLMs. SeriesBench is publicly available at https://github.com/zackhxn/SeriesBench-CVPR2025.

arxiv情報

著者	Chenkai Zhang,Yiming Lei,Zeming Liu,Haitao Leng,Shaoguo Liu,Tingting Gao,Qingjie Liu,Yunhong Wang
発行日	2025-05-13 08:06:19+00:00
arxivサイト	arxiv_id(pdf)

提供元, 利用サービス

arxiv.jp, Google

カテゴリー: cs.AI, cs.CL, cs.CV | コメントを受け付けていません

On the Geometry of Semantics in Next-token Prediction

投稿日: 2025年5月14日作成者: jarxiv

要約

現代の言語モデルは、次のトークン予測（NTP）のみを通じて訓練されているにもかかわらず、言語的意味をキャプチャする顕著な能力を示しています。
この概念的にシンプルなトレーニング目標により、モデルが潜在的なセマンティックおよび文法の概念を抽出およびエンコードする方法を調査します。
我々の分析により、NTP最適化は、次の単語の共起パターンをキャプチャする中心的なデータスパーシティマトリックスの特異値分解（SVD）因子を介して概念をエンコードするようにモデルを暗黙的にガイドすることが明らかになりました。
モデルはこのマトリックスを明示的に構築することはありませんが、学習された単語とコンテキストの埋め込みは、それを効果的に要因にして言語構造をキャプチャします。
最も重要なSVD因子は、トレーニング中に最初に学習され、埋め込みのスペクトルクラスタリングの使用を動機付けて、クラシックKマーンと概念の解釈によって直接動機付けられた新しいオルサンベースの方法の両方を含む、人間の解釈可能なセマンティクスを特定します。
全体として、私たちの作業は分布セマンティクス、神経崩壊の幾何学、およびニューラルネットワークトレーニングのダイナミクスを橋渡しし、NTPの暗黙的バイアスが言語モデルの意味表現の出現をどのように形成するかについての洞察を提供します。

要約(オリジナル)

Modern language models demonstrate a remarkable ability to capture linguistic meaning despite being trained solely through next-token prediction (NTP). We investigate how this conceptually simple training objective leads models to extract and encode latent semantic and grammatical concepts. Our analysis reveals that NTP optimization implicitly guides models to encode concepts via singular value decomposition (SVD) factors of a centered data-sparsity matrix that captures next-word co-occurrence patterns. While the model never explicitly constructs this matrix, learned word and context embeddings effectively factor it to capture linguistic structure. We find that the most important SVD factors are learned first during training, motivating the use of spectral clustering of embeddings to identify human-interpretable semantics, including both classical k-means and a new orthant-based method directly motivated by our interpretation of concepts. Overall, our work bridges distributional semantics, neural collapse geometry, and neural network training dynamics, providing insights into how NTP’s implicit biases shape the emergence of meaning representations in language models.

arxiv情報

著者	Yize Zhao,Christos Thrampoulidis
発行日	2025-05-13 08:46:04+00:00
arxivサイト	arxiv_id(pdf)

提供元, 利用サービス

arxiv.jp, Google

カテゴリー: cs.CL | コメントを受け付けていません

Alignment Drift in CEFR-prompted LLMs for Interactive Spanish Tutoring

投稿日: 2025年5月14日作成者: jarxiv

要約

このペーパーでは、第二言語学習の文脈における適応チューターとしての大規模な言語モデル（LLM）の可能性を調査します。
特に、システムプロンプトがLLMSを確実に制約して、学生の能力レベルに適したテキストのみを生成できるかどうかを評価します。
7Bから12Bのパラメーターまでのサイズの指導型のオープンソースLLMを使用して、スペイン語で完全な教師と学生の対話をシミュレートします。
ダイアログは、チューターと学生の役割を別々のチャット履歴で代替するLLMを持つことによって生成されます。
チューターモデルからの出力を使用して、CEFRベースのプロンプトの有効性を評価して、3つの習熟レベル（A1、B1、C1）にわたってテキストの難易度を制御します。
私たちの調査結果は、システムのプロンプトを使用してモデル出力を制約することができるが、プロンプトだけが持続的で長期的な相互作用コンテキストには脆すぎることを示唆しています。
私たちの結果は、パーソナライズされた習熟度に整合した適応チューターに対するLLMの実現可能性に関する洞察を提供し、人間の参加者なしでモデルパフォーマンスの低コストの評価のためのスケーラブルな方法を提供します。

要約(オリジナル)

This paper investigates the potentials of Large Language Models (LLMs) as adaptive tutors in the context of second-language learning. In particular, we evaluate whether system prompting can reliably constrain LLMs to generate only text appropriate to the student’s competence level. We simulate full teacher-student dialogues in Spanish using instruction-tuned, open-source LLMs ranging in size from 7B to 12B parameters. Dialogues are generated by having an LLM alternate between tutor and student roles with separate chat histories. The output from the tutor model is then used to evaluate the effectiveness of CEFR-based prompting to control text difficulty across three proficiency levels (A1, B1, C1). Our findings suggest that while system prompting can be used to constrain model outputs, prompting alone is too brittle for sustained, long-term interactional contexts – a phenomenon we term alignment drift. Our results provide insights into the feasibility of LLMs for personalized, proficiency-aligned adaptive tutors and provide a scalable method for low-cost evaluation of model performance without human participants.

arxiv情報

著者	Mina Almasi,Ross Deans Kristensen-McLachlan
発行日	2025-05-13 08:50:57+00:00
arxivサイト	arxiv_id(pdf)

提供元, 利用サービス

arxiv.jp, Google

カテゴリー: cs.CL | コメントを受け付けていません

Towards Contamination Resistant Benchmarks

投稿日: 2025年5月14日作成者: jarxiv

要約

大規模な言語モデル（LLMS）の急速な発展は、自然言語処理の景観を変えました。
LLMを適切に評価することは、潜在能力を理解し、安全などの懸念に対処するために重要です。
ただし、LLM評価にはさまざまな要因が直面しており、その中で、評価の信頼性を損なう重要な問題として汚染が際立っています。
この作業では、この課題に対処するために汚染抵抗の概念を紹介します。
シフトが1の場合、シフトが1の場合は「bc」から「ab」から「ab」）に基づいたベンチマークを提案します。
さまざまな設定で広く使用されているLLMでこのベンチマークをテストします。これらのモデルは、汚染が制御されるとこのベンチマークに苦労していることがわかります。
私たちの調査結果は、現在のLLMの問題を明らかにし、それらの真の能力に関する重要な質問を提起します。
私たちの仕事は、汚染耐性ベンチマークの開発に貢献し、より厳格なLLM評価を可能にし、LLMの真の機能と制限に関する洞察を提供します。

要約(オリジナル)

The rapid development of large language models (LLMs) has transformed the landscape of natural language processing. Evaluating LLMs properly is crucial for understanding their potential and addressing concerns such as safety. However, LLM evaluation is confronted by various factors, among which contamination stands out as a key issue that undermines the reliability of evaluations. In this work, we introduce the concept of contamination resistance to address this challenge. We propose a benchmark based on Caesar ciphers (e.g., ‘ab’ to ‘bc’ when the shift is 1), which, despite its simplicity, is an excellent example of a contamination resistant benchmark. We test this benchmark on widely used LLMs under various settings, and we find that these models struggle with this benchmark when contamination is controlled. Our findings reveal issues in current LLMs and raise important questions regarding their true capabilities. Our work contributes to the development of contamination resistant benchmarks, enabling more rigorous LLM evaluation and offering insights into the true capabilities and limitations of LLMs.

arxiv情報

著者	Rahmatullah Musawi,Sheng Lu
発行日	2025-05-13 09:35:40+00:00
arxivサイト	arxiv_id(pdf)

提供元, 利用サービス

arxiv.jp, Google

カテゴリー: cs.CL | コメントを受け付けていません

Gradual Binary Search and Dimension Expansion : A general method for activation quantization in LLMs

投稿日: 2025年5月14日作成者: jarxiv

要約

大規模な言語モデル（LLM）は、人工知能において極めて重要になり、推論、理解、および生成の強力な能力を示しています。
ただし、エッジデバイスでの展開は相当なサイズによって妨げられ、多くの場合数億パラメーターに達します。
量子化は、メモリの使用量と推論時間を短縮するために広く使用されている方法ですが、LLMSは、その活性化における外れ値の有病率のために独自の課題を提示します。
この作業では、ランダム回転行列上のHadamard Matricesの理論的利点を活用して、LLMSの量子化の境界を押し広げます。
Hadamard Matricesは、低ビットの量子化を達成する上で重要な障害である外れ値を減らすのに効果的であることを示しています。
漸進的なバイナリ検索に基づく方法により、重み、活性化、キー価値（kV）キャッシュの3ビット量子化により、SOTAメソッドと比較して一般的なベンチマークの精度が40％増加します。
Paley Algorithmを使用することにより、Qwenアーキテクチャと同様に、回転行列の使用を拡張して、Qwenアーキテクチャと同様に、Qwenアーキテクチャと同様にサポートします。
我々は、外れ値を減らす際のハダマードマトリックスの優位性を理論的に実証します。重み、活性化、およびKVキャッシュの3ビット量子化を達成し、モデルのパフォーマンスを大幅に向上させました。
Mistral、Llama、Qwenなどの複数のモデルファミリでの実験結果は、既存の方法を上回り、実用的な3ビット量子化を可能にし、アプローチの有効性を示しています。

要約(オリジナル)

Large language models (LLMs) have become pivotal in artificial intelligence, demonstrating strong capabilities in reasoning, understanding, and generating data. However, their deployment on edge devices is hindered by their substantial size, often reaching several billion parameters. Quantization is a widely used method to reduce memory usage and inference time, however LLMs present unique challenges due to the prevalence of outliers in their activations. In this work, we leverage the theoretical advantages of Hadamard matrices over random rotation matrices to push the boundaries of quantization in LLMs. We demonstrate that Hadamard matrices are more effective in reducing outliers, which are a significant obstacle in achieving low-bit quantization. Our method based on a gradual binary search enables 3-bit quantization for weights, activations, and key-value (KV) caches, resulting in a 40% increase in accuracy on common benchmarks compared to SoTA methods. We extend the use of rotation matrices to support non-power-of-2 embedding dimensions, similar to the Qwen architecture, by employing the Paley algorithm. We theoretically demonstrates the superiority of Hadamard matrices in reducing outliers.We achieved 3-bit quantization for weights, activations, and KV cache, significantly enhancing model performance. Our experimental results on multiple models family like Mistral, LLaMA, and Qwen demonstrate the effectiveness of our approach, outperforming existing methods and enabling practical 3-bit quantization.

arxiv情報

著者	Lucas Maisonnave,Cyril Moineau,Olivier Bichler,Fabrice Rastello
発行日	2025-05-13 09:36:03+00:00
arxivサイト	arxiv_id(pdf)

提供元, 利用サービス

arxiv.jp, Google

カテゴリー: cs.AI, cs.CL, cs.LG | コメントを受け付けていません

要約

要約(オリジナル)

arxiv情報

提供元, 利用サービス

要約

要約(オリジナル)

arxiv情報

提供元, 利用サービス

要約

要約(オリジナル)

arxiv情報

提供元, 利用サービス

要約

要約(オリジナル)

arxiv情報

提供元, 利用サービス

要約

要約(オリジナル)

arxiv情報

提供元, 利用サービス

要約

要約(オリジナル)

arxiv情報

提供元, 利用サービス

要約

要約(オリジナル)

arxiv情報

提供元, 利用サービス

要約

要約(オリジナル)

arxiv情報

提供元, 利用サービス

要約

要約(オリジナル)

arxiv情報

提供元, 利用サービス

要約

要約(オリジナル)

arxiv情報

提供元, 利用サービス

最近の投稿

最近のコメント