jarxiv | Japanese arxiv | Page 1632

Do Multilingual LLMs Think In English?

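Posted: February 24, 2025 | Author: jarxiv

Summary

Large language models (LLMs) have multilingual capabilities and can solve tasks in a variety of languages.
However, we show that current LLMs make key decisions in a representation space closest to English, regardless of the input and output languages.
Examining internal representations with a logit lens on French, German, Dutch, and Mandarin sentences, we show that the LLM first emits representations close to English for semantically loaded words and only then translates them into the target language.
We further show that activation steering in these LLMs is more effective when the steering vectors are computed in English rather than in the language of the inputs and outputs.
This suggests that multilingual LLMs carry out key reasoning steps in a representation heavily shaped by English, in a way that is not transparent to system users.

Summary (original)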

Large language models (LLMs) have multilingual capabilities and can solve tasks across various languages. However, we show that current LLMs make key decisions in a representation space closest to English, regardless of their input and output languages. Exploring the internal representations with a logit lens for sentences in French, German, Dutch, and Mandarin, we show that the LLM first emits representations close to English for semantically-loaded words before translating them into the target language. We further show that activation steering in these LLMs is more effective when the steering vectors are computed in English rather than in the language of the inputs and outputs. This suggests that multilingual LLMs perform key reasoning steps in a representation that is heavily shaped by English in a way that is not transparent to system users.
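
The logit lens mentioned in the abstract decodes an intermediate hidden state through the model's output (unembedding) matrix to see which token that layer currently favors. The following is a minimal toy sketch of that idea, not the authors' code: the three-word vocabulary, the `unembed` matrix, and the per-layer `hidden_states` are made-up values, and the final layer normalization that a real logit lens usually applies is omitted.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def logit_lens(hidden_states, unembed, vocab):
    """Decode each layer's hidden state through the unembedding matrix.

    Returns one (top_token, probability) pair per layer.
    """
    decoded = []
    for h in hidden_states:
        # One logit per vocabulary entry: dot product of its row with h.
        logits = [sum(w * x for w, x in zip(row, h)) for row in unembed]
        probs = softmax(logits)
        best = max(range(len(vocab)), key=probs.__getitem__)
        decoded.append((vocab[best], probs[best]))
    return decoded


# Hypothetical toy values chosen by hand for illustration only.
vocab = ["cat", "chat", "Katze"]
unembed = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]       # one row per token
hidden_states = [[0.2, 0.1], [2.0, 0.5], [0.5, 3.0]]   # one vector per layer

lens_output = logit_lens(hidden_states, unembed, vocab)
for layer, (token, prob) in enumerate(lens_output):
    print(f"layer {layer}: {token} (p={prob:.3f})")
```

With these toy values the first two layers decode to the English token and only the last layer decodes to the French one, which is the kind of pattern the abstract describes.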

arXiv information

Authors	Lisa Schut, Yarin Gal, Sebastian Farquhar
Published	2025-02-21 17:19:23+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google

Categories: cs.AI, cs.CL, cs.LG

Training Turn-by-Turn Verifiers for Dialogue Tutoring Agents: The Curious Case of LLMs as Your Coding Tutors

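Posted: February 24, 2025 | Author: jarxiv

Summary

Intelligent tutoring agents powered by large language models (LLMs) are increasingly being explored as a way to deliver personalized guidance in areas such as language learning and science education.
However, their ability to guide users through complex real-world tasks remains underexplored.
To address this limitation, this work focuses on coding tutoring, a challenging problem in which the tutor must proactively guide students toward completing predefined coding tasks.
We propose Trace-and-Verify (TRAVER), a novel agent workflow that combines knowledge tracing, which estimates a student's knowledge state, with turn-by-turn verification, which ensures effective guidance toward task completion.
We introduce DICT, an automatic evaluation protocol that assesses tutor agents holistically using controlled student simulation and code generation tests.
Extensive experiments reveal the challenges of coding tutoring and show that TRAVER achieves a significantly higher success rate.
Although this paper uses code tutoring as its example, the results and findings extend beyond coding and offer valuable insights for advancing tutoring agents across a variety of tasks.

Summary (original)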

Intelligent tutoring agents powered by large language models (LLMs) have been increasingly explored to deliver personalized guidance in areas such as language learning and science education. However, their capabilities in guiding users to solve complex real-world tasks remain underexplored. To address this limitation, in this work, we focus on coding tutoring, a challenging problem that requires tutors to proactively guide students toward completing predefined coding tasks. We propose a novel agent workflow, Trace-and-Verify (TRAVER), which combines knowledge tracing to estimate a student’s knowledge state and turn-by-turn verification to ensure effective guidance toward task completion. We introduce DICT, an automatic evaluation protocol that assesses tutor agents holistically using controlled student simulation and code generation tests. Extensive experiments reveal the challenges of coding tutoring and demonstrate that TRAVER achieves a significantly higher success rate. Although we use code tutoring as an example in this paper, our results and findings can be extended beyond coding, providing valuable insights into advancing tutoring agents for a variety of tasks.

arXiv information

Authors	Jian Wang, Yinpei Dai, Yichi Zhang, Ziqiao Ma, Wenjie Li, Joyce Chai
Published	2025-02-21 17:25:44+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google

Categories: cs.AI, cs.CL

On the Robustness of Transformers against Context Hijacking for Linear Classification

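Posted: February 24, 2025 | Author: jarxiv

Summary

Transformer-based large language models (LLMs) have demonstrated powerful in-context learning capabilities.
However, their predictions can be disrupted by factually correct context, a phenomenon known as context hijacking, which reveals a significant robustness issue.
To understand this phenomenon theoretically, we study an in-context linear classification problem, building on recent advances in linear transformers.
In our setup, the context tokens are designed as factually correct query-answer pairs whose queries are similar to the final query but carry the opposite label.
We then develop a general theoretical analysis of the robustness of linear transformers, formulated as a function of model depth, training context length, and the number of hijacking context tokens.
A key finding is that well-trained deeper transformers can achieve higher robustness, which is consistent with empirical observations.
We show that this improvement arises because deeper layers enable more fine-grained optimization steps, which effectively mitigates interference from context hijacking.
This is also well supported by numerical experiments.
Our findings provide theoretical insight into the benefits of deeper architectures and contribute to a better understanding of transformer architectures.

Summary (original)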

Transformer-based Large Language Models (LLMs) have demonstrated powerful in-context learning capabilities. However, their predictions can be disrupted by factually correct context, a phenomenon known as context hijacking, revealing a significant robustness issue. To understand this phenomenon theoretically, we explore an in-context linear classification problem based on recent advances in linear transformers. In our setup, context tokens are designed as factually correct query-answer pairs, where the queries are similar to the final query but have opposite labels. Then, we develop a general theoretical analysis on the robustness of the linear transformers, which is formulated as a function of the model depth, training context lengths, and number of hijacking context tokens. A key finding is that a well-trained deeper transformer can achieve higher robustness, which aligns with empirical observations. We show that this improvement arises because deeper layers enable more fine-grained optimization steps, effectively mitigating interference from context hijacking. This is also well supported by our numerical experiments. Our findings provide theoretical insights into the benefits of deeper architectures and contribute to enhancing the understanding of transformer architectures.

arXiv information

Authors	Tianle Li, Chenyang Zhang, Xingwu Chen, Yuan Cao, Difan Zou
Published	2025-02-21 17:31:00+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google

Categories: cs.AI, cs.CL, cs.LG, stat.ML

PDeepPP: A Deep learning framework with Pretrained Protein language for peptide classification

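Posted: February 24, 2025 | Author: jarxiv

Summary

Protein post-translational modifications (PTMs) and bioactive peptides (BPs) play critical roles in various biological processes and hold significant therapeutic potential.
However, identifying PTM sites and bioactive peptides through experimental methods is often labor-intensive, costly, and time-consuming.
As a result, computational tools, particularly those based on deep learning, have become effective solutions for predicting PTM sites and peptide bioactivity.
Despite progress in this field, existing methods still struggle with the complexity of protein sequences and with the need for high-quality predictions across diverse datasets.
To address these issues, we propose a deep learning framework for peptide classification that integrates pretrained protein language models with a neural network combining a transformer and a CNN.
By leveraging the ability of pretrained models to capture complex relationships within protein sequences, together with the predictive power of parallel networks, the approach improves feature extraction while raising prediction accuracy.
The framework was applied to multiple tasks involving PTM site and bioactive peptide prediction, using large-scale datasets to make the model more robust.
In a comparison across 33 tasks, the model achieved state-of-the-art (SOTA) performance on 25 of them, surpassing existing methods and demonstrating its versatility across different datasets.
Our results suggest that this approach offers a scalable and effective solution for large-scale peptide discovery and PTM analysis, paving the way for more efficient peptide classification and functional annotation.

Summary (original)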

Protein post-translational modifications (PTMs) and bioactive peptides (BPs) play critical roles in various biological processes and have significant therapeutic potential. However, identifying PTM sites and bioactive peptides through experimental methods is often labor-intensive, costly, and time-consuming. As a result, computational tools, particularly those based on deep learning, have become effective solutions for predicting PTM sites and peptide bioactivity. Despite progress in this field, existing methods still struggle with the complexity of protein sequences and the challenge of requiring high-quality predictions across diverse datasets. To address these issues, we propose a deep learning framework that integrates pretrained protein language models with a neural network combining transformer and CNN for peptide classification. By leveraging the ability of pretrained models to capture complex relationships within protein sequences, combined with the predictive power of parallel networks, our approach improves feature extraction while enhancing prediction accuracy. This framework was applied to multiple tasks involving PTM site and bioactive peptide prediction, utilizing large-scale datasets to enhance the model’s robustness. In the comparison across 33 tasks, the model achieved state-of-the-art (SOTA) performance in 25 of them, surpassing existing methods and demonstrating its versatility across different datasets. Our results suggest that this approach provides a scalable and effective solution for large-scale peptide discovery and PTM analysis, paving the way for more efficient peptide classification and functional annotation.

arXiv information

Authors	Jixiu Zhai, Tianchi Lu, Haitian Zhong, Ziyang Xu, Yuhuan Liu, Xueying Wang, Dan Huang
Published	2025-02-21 17:31:22+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google

Categories: 68T07, 92C40, cs.AI, cs.LG, I.2.6

Large Language Models for Interpretable Mental Health Diagnosis

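Posted: February 24, 2025 | Author: jarxiv

Summary

We propose a clinical decision support system (CDSS) for mental health diagnosis that combines the strengths of large language models (LLMs) and constraint logic programming (CLP).
A CDSS is important because the diagnostic manuals used by mental health professionals are highly complex and diagnostic errors are dangerous.
Our CDSS is a software tool that uses an LLM to translate diagnostic manuals into a logic program, solves the program with an off-the-shelf CLP engine, and queries a patient's diagnosis based on the encoded rules and the provided data.
By giving domain experts the opportunity to inspect the LLM-generated logic program and modify it where needed, our CDSS ensures that the diagnosis is not only accurate but also interpretable.
We compare it experimentally with two baseline ways of using LLMs: diagnosing patients with an LLM-only approach, and using the LLM-generated logic program without expert inspection.
The results show that, although LLMs are extremely useful for generating candidate logic programs, those programs still require expert inspection and modification to guarantee faithfulness to the official diagnostic manuals.
In addition, the direct use of patient data in LLMs raises ethical concerns, underscoring the need for a safer hybrid approach such as the proposed method.

Summary (original)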

We propose a clinical decision support system (CDSS) for mental health diagnosis that combines the strengths of large language models (LLMs) and constraint logic programming (CLP). Having a CDSS is important because of the high complexity of diagnostic manuals used by mental health professionals and the danger of diagnostic errors. Our CDSS is a software tool that uses an LLM to translate diagnostic manuals to a logic program and solves the program using an off-the-shelf CLP engine to query a patient’s diagnosis based on the encoded rules and provided data. By giving domain experts the opportunity to inspect the LLM-generated logic program, and making modifications when needed, our CDSS ensures that the diagnosis is not only accurate but also interpretable. We experimentally compare it with two baseline approaches of using LLMs: diagnosing patients using the LLM-only approach, and using the LLM-generated logic program but without expert inspection. The results show that, while LLMs are extremely useful in generating candidate logic programs, these programs still require expert inspection and modification to guarantee faithfulness to the official diagnostic manuals. Additionally, ethical concerns arise from the direct use of patient data in LLMs, underscoring the need for a safer hybrid approach like our proposed method.

arXiv information

Authors	Brian Hyeongseok Kim, Chao Wang
Published	2025-02-21 17:32:46+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google

Categories: cs.AI, cs.LO

Pastiche Novel Generation Creating: Fan Fiction You Love in Your Favorite Author’s Style

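Posted: February 24, 2025 | Author: jarxiv

Summary

Great novels create immersive worlds with rich character arcs, well-structured plots, and nuanced writing styles.
However, current novel generation methods often rely on brief, simplistic story outlines and fill in the details with plain, generic language.
To bridge this gap, we introduce the task of Pastiche Novel Generation, which requires generated novels to imitate the distinctive features of the original work, including understanding character profiles, predicting plausible plot developments, and writing concrete details in vivid, expressive language.
To achieve this, we propose WriterAgent, a novel-generation system designed to master the core aspects of literary pastiche.
WriterAgent is trained through a curriculum learning paradigm that progresses from low-level stylistic mastery to high-level narrative coherence.
Its key tasks include language style learning, character modeling, plot planning, and stylish writing, which together ensure comprehensive narrative control.
To support this, WriterAgent leverages the WriterLoRA framework, an extension of LoRA with hierarchical and cumulative task-specific modules, each specializing in a different narrative aspect.
We evaluate WriterAgent on multilingual classics such as Harry Potter and Dream of the Red Chamber, demonstrating its superiority over baselines in capturing the target author's settings, character dynamics, and writing style to produce coherent, faithful narratives.

Summary (original)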

Great novels create immersive worlds with rich character arcs, well-structured plots, and nuanced writing styles. However, current novel generation methods often rely on brief, simplistic story outlines and generate details using plain, generic language. To bridge this gap, we introduce the task of Pastiche Novel Generation, which requires the generated novels to imitate the distinctive features of the original work, including understanding character profiles, predicting plausible plot developments, and writing concrete details using vivid, expressive language. To achieve this, we propose WriterAgent, a novel generation system designed to master the core aspects of literary pastiche. WriterAgent is trained through a curriculum learning paradigm, progressing from low-level stylistic mastery to high-level narrative coherence. Its key tasks include language style learning, character modeling, plot planning, and stylish writing, ensuring comprehensive narrative control. To support this, WriterAgent leverages the WriterLoRA framework, an extension of LoRA with hierarchical and cumulative task-specific modules, each specializing in a different narrative aspect. We evaluate WriterAgent on multilingual classics like Harry Potter and Dream of the Red Chamber, demonstrating its superiority over baselines in capturing the target author’s settings, character dynamics, and writing style to produce coherent, faithful narratives.

arXiv information

Authors	Xueran Han, Yuhan Liu, Mingzhe Li, Wei Liu, Sen Hu, Rui Yan, Zhiqiang Xu, Xiuying Chen
Published	2025-02-21 17:40:42+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google

Categories: cs.AI, cs.CL

Probe Pruning: Accelerating LLMs through Dynamic Pruning via Model-Probing

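Posted: February 24, 2025 | Author: jarxiv

Summary

We introduce Probe Pruning (PP), a novel framework for online, dynamic, structured pruning of large language models (LLMs) that is applied batch by batch.
PP builds on the insight that not all samples and tokens contribute equally to the model's output: probing a small portion of each batch effectively identifies the crucial weights, enabling dynamic pruning tailored to each batch.
It consists of three main stages: probing, history-informed pruning, and full inference.
In the probing stage, PP selects a small yet crucial set of hidden states, based on residual importance, and runs them a few model layers ahead.
In the history-informed pruning stage, PP strategically integrates the probing states with historical states.
It then structurally prunes weights based on the integrated states and the PP importance score, a metric developed specifically to assess how important each weight channel is for maintaining performance.
In the final stage, full inference is performed on the remaining weights.
A major advantage of PP is its compatibility with existing models, since it operates without additional neural network modules or fine-tuning.
Comprehensive evaluations of PP on LLaMA-2/3 and OPT models show that even minimal probing, using just 1.5% of FLOPs, can substantially improve the efficiency of structured pruning of LLMs.
For example, when evaluated on LLaMA-2-7B with WikiText2, PP achieves a 2.56 times lower ratio of performance degradation per unit of runtime reduction than the state-of-the-art method at a 40% pruning ratio.
Our code is available at https://github.com/Qi-Le1/Probe_Pruning.

Summary (original)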

We introduce Probe Pruning (PP), a novel framework for online, dynamic, structured pruning of Large Language Models (LLMs) applied in a batch-wise manner. PP leverages the insight that not all samples and tokens contribute equally to the model’s output, and probing a small portion of each batch effectively identifies crucial weights, enabling tailored dynamic pruning for different batches. It comprises three main stages: probing, history-informed pruning, and full inference. In the probing stage, PP selects a small yet crucial set of hidden states, based on residual importance, to run a few model layers ahead. During the history-informed pruning stage, PP strategically integrates the probing states with historical states. Subsequently, it structurally prunes weights based on the integrated states and the PP importance score, a metric developed specifically to assess the importance of each weight channel in maintaining performance. In the final stage, full inference is conducted on the remaining weights. A major advantage of PP is its compatibility with existing models, as it operates without requiring additional neural network modules or fine-tuning. Comprehensive evaluations of PP on LLaMA-2/3 and OPT models reveal that even minimal probing, using just 1.5% of FLOPs, can substantially enhance the efficiency of structured pruning of LLMs. For instance, when evaluated on LLaMA-2-7B with WikiText2, PP achieves a 2.56 times lower ratio of performance degradation per unit of runtime reduction compared to the state-of-the-art method at a 40% pruning ratio. Our code is available at https://github.com/Qi-Le1/Probe_Pruning.
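
The three stages described in the abstract (probing, history-informed pruning, full inference) can be illustrated with a toy sketch. This is a simplified stand-in under stated assumptions, not the released implementation: the importance score used here (integrated squared activation times squared weight-column norm), the `momentum` blending of probe and historical statistics, and all function names are hypothetical, and the probe is simply a given subset of the batch rather than one selected by residual importance.

```python
def channel_energy(states):
    """Mean squared activation per channel over a list of hidden-state vectors."""
    n = len(states)
    return [sum(s[c] ** 2 for s in states) / n for c in range(len(states[0]))]


def select_channels(probe_states, history, weight, prune_ratio, momentum=0.5):
    """Pick the input channels to keep for one batch.

    probe_states: hidden states of the small probe subset of the batch
    history:      running per-channel statistics from earlier batches
    weight:       weight matrix as a list of rows (out_features x in_features)
    """
    # Stage 1 (probing): statistics from the probe only.
    probe = channel_energy(probe_states)
    # Stage 2 (history-informed): blend probe and historical statistics.
    integrated = [momentum * h + (1 - momentum) * p
                  for h, p in zip(history, probe)]
    # Hypothetical stand-in importance score, NOT the paper's PP score.
    col_norm = [sum(row[c] ** 2 for row in weight)
                for c in range(len(integrated))]
    score = [e * w for e, w in zip(integrated, col_norm)]
    n_keep = len(score) - int(prune_ratio * len(score))
    ranked = sorted(range(len(score)), key=lambda c: score[c], reverse=True)
    return sorted(ranked[:n_keep]), integrated


def pruned_forward(x, weight, keep):
    """Stage 3 (full inference): matrix-vector product over kept channels only."""
    return [sum(row[c] * x[c] for c in keep) for row in weight]


# Hand-made toy values for illustration.
weight = [[1.0, 0.5, 0.0, 2.0],
          [0.0, 0.5, 1.0, 2.0]]
probe_states = [[1.0, 0.1, 0.2, 1.0],
                [3.0, 0.1, 0.2, 1.0]]
history = [1.0, 1.0, 1.0, 1.0]

keep, history = select_channels(probe_states, history, weight, prune_ratio=0.5)
out = pruned_forward([1.0, 1.0, 1.0, 1.0], weight, keep)
print(keep, out)
```

Because the kept channels are recomputed from each batch's probe, different batches can end up with different pruned sub-networks, which is the "dynamic" aspect the abstract emphasizes.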

arXiv information

Authors	Qi Le, Enmao Diao, Ziyan Wang, Xinran Wang, Jie Ding, Li Yang, Ali Anwar
Published	2025-02-21 17:41:21+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google

Categories: cs.AI, cs.CL, cs.LG

Securing Healthcare with Deep Learning: A CNN-Based Model for medical IoT Threat Detection

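Posted: February 24, 2025 | Author: jarxiv

Summary

The growing integration of the Internet of Medical Things (IoMT) into healthcare systems has significantly enhanced patient care, but it has also introduced critical cybersecurity challenges.
This paper presents a novel approach based on convolutional neural networks (CNNs) for detecting cyberattacks in IoMT environments.
Unlike previous studies, which mainly used traditional machine learning (ML) models or simpler deep neural networks (DNNs), the proposed model leverages the capabilities of CNNs to analyze the temporal characteristics of network traffic data effectively.
Trained and evaluated on the CICIoMT2024 dataset, which comprises 18 distinct types of cyberattacks across a range of IoMT devices, the proposed CNN model outperforms previous state-of-the-art methods, achieving 99% accuracy in binary, categorical, and multiclass classification tasks.
This performance surpasses that of conventional ML models such as Logistic Regression, AdaBoost, DNNs, and Random Forests.
These findings highlight the potential of CNNs to substantially improve IoMT cybersecurity, thereby ensuring the protection and integrity of connected healthcare systems.

Summary (original)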

The increasing integration of the Internet of Medical Things (IoMT) into healthcare systems has significantly enhanced patient care but has also introduced critical cybersecurity challenges. This paper presents a novel approach based on Convolutional Neural Networks (CNNs) for detecting cyberattacks within IoMT environments. Unlike previous studies that predominantly utilized traditional machine learning (ML) models or simpler Deep Neural Networks (DNNs), the proposed model leverages the capabilities of CNNs to effectively analyze the temporal characteristics of network traffic data. Trained and evaluated on the CICIoMT2024 dataset, which comprises 18 distinct types of cyberattacks across a range of IoMT devices, the proposed CNN model demonstrates superior performance compared to previous state-of-the-art methods, achieving a perfect accuracy of 99% in binary, categorical, and multiclass classification tasks. This performance surpasses that of conventional ML models such as Logistic Regression, AdaBoost, DNNs, and Random Forests. These findings highlight the potential of CNNs to substantially improve IoMT cybersecurity, thereby ensuring the protection and integrity of connected healthcare systems.

arXiv information

Authors	Alireza Mohamadi, Hosna Ghahramani, Seyyed Amir Asghari, Mehdi Aminian
Published	2025-02-21 17:42:34+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google

Categories: cs.AI, cs.CR, cs.LG

Extraction multi-étiquettes de relations en utilisant des couches de Transformer

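Posted: February 24, 2025 | Author: jarxiv

Summary

This article presents the BTransformer18 model, a deep learning architecture designed for multi-label relation extraction in French texts.
Our approach combines the contextual representation capabilities of pre-trained language models from the BERT family, such as BERT, RoBERTa, and their French counterparts CamemBERT and FlauBERT, with the power of Transformer encoders to capture long-term dependencies between tokens.
Experiments on the dataset from the TextMine'25 challenge show that the model achieves superior performance, particularly with CamemBERT-Large, which reaches a macro F1 score of 0.654 and surpasses the results obtained with FlauBERT-Large.
These results demonstrate the effectiveness of the approach for automatically extracting complex relations from intelligence reports.

Summary (original)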

In this article, we present the BTransformer18 model, a deep learning architecture designed for multi-label relation extraction in French texts. Our approach combines the contextual representation capabilities of pre-trained language models from the BERT family – such as BERT, RoBERTa, and their French counterparts CamemBERT and FlauBERT – with the power of Transformer encoders to capture long-term dependencies between tokens. Experiments conducted on the dataset from the TextMine’25 challenge show that our model achieves superior performance, particularly when using CamemBERT-Large, with a macro F1 score of 0.654, surpassing the results obtained with FlauBERT-Large. These results demonstrate the effectiveness of our approach for the automatic extraction of complex relations in intelligence reports.
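
The macro F1 score reported above averages the per-label F1 over all labels, so rare relation types count as much as frequent ones. A minimal sketch for the multi-label case follows. The relation label names and the toy gold/predicted sets are hypothetical, not taken from the TextMine'25 data, and scoring a label with no gold or predicted instances as 0.0 is an assumption of this sketch.

```python
def macro_f1(gold, pred, labels):
    """Unweighted mean of per-label F1 for multi-label predictions.

    gold, pred: lists of label sets, one set per example.
    """
    per_label = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if label in g and label in p)
        fp = sum(1 for g, p in zip(gold, pred) if label not in g and label in p)
        fn = sum(1 for g, p in zip(gold, pred) if label in g and label not in p)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); assumed 0.0 when the label never occurs.
        per_label.append(2 * tp / denom if denom else 0.0)
    return sum(per_label) / len(per_label)


# Hypothetical relation labels and toy annotations.
labels = ["located_in", "member_of", "operates"]
gold = [{"located_in", "member_of"}, {"located_in"}, {"operates"}]
pred = [{"located_in"}, {"located_in", "member_of"}, set()]

score = macro_f1(gold, pred, labels)
print(f"macro F1 = {score:.3f}")
```

In this toy case one label is predicted perfectly and the other two are never matched, so the macro average is pulled down to one third even though most individual predictions involve the well-handled label.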

arXiv information

Authors	Ngoc Luyen Le, Gildas Tagny Ngompé
Published	2025-02-21 17:42:51+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google

Categories: cs.AI, cs.CL

Paradigms of AI Evaluation: Mapping Goals, Methodologies and Culture

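Posted: February 24, 2025 | Author: jarxiv

Summary

Research in AI evaluation has grown increasingly complex and multidisciplinary, attracting researchers with diverse backgrounds and objectives.
As a result, divergent evaluation paradigms have emerged, often developing in isolation, adopting conflicting terminologies, and overlooking one another's contributions.
This fragmentation has led to insular research trajectories and communication barriers, both among the different paradigms and with the general public, contributing to unmet expectations for deployed AI systems.
To help bridge this insularity, this paper surveys recent work across the AI evaluation landscape and identifies six main paradigms.
We characterize major recent contributions within each paradigm along key dimensions related to their goals, methodologies, and research cultures.
By clarifying the unique combination of questions and approaches associated with each paradigm, we aim to raise awareness of the breadth of current evaluation approaches and to foster cross-pollination between paradigms.
We also identify potential gaps in the field to inspire future research directions.

Summary (original)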

Research in AI evaluation has grown increasingly complex and multidisciplinary, attracting researchers with diverse backgrounds and objectives. As a result, divergent evaluation paradigms have emerged, often developing in isolation, adopting conflicting terminologies, and overlooking each other’s contributions. This fragmentation has led to insular research trajectories and communication barriers both among different paradigms and with the general public, contributing to unmet expectations for deployed AI systems. To help bridge this insularity, in this paper we survey recent work in the AI evaluation landscape and identify six main paradigms. We characterise major recent contributions within each paradigm across key dimensions related to their goals, methodologies and research cultures. By clarifying the unique combination of questions and approaches associated with each paradigm, we aim to increase awareness of the breadth of current evaluation approaches and foster cross-pollination between different paradigms. We also identify potential gaps in the field to inspire future research directions.

arXiv information

Authors	John Burden, Marko Tešić, Lorenzo Pacchiardi, José Hernández-Orallo
Published	2025-02-21 17:44:05+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google

Categories: cs.AI, cs.LG
