jarxiv | Japanese arxiv | Page 1366

Investigating User Perspectives on Differentially Private Text Privatization

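Posted: March 13, 2025 | Author: jarxiv

Summary

Recent literature has seen a considerable rise in $\textit{Differentially Private Natural Language Processing}$ (DP NLP).
This includes DP text privatization, in which potentially sensitive input texts are transformed under DP to produce privatized output texts that ideally mask sensitive information $\textit{and}$ preserve the original semantics.
Despite ongoing work on the open challenges in DP text privatization, little work addresses user perceptions of this technology, a crucial aspect that acts as the final barrier to practical adoption.
In this work, we conduct a survey study with 721 laypersons around the world, investigating how the factors of $\textit{scenario}$, $\textit{data sensitivity}$, $\textit{mechanism type}$, and $\textit{reason for data collection}$ affect user preferences for text privatization.
We find that while all of these factors play a role in influencing privacy decisions, users are highly sensitive to the utility and coherence of the private output texts.
Our findings highlight the socio-technical factors that must be considered in DP NLP research and open the door to further user-based investigations.

Summary (original)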

Recent literature has seen a considerable uptick in $\textit{Differentially Private Natural Language Processing}$ (DP NLP). This includes DP text privatization, where potentially sensitive input texts are transformed under DP to achieve privatized output texts that ideally mask sensitive information $\textit{and}$ maintain original semantics. Despite continued work to address the open challenges in DP text privatization, there remains a scarcity of work addressing user perceptions of this technology, a crucial aspect which serves as the final barrier to practical adoption. In this work, we conduct a survey study with 721 laypersons around the globe, investigating how the factors of $\textit{scenario}$, $\textit{data sensitivity}$, $\textit{mechanism type}$, and $\textit{reason for data collection}$ impact user preferences for text privatization. We learn that while all these factors play a role in influencing privacy decisions, users are highly sensitive to the utility and coherence of the private output texts. Our findings highlight the socio-technical factors that must be considered in the study of DP NLP, opening the door to further user-based investigations going forward.

arXiv information

Authors	Stephen Meisenbacher, Alexandra Klymenko, Alexander Karpp, Florian Matthes
Published	2025-03-12 12:33:20+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google

Categories: cs.CL, cs.HC

An Evaluation of LLMs for Detecting Harmful Computing Terms

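Posted: March 13, 2025 | Author: jarxiv

Summary

Detecting harmful and non-inclusive terminology in technical contexts is important for fostering inclusive environments in computing.
This study examines how model architecture affects harmful language detection by evaluating a curated database of technical terms, each paired with a specific use case.
We tested a range of encoder, decoder, and encoder-decoder language models, including BERT-base-uncased, RoBERTa large-mnli, Gemini Flash 1.5 and 2.0, GPT-4, Claude AI Sonnet 3.5, T5-large, and BART-large-mnli.
Each model was given a standardized prompt to identify harmful and non-inclusive language across 64 terms.
The results show that decoder models, particularly Gemini Flash 2.0 and Claude AI, excel at nuanced contextual analysis, while encoder models such as BERT show strong pattern recognition but struggle with classification certainty.
We discuss the implications of these findings for improving automated detection tools and highlight model-specific strengths and limitations in fostering inclusive communication in technical domains.

Summary (original)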

Detecting harmful and non-inclusive terminology in technical contexts is critical for fostering inclusive environments in computing. This study explores the impact of model architecture on harmful language detection by evaluating a curated database of technical terms, each paired with specific use cases. We tested a range of encoder, decoder, and encoder-decoder language models, including BERT-base-uncased, RoBERTa large-mnli, Gemini Flash 1.5 and 2.0, GPT-4, Claude AI Sonnet 3.5, T5-large, and BART-large-mnli. Each model was presented with a standardized prompt to identify harmful and non-inclusive language across 64 terms. Results reveal that decoder models, particularly Gemini Flash 2.0 and Claude AI, excel in nuanced contextual analysis, while encoder models like BERT exhibit strong pattern recognition but struggle with classification certainty. We discuss the implications of these findings for improving automated detection tools and highlight model-specific strengths and limitations in fostering inclusive communication in technical domains.

arXiv information

Authors	Joshua Jacas, Hana Winchester, Alicia Boyd, Brittany Johnson
Published	2025-03-12 12:36:45+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google

Categories: cs.CL, cs.ET

Safer or Luckier? LLMs as Safety Evaluators Are Not Robust to Artifacts

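Posted: March 13, 2025 | Author: jarxiv

Summary

Large language models (LLMs) are increasingly used as automated evaluators of the safety of generated content, yet their reliability in this role remains uncertain.
This study evaluates a diverse set of 11 LLM judge models across critical safety domains, examining three key aspects: self-consistency in repeated judging tasks, alignment with human judgments, and susceptibility to input artifacts such as apologetic or verbose phrasing.
Our findings show that biases in LLM judges can significantly distort the final verdict on which content source is safer, undermining the validity of comparative evaluations.
Notably, apologetic language artifacts alone can skew evaluator preferences by up to 98%.
Contrary to expectations, larger models do not consistently show greater robustness, while smaller models sometimes show higher resistance to specific artifacts.
To mitigate these robustness issues, we investigate jury-based evaluations that aggregate decisions from multiple models.
Although this approach improves robustness and strengthens alignment with human judgments, artifact sensitivity persists even with the best jury configurations.
These results highlight the urgent need for diversified, artifact-resistant methodologies to ensure reliable safety assessments.

Summary (original)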

Large Language Models (LLMs) are increasingly employed as automated evaluators to assess the safety of generated content, yet their reliability in this role remains uncertain. This study evaluates a diverse set of 11 LLM judge models across critical safety domains, examining three key aspects: self-consistency in repeated judging tasks, alignment with human judgments, and susceptibility to input artifacts such as apologetic or verbose phrasing. Our findings reveal that biases in LLM judges can significantly distort the final verdict on which content source is safer, undermining the validity of comparative evaluations. Notably, apologetic language artifacts alone can skew evaluator preferences by up to 98%. Contrary to expectations, larger models do not consistently exhibit greater robustness, while smaller models sometimes show higher resistance to specific artifacts. To mitigate LLM evaluator robustness issues, we investigate jury-based evaluations aggregating decisions from multiple models. Although this approach both improves robustness and enhances alignment to human judgements, artifact sensitivity persists even with the best jury configurations. These results highlight the urgent need for diversified, artifact-resistant methodologies to ensure reliable safety assessments.
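
The jury idea and the self-consistency check described above can be illustrated with a minimal Python sketch. The abstract does not say how the authors aggregate votes or score consistency, so plain majority voting and the "share of the most common label" measure below are assumptions, and the function names `jury_verdict` and `self_consistency` are hypothetical.

```python
from collections import Counter


def jury_verdict(votes):
    """Aggregate per-judge labels by simple majority vote.

    Assumption: a plain majority rule. Returns None when the top two
    labels are tied, so the caller can decide how to break ties.
    """
    if not votes:
        raise ValueError("at least one vote is required")
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


def self_consistency(repeated_votes):
    """Fraction of repeated judgments that match the most common label."""
    if not repeated_votes:
        raise ValueError("at least one judgment is required")
    top_count = Counter(repeated_votes).most_common(1)[0][1]
    return top_count / len(repeated_votes)


# Three hypothetical judges compare two responses, "A" and "B".
verdict = jury_verdict(["A", "A", "B"])
# One hypothetical judge is asked the same question four times.
consistency = self_consistency(["safe", "safe", "unsafe", "safe"])
```

Returning None on a tie keeps the sketch honest about a case that a real jury configuration would have to resolve explicitly, for example by using an odd number of judges.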

arXiv information

Authors	Hongyu Chen, Seraphina Goldfarb-Tarrant
Published	2025-03-12 12:49:02+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google

Categories: cs.AI, cs.CL

MOAT: Evaluating LMMs for Capability Integration and Instruction Grounding

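Posted: March 13, 2025 | Author: jarxiv

Summary

Large multimodal models (LMMs) have shown significant potential as generalists in vision-language (VL) tasks.
However, a large gap remains between state-of-the-art LMMs and human performance on complex tasks that require a combination of fundamental VL capabilities, as well as on tasks that involve grounding complex instructions.
To thoroughly investigate the human-LMM gap and its underlying causes, we propose MOAT, a diverse benchmark of complex real-world VL tasks that are challenging for LMMs.
Specifically, the tasks in MOAT require LMMs to engage in generalist problem solving by integrating fundamental VL capabilities such as reading text, counting, understanding spatial relations, and grounding textual and visual instructions. All of these abilities fit into a taxonomy we propose that contains 10 fundamental VL capabilities, enabling MOAT to provide a fine-grained view of LMMs' strengths and weaknesses.
In addition, MOAT is the first benchmark to explicitly evaluate LMMs' ability to ground complex text and visual instructions, which is essential to many real-world applications.
We evaluated over 20 proprietary and open-source LMMs, as well as humans, on MOAT and found that humans achieved 82.7% accuracy while the best-performing LMM (OpenAI o1) achieved only 38.8%.
To guide future model development, we analyze common trends in our results and discuss the underlying causes of the observed performance gaps between LMMs and humans, focusing on which VL capability forms the bottleneck in complex tasks, whether test-time scaling improves performance on MOAT, and how tiling harms LMMs' ability to count.
Code and data are available at https://cambrian-yzt.github.io/MOAT.

Summary (original)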

Large multimodal models (LMMs) have demonstrated significant potential as generalists in vision-language (VL) tasks. However, there remains a significant gap between state-of-the-art LMMs and human performance when it comes to complex tasks that require a combination of fundamental VL capabilities, as well as tasks involving the grounding of complex instructions. To thoroughly investigate the human-LMM gap and its underlying causes, we propose MOAT, a diverse benchmark with complex real-world VL tasks that are challenging for LMMs. Specifically, the tasks in MOAT require LMMs to engage in generalist problem solving by integrating fundamental VL capabilities such as reading text, counting, understanding spatial relations, grounding textual and visual instructions, etc. All these abilities fit into a taxonomy proposed by us that contains 10 fundamental VL capabilities, enabling MOAT to provide a fine-grained view of LMMs’ strengths and weaknesses. Besides, MOAT is the first benchmark to explicitly evaluate LMMs’ ability to ground complex text and visual instructions, which is essential to many real-world applications. We evaluate over 20 proprietary and open source LMMs, as well as humans, on MOAT, and found that humans achieved 82.7% accuracy while the best performing LMM (OpenAI o1) achieved only 38.8%. To guide future model development, we analyze common trends in our results and discuss the underlying causes of observed performance gaps between LMMs and humans, focusing on which VL capability forms the bottleneck in complex tasks, whether test time scaling improves performance on MOAT, and how tiling harms LMMs’ capability to count. Code and data are available at https://cambrian-yzt.github.io/MOAT.

arXiv information

Authors	Zhoutong Ye, Mingze Sun, Huan-ang Gao, Chun Yu, Yuanchun Shi
Published	2025-03-12 12:49:31+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google

Categories: cs.AI, cs.CL, cs.CV

RetSTA: An LLM-Based Approach for Standardizing Clinical Fundus Image Reports

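Posted: March 13, 2025 | Author: jarxiv

Summary

Standardizing clinical reports is important for improving healthcare quality and facilitating data integration.
The lack of unified standards for format, terminology, and style is a major challenge in clinical fundus diagnostic reports and makes the data harder for large language models (LLMs) to understand.
To address this, we construct a bilingual standard terminology containing fundus clinical terms and descriptions commonly used in clinical diagnosis.
We then establish two models, RetSTA-7B-Zero and RetSTA-7B.
RetSTA-7B-Zero, fine-tuned on an augmented dataset that simulates clinical scenarios, demonstrates strong standardization behavior.
However, it is limited in its ability to cover a wider range of diseases.
To further enhance standardization performance, we build RetSTA-7B, which integrates a substantial amount of standardized data generated by RetSTA-7B-Zero together with the corresponding English data, covering diverse and complex clinical scenarios and achieving report-level standardization for the first time.
Experimental results show that RetSTA-7B outperforms the other compared LLMs on the bilingual standardization task, which validates its superior performance and generalizability.
The checkpoints are available at https://github.com/AB-Story/RetSTA-7B.

Summary (original)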

Standardization of clinical reports is crucial for improving the quality of healthcare and facilitating data integration. The lack of unified standards, including format, terminology, and style, is a great challenge in clinical fundus diagnostic reports, which increases the difficulty for large language models (LLMs) to understand the data. To address this, we construct a bilingual standard terminology, containing fundus clinical terms and commonly used descriptions in clinical diagnosis. Then, we establish two models, RetSTA-7B-Zero and RetSTA-7B. RetSTA-7B-Zero, fine-tuned on an augmented dataset simulating clinical scenarios, demonstrates powerful standardization behaviors. However, it encounters a challenge of limitation to cover a wider range of diseases. To further enhance standardization performance, we build RetSTA-7B, which integrates a substantial amount of standardized data generated by RetSTA-7B-Zero along with corresponding English data, covering diverse complex clinical scenarios and achieving report-level standardization for the first time. Experimental results demonstrate that RetSTA-7B outperforms other compared LLMs in bilingual standardization task, which validates its superior performance and generalizability. The checkpoints are available at https://github.com/AB-Story/RetSTA-7B.

arXiv information

Authors	Jiushen Cai, Weihang Zhang, Hanruo Liu, Ningli Wang, Huiqi Li
Published	2025-03-12 13:00:57+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google

Categories: cs.AI, cs.CL

In Context Learning and Reasoning for Symbolic Regression with Large Language Models

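Posted: March 13, 2025 | Author: jarxiv

Summary

Large language models (LLMs) are transformer-based machine learning models that have shown remarkable performance on tasks for which they were not explicitly trained.
Here, we explore the potential of LLMs to perform symbolic regression, a machine learning method for finding simple and accurate equations from datasets.
We prompt GPT-4 to suggest expressions from data, which are then optimized and evaluated using external Python tools.
These results are fed back to GPT-4, which proposes improved expressions while optimizing for complexity and loss.
Using chain-of-thought prompting, we instruct GPT-4 to analyze the data, prior expressions, and the scientific context (expressed in natural language) for each problem before generating new expressions.
We evaluated the workflow on the rediscovery of five well-known scientific equations from experimental data and on an additional dataset without a known equation.
GPT-4 rediscovered all five equations and generally performed better when prompted to use a scratchpad and consider the scientific context.
We show how strategic prompting improves the model's performance and how the natural language interface simplifies integrating theory with data.
We also observe that theory can sometimes offset noisy data and that, in other cases, data can make up for poor context.
Although this approach does not outperform established SR programs when the target equations are more complex, LLMs can still iterate toward improved solutions while following instructions and incorporating scientific context expressed in natural language.

Summary (original)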

Large Language Models (LLMs) are transformer-based machine learning models that have shown remarkable performance in tasks for which they were not explicitly trained. Here, we explore the potential of LLMs to perform symbolic regression — a machine-learning method for finding simple and accurate equations from datasets. We prompt GPT-4 to suggest expressions from data, which are then optimized and evaluated using external Python tools. These results are fed back to GPT-4, which proposes improved expressions while optimizing for complexity and loss. Using chain-of-thought prompting, we instruct GPT-4 to analyze the data, prior expressions, and the scientific context (expressed in natural language) for each problem before generating new expressions. We evaluated the workflow in rediscovery of five well-known scientific equations from experimental data, and on an additional dataset without a known equation. GPT-4 successfully rediscovered all five equations, and in general, performed better when prompted to use a scratchpad and consider scientific context. We demonstrate how strategic prompting improves the model’s performance and how the natural language interface simplifies integrating theory with data. We also observe how theory can sometimes offset noisy data and, in other cases, data can make up for poor context. Although this approach does not outperform established SR programs where target equations are more complex, LLMs can nonetheless iterate toward improved solutions while following instructions and incorporating scientific context in natural language.
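
The evaluation step of the loop described above, where proposed expressions are scored on both loss and complexity, might look roughly like the following Python sketch. It is an illustration under stated assumptions, not the authors' tooling: the weighted-sum score, the hand-assigned complexity counts, and the names `mean_squared_error` and `rank_candidates` are all hypothetical, and the candidate list stands in for expressions an LLM might propose.

```python
def mean_squared_error(f, xs, ys):
    """Average squared residual of candidate function f over the data."""
    return sum((f(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def rank_candidates(candidates, xs, ys, complexity_weight=0.01):
    """Rank (expression, complexity, callable) candidates, best first.

    Assumption: loss and complexity are combined as a weighted sum.
    A Pareto front over the two objectives would be another option.
    """
    scored = []
    for expression, complexity, f in candidates:
        loss = mean_squared_error(f, xs, ys)
        score = loss + complexity_weight * complexity
        scored.append((score, loss, complexity, expression))
    return sorted(scored)


# Toy data generated from y = 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 2.0, 4.0, 6.0]

# Stand-ins for proposed expressions, with hand-counted complexities.
candidates = [
    ("x**2", 3, lambda x: x ** 2),
    ("2*x", 3, lambda x: 2 * x),
    ("2*x + 0*x**3", 9, lambda x: 2 * x + 0 * x ** 3),
]

ranking = rank_candidates(candidates, xs, ys)
best_expression = ranking[0][3]
```

In a feedback loop of the kind the abstract describes, a ranking like this would be returned to the model as context for its next round of proposals.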

arXiv information

Authors	Samiha Sharlin, Tyler R. Josephson
Published	2025-03-12 13:14:22+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google

Categories: cs.AI, cs.CL

Evaluating Automatic Speech Recognition Systems for Korean Meteorological Experts

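Posted: March 13, 2025 | Author: jarxiv

Summary

This paper explores integrating automatic speech recognition (ASR) into natural language query systems to improve weather forecasting efficiency for Korean meteorologists.
We address challenges in developing ASR systems for the Korean weather domain, specifically its specialized vocabulary and the linguistic intricacies of Korean.
To tackle these issues, we constructed an evaluation dataset of spoken queries recorded by native Korean speakers.
Using this dataset, we assessed various configurations of a multilingual ASR model family and identified performance limitations related to domain-specific terminology.
We then implemented a simple text-to-speech-based data augmentation method, which improved recognition of specialized terms while maintaining general-domain performance.
Our contributions include the creation of a domain-specific dataset, comprehensive ASR model evaluations, and an effective augmentation technique.
We believe our work provides a foundation for future advances in ASR for the Korean weather forecasting domain.

Summary (original)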

This paper explores integrating Automatic Speech Recognition (ASR) into natural language query systems to improve weather forecasting efficiency for Korean meteorologists. We address challenges in developing ASR systems for the Korean weather domain, specifically specialized vocabulary and Korean linguistic intricacies. To tackle these issues, we constructed an evaluation dataset of spoken queries recorded by native Korean speakers. Using this dataset, we assessed various configurations of a multilingual ASR model family, identifying performance limitations related to domain-specific terminology. We then implemented a simple text-to-speech-based data augmentation method, which improved the recognition of specialized terms while maintaining general-domain performance. Our contributions include creating a domain-specific dataset, comprehensive ASR model evaluations, and an effective augmentation technique. We believe our work provides a foundation for future advancements in ASR for the Korean weather forecasting domain.

arXiv information

Authors	ChaeHun Park, Hojun Cho, Jaegul Choo
Published	2025-03-12 13:18:04+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google

Categories: cs.CL

Taxonomy, Opportunities, and Challenges of Representation Engineering for Large Language Models

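Posted: March 13, 2025 | Author: jarxiv

Summary

Representation Engineering (RepE) is a novel paradigm for controlling the behavior of LLMs.
Unlike traditional approaches that modify inputs or fine-tune the model, RepE directly manipulates the model's internal representations.
As a result, it may offer more effective, interpretable, data-efficient, and flexible control over model behavior.
We present the first comprehensive survey of RepE for LLMs, reviewing the rapidly growing literature to address key questions.
What RepE methods exist and how do they differ?
To what concepts and problems has RepE been applied?
What are the strengths and weaknesses of RepE compared to other methods?
To answer these, we propose a unified framework that describes RepE as a pipeline comprising representation identification, operationalization, and control.
We posit that while RepE methods offer significant potential, challenges remain, including managing multiple concepts, ensuring reliability, and preserving model performance.
Toward improving RepE, we identify opportunities for experimental and methodological improvements and construct a guide to best practices.

Summary (original)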

Representation Engineering (RepE) is a novel paradigm for controlling the behavior of LLMs. Unlike traditional approaches that modify inputs or fine-tune the model, RepE directly manipulates the model’s internal representations. As a result, it may offer more effective, interpretable, data-efficient, and flexible control over models’ behavior. We present the first comprehensive survey of RepE for LLMs, reviewing the rapidly growing literature to address key questions: What RepE methods exist and how do they differ? For what concepts and problems has RepE been applied? What are the strengths and weaknesses of RepE compared to other methods? To answer these, we propose a unified framework describing RepE as a pipeline comprising representation identification, operationalization, and control. We posit that while RepE methods offer significant potential, challenges remain, including managing multiple concepts, ensuring reliability, and preserving models’ performance. Towards improving RepE, we identify opportunities for experimental and methodological improvements and construct a guide for best practices.

arXiv information

Authors	Jan Wehner, Sahar Abdelnabi, Daniel Tan, David Krueger, Mario Fritz
Published	2025-03-12 13:31:36+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google

Categories: cs.CL, cs.LG

FED: Fast and Efficient Dataset Deduplication Framework with GPU Acceleration

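Posted: March 13, 2025 | Author: jarxiv

Summary

Dataset deduplication plays an important role in improving data quality, which ultimately improves the training performance and efficiency of large language models.
A commonly used method for data deduplication is the MinHash LSH algorithm.
NVIDIA recently introduced a GPU-based MinHash LSH deduplication method, but it remains suboptimal and leaves room for further improvement in processing efficiency.
This paper proposes FED, a GPU-accelerated deduplication framework that optimizes MinHash LSH for GPU clusters and leverages computationally efficient, partially reusable non-cryptographic hash functions.
FED outperforms the CPU-based deduplication tool in SlimPajama (using 64 logical CPU cores) by up to 107.2 times and the GPU-based tool in NVIDIA NeMo Curator by up to 6.3 times when processing 30 million documents on a node with four GPUs.
Notably, our method dramatically accelerates the previously time-consuming MinHash signature generation phase, achieving speed-ups of up to 260 compared to the CPU baseline.
Despite these efficiency gains, FED maintains high deduplication quality: its duplicate document sets reach a Jaccard similarity of over 0.96 with those identified by the standard MinHash algorithm.
In large-scale experiments, deduplication of 1.2 trillion tokens completes in just 6 hours in a four-node, 16-GPU environment.
The related code is publicly available on GitHub (https://github.com/mcrl/FED).

Summary (original)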

Dataset deduplication plays a crucial role in enhancing data quality, ultimately improving the training performance and efficiency of large language models. A commonly used method for data deduplication is the MinHash LSH algorithm. Recently, NVIDIA introduced a GPU-based MinHash LSH deduplication method, but it remains suboptimal, leaving room for further improvement in processing efficiency. This paper proposes a GPU-accelerated deduplication framework, FED, that optimizes MinHash LSH for GPU clusters and leverages computationally efficient, partially reusable non-cryptographic hash functions. FED significantly outperforms the CPU-based deduplication tool in SlimPajama (using 64 logical CPU cores) by up to 107.2 times and the GPU-based tool in NVIDIA NeMo Curator by up to 6.3 times when processing 30 million documents on a node with four GPUs. Notably, our method dramatically accelerates the previously time-consuming MinHash signature generation phase, achieving speed-ups of up to 260 compared to the CPU baseline. Despite these gains in efficiency, FED maintains high deduplication quality, with the duplicate document sets reaching a Jaccard similarity of over 0.96 compared to those identified by the standard MinHash algorithm. In large-scale experiments, the deduplication of 1.2 trillion tokens is completed in just 6 hours in a four-node, 16-GPU environment. The related code is publicly available on GitHub (https://github.com/mcrl/FED).
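
For readers unfamiliar with the MinHash LSH algorithm the abstract builds on, here is a minimal CPU-only Python sketch of the textbook technique. It is not FED's GPU implementation and does not reproduce its reusable hash functions: the seeded BLAKE2b hash is a stand-in chosen for determinism, and the function names, the shingle length, and the parameters `num_perm` and `bands` are assumptions made for illustration.

```python
import hashlib


def shingles(text, k=5):
    """Return the set of character k-grams of a whitespace-normalized text."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def seeded_hash(seed, item):
    # Stand-in for a fast non-cryptographic hash: 64-bit BLAKE2b with a
    # per-seed salt, used here only because it is deterministic.
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items, num_perm=128):
    """One minimum per seeded hash function approximates one permutation."""
    return [min(seeded_hash(seed, x) for x in items) for seed in range(num_perm)]


def estimate_jaccard(sig_a, sig_b):
    """Share of matching signature positions estimates Jaccard similarity."""
    return sum(1 for x, y in zip(sig_a, sig_b) if x == y) / len(sig_a)


def lsh_candidates(signatures, bands=32):
    """Group ids of documents whose signatures agree on at least one band."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = {}
    for doc_id, sig in signatures.items():
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets.setdefault(key, set()).add(doc_id)
    return [ids for ids in buckets.values() if len(ids) > 1]


docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumps over the lazy dog",
    "c": "GPU clusters speed up signature generation a great deal",
}
signatures = {name: minhash_signature(shingles(text)) for name, text in docs.items()}
groups = lsh_candidates(signatures)
```

Signature generation is the step the abstract singles out as the most time-consuming; in this sketch it is the nested loop inside `minhash_signature`, which evaluates every hash function on every shingle.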

arXiv information

Authors	Youngjun Son, Chaewon Kim, Jaejin Lee
Published	2025-03-12 13:36:32+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google

Categories: cs.CL

Multimodal Programming in Computer Science with Interactive Assistance Powered by Large Language Model

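Posted: March 13, 2025 | Author: jarxiv

Summary

LLM chatbot interfaces let students get instant, interactive help with homework, but careless use may not advance educational objectives.
In this study, an interactive homework help system based on DeepSeek R1 is developed and first deployed for students enrolled in a large introductory computer science programming course.
In addition to an assist button in a well-known code editor, our assistant also has a feedback option in our command-line automatic evaluator.
It wraps student work in a personalized prompt that advances our educational objectives without giving away answers immediately.
We found that the assistant can recognize students' conceptual difficulties and provide ideas, plans, and template code in pedagogically appropriate ways.
However, among other mistakes, it occasionally labels correct student code as incorrect or encourages students to use approaches that are correct but inappropriate for the lesson, which can lead to long and frustrating journeys for the students.
After discussing a number of development and deployment issues, we provide our conclusions and future actions.

Summary (original)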

LLM chatbot interfaces allow students to get instant, interactive assistance with homework, but doing so carelessly may not advance educational objectives. In this study, an interactive homework help system based on DeepSeek R1 is developed and first implemented for students enrolled in a large computer science beginning programming course. In addition to an assist button in a well-known code editor, our assistant also has a feedback option in our command-line automatic evaluator. It wraps student work in a personalized prompt that advances our educational objectives without offering answers straight away. We have discovered that our assistant can recognize students’ conceptual difficulties and provide ideas, plans, and template code in pedagogically appropriate ways. However, among other mistakes, it occasionally incorrectly labels the correct student code as incorrect or encourages students to use correct-but-lesson-inappropriate approaches, which can lead to long and frustrating journeys for the students. After discussing many development and deployment issues, we provide our conclusions and future actions.

arXiv information

Authors	Rajan Das Gupta, Md. Tanzib Hosain, M. F. Mridha, Salah Uddin Ahmed
Published	2025-03-12 13:42:46+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google

Categories: cs.CL, cs.HC
