Simulating and Analysing Human Survey Responses with Large Language Models: A Case Study in Energy Stated Preference

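Summary

Survey research plays a crucial role in studies by capturing consumer preferences and informing policy decisions.
Stated preference (SP) surveys help researchers understand how individuals make trade-offs in hypothetical, potentially futuristic, scenarios.
However, traditional methods are costly and time-consuming, and they are affected by respondent fatigue and ethical constraints.
Large language models (LLMs) have shown remarkable capabilities in generating human-like responses, prompting interest in their use in survey research.
This study investigates LLMs for simulating consumer choices in energy-related SP surveys and explores their integration into data collection and analysis workflows.
Test scenarios were designed to assess the simulation performance of several LLMs (LLaMA 3.1, Mistral, GPT-3.5, DeepSeek-R1) at individual and aggregated levels, considering prompt design, in-context learning (ICL), chain-of-thought (CoT) reasoning, model types, integration with traditional choice models, and potential biases.
LLMs achieve accuracy above random guessing, but performance remains insufficient for practical simulation use.
Cloud-based LLMs do not consistently outperform smaller local models.
DeepSeek-R1 achieves the highest average accuracy (77%) and outperforms non-reasoning LLMs in accuracy, factor identification, and choice distribution alignment.
Previous SP choices are the most effective input.
Longer prompts with more factors reduce accuracy.
Mixed logit models can support LLM prompt refinement.
Reasoning LLMs show potential in data analysis by indicating factor significance, offering a qualitative complement to statistical models.
Despite these limitations, pre-trained LLMs offer scalability and require minimal historical data.
Future work should refine prompts, further explore CoT reasoning, and investigate fine-tuning techniques.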
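
The distinction between individual-level and aggregated performance can be illustrated with a minimal Python sketch. The function names, the three-alternative choice set, and the toy choice lists are assumptions made purely for illustration; they are not the paper's data, metrics, or code. Individual-level accuracy is compared with a uniform random-guess baseline, and aggregated alignment is measured here with total variation distance between choice shares, which is one possible measure among several.

```python
from collections import Counter


def accuracy(observed, simulated):
    """Share of choice tasks where the simulated choice equals the observed one."""
    if len(observed) != len(simulated):
        raise ValueError("observed and simulated must have the same length")
    return sum(o == s for o, s in zip(observed, simulated)) / len(observed)


def choice_shares(choices, alternatives):
    """Fraction of tasks in which each alternative was chosen."""
    counts = Counter(choices)
    return {a: counts[a] / len(choices) for a in alternatives}


def total_variation(p, q):
    """Total variation distance between two share dictionaries with the same keys."""
    return 0.5 * sum(abs(p[a] - q[a]) for a in p)


# Hypothetical toy data: six choice tasks with three alternatives each.
alternatives = ["A", "B", "C"]
observed = ["A", "B", "A", "C", "A", "B"]   # what respondents chose (made up)
simulated = ["A", "B", "B", "C", "A", "A"]  # what a model predicted (made up)

acc = accuracy(observed, simulated)
random_baseline = 1 / len(alternatives)  # uniform guessing over three options
tv = total_variation(
    choice_shares(observed, alternatives),
    choice_shares(simulated, alternatives),
)

print(f"individual accuracy: {acc:.2f} (random baseline {random_baseline:.2f})")
print(f"aggregate share distance: {tv:.2f}")
```

In this toy data the simulated choice shares match the observed shares exactly even though only four of the six individual choices agree, which is why the two levels are worth reporting separately.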

Abstract (original)

Survey research plays a crucial role in studies by capturing consumer preferences and informing policy decisions. Stated preference (SP) surveys help researchers understand how individuals make trade-offs in hypothetical, potentially futuristic, scenarios. However, traditional methods are costly, time-consuming, and affected by respondent fatigue and ethical constraints. Large language models (LLMs) have shown remarkable capabilities in generating human-like responses, prompting interest in their use in survey research. This study investigates LLMs for simulating consumer choices in energy-related SP surveys and explores their integration into data collection and analysis workflows. Test scenarios were designed to assess the simulation performance of several LLMs (LLaMA 3.1, Mistral, GPT-3.5, DeepSeek-R1) at individual and aggregated levels, considering prompt design, in-context learning (ICL), chain-of-thought (CoT) reasoning, model types, integration with traditional choice models, and potential biases. While LLMs achieve accuracy above random guessing, performance remains insufficient for practical simulation use. Cloud-based LLMs do not consistently outperform smaller local models. DeepSeek-R1 achieves the highest average accuracy (77%) and outperforms non-reasoning LLMs in accuracy, factor identification, and choice distribution alignment. Previous SP choices are the most effective input; longer prompts with more factors reduce accuracy. Mixed logit models can support LLM prompt refinement. Reasoning LLMs show potential in data analysis by indicating factor significance, offering a qualitative complement to statistical models. Despite limitations, pre-trained LLMs offer scalability and require minimal historical data. Future work should refine prompts, further explore CoT reasoning, and investigate fine-tuning techniques.
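
As a rough illustration of using previous SP choices as in-context examples, the sketch below assembles a text prompt from a respondent's earlier choice tasks and a new task. Everything in it is hypothetical: the attribute names (`monthly_cost`, `renewable_share`), the prompt wording, and the helper functions are assumptions, not the prompts used in the study. No model is called; the sketch only shows how such a prompt might be put together.

```python
def format_task(task):
    """Render the alternatives of one choice task, one option per line."""
    lines = []
    for label, attrs in task["alternatives"].items():
        desc = ", ".join(f"{name}: {value}" for name, value in attrs.items())
        lines.append(f"Option {label}: {desc}")
    return "\n".join(lines)


def build_prompt(previous_tasks, new_task):
    """Build a prompt that shows earlier choices before asking about a new task."""
    parts = ["You are answering a stated preference survey about energy choices."]
    for i, task in enumerate(previous_tasks, start=1):
        parts.append(
            f"Previous task {i}:\n{format_task(task)}\nYour choice: {task['choice']}"
        )
    parts.append(
        f"New task:\n{format_task(new_task)}\nAnswer with the option label only."
    )
    return "\n\n".join(parts)


# Hypothetical tasks; attribute names and values are invented for illustration.
previous_tasks = [
    {
        "alternatives": {
            "A": {"monthly_cost": 40, "renewable_share": "20%"},
            "B": {"monthly_cost": 55, "renewable_share": "80%"},
        },
        "choice": "B",
    },
]
new_task = {
    "alternatives": {
        "A": {"monthly_cost": 45, "renewable_share": "30%"},
        "B": {"monthly_cost": 50, "renewable_share": "60%"},
    },
}

prompt = build_prompt(previous_tasks, new_task)
print(prompt)
```

Keeping each task compact matters under this design, since the abstract reports that longer prompts with more factors reduce accuracy.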

arXiv information

Authors	Han Wang, Jacek Pawlak, Aruna Sivakumar
Published	2025-05-13 19:38:19+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
