The First Large-Scale Benchmark for Simulating Human Behavior with LLMs
We combine 20 datasets in a unified format covering diverse domains:
A train will kill 5 people on the track. You can flip a switch to divert the train to a side track where it will kill just 2 people.
What do you do?
Each dataset contains multiple-choice questions.
We test the ability of LLMs to simulate group-level response distributions by comparing model predictions against human ground truth.
We measure how closely LLM predictions match human response distributions.
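This page does not say which metric is used to compare the two distributions. As a minimal sketch, assuming total variation distance as the measure of closeness, the comparison for a single question could look like the following. The function name and the response shares are hypothetical and chosen only for illustration; they are not SimBench data.

```python
from typing import Sequence


def total_variation(p: Sequence[float], q: Sequence[float]) -> float:
    """Total variation distance between two distributions over the same answer options."""
    if len(p) != len(q):
        raise ValueError("distributions must cover the same answer options")
    for dist in (p, q):
        if any(x < 0 for x in dist) or abs(sum(dist) - 1.0) > 1e-6:
            raise ValueError("each distribution must be non-negative and sum to 1")
    # Half the L1 distance: 0 means identical, 1 means completely disjoint.
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Hypothetical shares for a two-option question (flip the switch / do nothing).
human = [0.80, 0.20]  # assumed human ground truth
model = [0.65, 0.35]  # assumed model prediction
distance = total_variation(human, model)
print(f"total variation distance: {distance:.2f}")
```

A lower distance means the predicted distribution is closer to the human one. How per-question distances are turned into the benchmark's 0-100 score is not described here, so this sketch stops at the raw distance.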
Large language model (LLM) simulations of human behavior have the potential to revolutionize the social and behavioral sciences, but only if they faithfully reflect real human behaviors. Current evaluations are fragmented, based on bespoke tasks and metrics, creating a patchwork of incomparable results.
To address this, we introduce SimBench, the first large-scale, standardized benchmark for a robust, reproducible science of LLM simulation. By unifying 20 diverse datasets covering tasks from moral decision-making to economic choice across a large global participant pool, SimBench provides the necessary foundation to ask fundamental questions about when, how, and why LLM simulations succeed or fail.
Even the best LLMs today have limited simulation ability, with a top score of 40.80/100, though performance scales log-linearly with model size.
Simulation performance is not improved by increased inference-time compute or Chain-of-Thought prompting.
Instruction-tuning improves performance on low-entropy (consensus) questions but degrades it on high-entropy (diverse) ones.
Models particularly struggle when simulating specific demographic groups, especially those defined by religion and ideology.
Simulation ability correlates most strongly with deep, knowledge-intensive reasoning (MMLU-Pro, r=0.939).
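The low-entropy versus high-entropy distinction in the findings above can be made concrete with a small calculation. The sketch below assumes Shannon entropy normalized by the number of answer options; the page does not state the exact definition used, and the two example distributions are hypothetical.

```python
import math
from typing import Sequence


def normalized_entropy(p: Sequence[float]) -> float:
    """Shannon entropy of p divided by log(k), so the result lies in [0, 1]."""
    k = len(p)
    if k < 2:
        raise ValueError("need at least two answer options")
    # Options with zero probability contribute nothing to the entropy.
    entropy = -sum(x * math.log(x) for x in p if x > 0)
    return entropy / math.log(k)


# Hypothetical three-option questions.
consensus = [0.94, 0.03, 0.03]  # most respondents agree: low entropy
diverse = [0.34, 0.33, 0.33]    # responses are spread out: high entropy

print(f"consensus question: {normalized_entropy(consensus):.3f}")
print(f"diverse question:   {normalized_entropy(diverse):.3f}")
```

Values near 0 correspond to consensus questions and values near 1 to questions where human responses are spread almost evenly across the options.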
SimBench includes participants from 130+ countries across six continents, enabling evaluation of diverse cultural contexts.
SimBenchPop measures the ability of LLMs to simulate the responses of broad and diverse human populations. Each test case uses the dataset-specific default grouping (e.g., "US-based Amazon Mechanical Turk workers").
SimBenchGrouped measures the ability of LLMs to simulate responses from specific participant groups defined by demographics (e.g., age, gender, ideology). It uses large-scale survey datasets with rich sociodemographic information.
Access SimBench on Hugging Face to download the benchmark splits.
Get the evaluation scripts and tools from our GitHub repository.
git clone https://github.com/pitehu/SimBench_release.git
Use the provided scripts to evaluate your LLM's simulation capabilities.
python generate_answers.py --input_file SimBenchPop.pkl --output_file results.pkl --model_name your-model
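The structure of the `.pkl` files is not documented on this page. As a hedged sketch, assuming they are standard Python pickles, a small helper like the hypothetical `describe_pickle` below can report what a file contains before or after a run. Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code. If the file holds objects from a third-party library (for example a pandas DataFrame), that library must be installed for loading to succeed.

```python
import pickle
from pathlib import Path
from typing import Any


def describe_pickle(path: str) -> str:
    """Load a pickle file and summarize the type and size of its top-level object."""
    with Path(path).open("rb") as handle:
        obj: Any = pickle.load(handle)
    # Not every object has a length, so fall back to a placeholder.
    size = len(obj) if hasattr(obj, "__len__") else "unknown"
    return f"{type(obj).__name__} with {size} top-level entries"


# Example usage once the file exists locally:
#   print(describe_pickle("results.pkl"))
```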
Upload your results to our interactive explorer to visualize and analyze performance.
@article{hu2025simbench,
  title={SimBench: Benchmarking the Ability of Large Language Models to Simulate Human Behaviors},
  author={Hu, Tiancheng and others},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2025}
}
For questions, feedback, or collaboration opportunities, please contact: