SimBench: Benchmarking LLM Human Behavior Simulation

20 Datasets

We combine 20 datasets in a unified format covering diverse domains:

ChaosNLI MoralMachine Choices13k AfroBarometer OpinionQA OSPsychBig5 NumberGame DICES990 WisdomOfCrowds Jester LatinoBarometro ISSP ...and more

Multiple-Choice Questions

A train will kill 5 people on the track. You can flip a switch to divert the train to a side track where it will kill just 2 people.

What do you do?

A: Flip the switch

B: Do nothing

Each dataset contains multiple-choice questions.

Group-Level Responses

We test the ability of LLMs to simulate group-level response distributions by comparing model predictions against human ground truth.

Human Ground Truth

👨🏻 👩🏼 👨🏽

A: 60%B: 40%

→

LLM Prediction

A: 55%B: 45%

We measure how closely LLM predictions match human response distributions

Overview

Large language model (LLM) simulations of human behavior have the potential to revolutionize the social and behavioral sciences, if and only if they faithfully reflect real human behaviors. Current evaluations are fragmented, based on bespoke tasks and metrics, creating a patchwork of incomparable results.

To address this, we introduce SimBench, the first large-scale, standardized benchmark for a robust, reproducible science of LLM simulation. By unifying 20 diverse datasets covering tasks from moral decision-making to economic choice across a large global participant pool, SimBench provides the necessary foundation to ask fundamental questions about when, how, and why LLM simulations succeed or fail.

Key Findings

Limited Current Performance

Even the best LLMs today have limited simulation ability, with a top score of 40.80/100, though performance scales log-linearly with model size.

No Benefit from Inference-Time Compute

Simulation performance is not improved by increased inference-time compute or Chain-of-Thought prompting.

Alignment-Simulation Trade-off

Instruction-tuning improves performance on low-entropy (consensus) questions but degrades it on high-entropy (diverse) ones.

Demographic Disparities

Models particularly struggle when simulating specific demographic groups, especially those defined by religion and ideology.

Strong Correlation with Knowledge

Simulation ability correlates most strongly with deep, knowledge-intensive reasoning (MMLU-Pro, r=0.939).

Global Diversity

SimBench includes participants from 130+ countries across six continents, enabling evaluation of diverse cultural contexts.

Benchmark Splits

SimBenchPop

7,167 test cases 20 datasets

Measures the ability of LLMs to simulate responses of broad and diverse human populations. Each test case uses the dataset-specific default grouping (e.g., "US-based Amazon Mechanical Turk workers").

SimBenchGrouped

6,343 test cases 5 survey datasets

Measures the ability of LLMs to simulate responses from specific participant groups defined by demographics (e.g., age, gender, ideology). Uses large-scale survey datasets with rich sociodemographic information.

Dataset Properties

Task Diversity

Decision-Making: Hypothetical action selection (e.g., MoralMachine, Choices13k)
Self-Assessment: Personal attribute evaluation (e.g., OpinionQA, OSPsychBig5)
Judgment: External object labeling (e.g., ChaosNLI, Jester)
Problem-Solving: Factual question answering (e.g., WisdomOfCrowds, OSPsychMGKT)

Participant Diversity

130+ countries across six continents
3 datasets exclusively from non-US regions
4 datasets with multi-country samples
8 datasets using representative sampling

Get Started

Download the Data

Access SimBench on Hugging Face to download the benchmark splits.

Download Dataset

Clone the Repository

Get the evaluation scripts and tools from our GitHub repository.

git clone https://github.com/pitehu/SimBench_release.git

Run Evaluation

Use the provided scripts to evaluate your LLM's simulation capabilities.

python generate_answers.py --input_file SimBenchPop.pkl --output_file results.pkl --model_name your-model

Explore Results

Upload your results to our interactive explorer to visualize and analyze performance.

Open Explorer

Citation

@article{hu2025simbench,
  title={SimBench: Benchmarking the Ability of Large Language Models to Simulate Human Behaviors},
  author={Hu, Tiancheng and others},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2025}
}

Contact Information

For questions, feedback, or collaboration opportunities, please contact:

Tiancheng Hu
tiancheng.hu