Overall simulation ability, measured by SimBench score averaged across the two main splits.
Rank | Model | Type | Release | Score (S ↑) |
---|---|---|---|---|
Note: Reasoning models are highlighted in italics.
Baseline: Models below the dotted line perform worse than a uniform baseline.
Score Range: SimBench scores range from -∞ to 100, with higher scores indicating better simulation ability.
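The notes above imply a score normalized against a uniform baseline: 100 for a perfect match, 0 at the baseline, and negative when worse than it. This section does not give the exact formula, so the following is only a minimal sketch under an assumed form, in which the error is the total variation distance between the human and predicted response distributions. The function names `total_variation` and `normalized_score` are hypothetical and chosen for illustration.

```python
def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def normalized_score(human, model):
    """Baseline-normalized score (ASSUMED form, not confirmed by the source).

    100 -> model matches the human distribution exactly
      0 -> model is as far from the human distribution as a uniform guess
     <0 -> model is worse than the uniform guess
    """
    k = len(human)
    uniform = [1.0 / k] * k
    baseline_error = total_variation(human, uniform)
    if baseline_error == 0:
        # A uniform human distribution leaves nothing to normalize by.
        raise ValueError("human distribution is uniform; score is undefined")
    model_error = total_variation(human, model)
    return 100.0 * (1.0 - model_error / baseline_error)


human = [0.7, 0.2, 0.1]
perfect = normalized_score(human, human)              # matches humans exactly
at_baseline = normalized_score(human, [1/3, 1/3, 1/3])  # same as uniform guess
reversed_pred = normalized_score(human, [0.1, 0.2, 0.7])  # worse than uniform
```

Under this assumed form, the score has no lower bound because the denominator shrinks toward zero as the human distribution approaches uniform, which is consistent with the stated range of -∞ to 100.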