AIME
Competition mathematics problems
Difficulty: Advanced | Optimizer: PromptGradOptimizer
American Invitational Mathematics Examination problems — competition-level math where the answer is always an integer from 0 to 999. Problems require multi-step algebraic, combinatorial, or geometric reasoning. This is one of the hardest benchmarks for language models and shows large gains from prompt optimization.
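Because every AIME answer is an integer from 0 to 999, scoring reduces to exact match on a single integer. A minimal sketch of such a scorer is shown below; `score_aime` is a hypothetical helper for illustration, not part of VizPy's API.

```python
import re

def score_aime(model_output: str, gold: int) -> bool:
    """Exact-match AIME scoring: the gold answer is an integer from 0 to 999.

    Hypothetical helper, not VizPy's actual scorer. Takes the last
    standalone 1-3 digit integer in the output as the final answer.
    """
    matches = re.findall(r"\b\d{1,3}\b", model_output)
    if not matches:
        return False  # no parsable answer counts as incorrect
    return int(matches[-1]) == gold
```

Taking the last integer, rather than the first, tolerates intermediate results appearing earlier in a chain-of-thought trace.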
What the Optimizer Learns
AIME failures cluster around two patterns: flawed reasoning steps, and correct reasoning undone by a final arithmetic slip. The optimizer identifies which error type dominates and typically adds instructions to verify the final computation, show intermediate results explicitly, and format the answer as a bare integer to prevent parse failures. The benchmark shows some of VizPy's largest performance gains precisely because the baseline model's chain-of-thought is inconsistently structured.
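The instruction additions described above can be sketched as a prompt suffix. The wording here is illustrative of the pattern, not VizPy's actual optimized prompt, and `build_prompt` is a hypothetical helper.

```python
# Hypothetical suffix of the kind a prompt optimizer might append after
# observing arithmetic-slip and parse-failure errors on AIME.
VERIFY_SUFFIX = (
    "Show each intermediate result on its own line. "
    "Before giving the final answer, recompute the last arithmetic "
    "step and check it against the previous line. "
    "On the final line, output only the answer as a bare integer "
    "from 0 to 999, with no other text."
)

def build_prompt(problem: str) -> str:
    """Attach the verification/formatting instructions to a problem."""
    return f"{problem}\n\n{VERIFY_SUFFIX}"
```

Each clause targets one observed failure mode: explicit intermediate results expose wrong reasoning steps, the recompute instruction catches arithmetic slips, and the bare-integer constraint prevents parse failures.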