GSM8K
Grade school math word problems
Difficulty: Intermediate | Optimizer: PromptGradOptimizer
Multi-step arithmetic word problems where the model must produce a final integer answer that can be parsed from its output. Errors fall into two categories: wrong reasoning (an arithmetic mistake) and parsing failure (correct reasoning, but the answer is not in the expected output format).
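The two error categories can be told apart with a rough heuristic. The sketch below is an illustration only, not part of the optimizer: the names `strict_parse`, `last_integer`, and `classify` are hypothetical, and it assumes the expected format is a bare integer on the last non-empty line. An output whose last mentioned integer matches the reference but which fails the strict parse is counted as a parsing failure; anything else is treated as wrong reasoning.

```python
import re


def strict_parse(output: str):
    """Return the answer if the last non-empty line is a bare integer, else None."""
    lines = [ln.strip() for ln in output.splitlines() if ln.strip()]
    if lines and re.fullmatch(r"-?\d+", lines[-1]):
        return int(lines[-1])
    return None


def last_integer(output: str):
    """Return the last integer mentioned anywhere in the text, else None.

    Thousands separators are dropped first so "1,000" reads as 1000.
    """
    matches = re.findall(r"-?\d+", output.replace(",", ""))
    return int(matches[-1]) if matches else None


def classify(output: str, gold: int) -> str:
    """Heuristically label one model output against the reference answer."""
    if strict_parse(output) == gold:
        return "correct"
    if last_integer(output) == gold:
        # The right number is present but not in the expected format.
        return "parsing_failure"
    return "wrong_reasoning"


if __name__ == "__main__":
    print(classify("16 - 3 - 4 = 9\n9 * 2 = 18\n18", 18))
    print(classify("She makes 9 * 2 = 18 dollars every day.", 18))
    print(classify("16 - 3 = 13\n13 * 2 = 26\n26", 18))
```

The heuristic is deliberately simple: it can mislabel outputs whose final sentence mentions another number after the answer, or answers written with decimals, so it is best used for a quick breakdown of failures rather than for scoring.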
Full Example
What the Optimizer Learns
The optimizer sees failures where the model reasons correctly but buries the answer in a sentence. It typically refines the instruction to enforce a specific output format, often by adding a rule such as "end with the final number only, no units or explanation".
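To make the effect of such a rule concrete, the sketch below pairs a hypothetical seed instruction with a hypothetical refined one and measures how many outputs obey the format. The instruction strings, `follows_format`, and `compliance_rate` are illustrative assumptions, not actual optimizer output or library API; the check assumes "final number only" means the last non-empty line is a bare integer.

```python
import re

# Hypothetical instructions, for illustration only.
SEED_INSTRUCTION = "Solve the math word problem."
REFINED_INSTRUCTION = (
    SEED_INSTRUCTION
    + " End with the final number only, no units or explanation."
)


def follows_format(output: str) -> bool:
    """True if the last non-empty line is a bare integer (no units, no words)."""
    lines = [ln.strip() for ln in output.splitlines() if ln.strip()]
    return bool(lines) and re.fullmatch(r"-?\d+", lines[-1]) is not None


def compliance_rate(outputs) -> float:
    """Fraction of outputs that end with a bare integer; 0.0 for an empty list."""
    if not outputs:
        return 0.0
    return sum(follows_format(o) for o in outputs) / len(outputs)


if __name__ == "__main__":
    sample_outputs = [
        "9 * 2 = 18\n18",             # follows the rule
        "The answer is 18 dollars.",  # answer buried in a sentence
        "42",                         # follows the rule
    ]
    print(f"{compliance_rate(sample_outputs):.2f}")
```

Comparing this rate for outputs generated under the seed and the refined instruction gives a quick signal of whether a format rule is reducing parsing failures, independently of whether the arithmetic is right.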