Vizpy

AIME

Competition mathematics problems

Difficulty: Advanced | Optimizer: PromptGradOptimizer

American Invitational Mathematics Examination (AIME) problems are competition-level math questions whose answer is always an integer from 0 to 999. They require multi-step algebraic, combinatorial, or geometric reasoning. AIME is one of the hardest benchmarks for language models, and it shows large gains from prompt optimization.


Full Example

import re
import dspy
import vizpy
 
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))
 
 
class SolveAIME(dspy.Signature):
    """Solve this AIME math competition problem. Think step-by-step and provide the final integer answer (0-999)."""
 
    problem = dspy.InputField()
    answer = dspy.OutputField(desc="Final integer answer (0-999)")
 
 
module = dspy.ChainOfThought(SolveAIME)
 
 
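def parse_aime_answer(answer_str: str):
    """Extract an integer answer from model output, or return None if none is found."""
    text = str(answer_str)
    # 1) Prefer an explicit "### <number>" marker.
    match = re.search(r"###\s*(\d+)", text)
    if match:
        return int(match.group(1))
    # 2) Otherwise take the first "answer is N", "equals N", or "= N" phrase.
    match = re.search(r"(?:answer is|equals?|=)\s*(\d+)", text, re.IGNORECASE)
    if match:
        return int(match.group(1))
    # 3) Fall back to the last standalone 1-3 digit number (always within 0-999).
    numbers = re.findall(r"\b(\d{1,3})\b", text)
    if numbers:
        return int(numbers[-1])
    return None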
 
 
def metric(example, prediction):
    gold = example["gold_answer"]
    # Coerce to str so the slice in the parse-error feedback cannot fail on None or int.
    pred_str = str(getattr(prediction, "answer", prediction))
    pred = parse_aime_answer(pred_str)
 
    if pred is None:
        return vizpy.Score(
            value=0.0,
            is_success=False,
            feedback=f"Could not parse answer from: {pred_str[:100]}",
            error_type="parse_error",
        )
 
    correct = pred == gold
    return vizpy.Score(
        value=1.0 if correct else 0.0,
        is_success=correct,
        feedback=f"Predicted {pred}, expected {gold}",
    )
 
 
train_examples = [
    {
        "problem": "Find the remainder when 2^2007 is divided by 13.",
        "gold_answer": 8,
    },
    {
        "problem": "How many positive integers less than 1000 are divisible by neither 3 nor 7?",
        "gold_answer": 571,
    },
    {
        "problem": "In triangle ABC, AB = 13, BC = 14, and CA = 15. Find the length of the altitude from A to BC.",
        "gold_answer": 12,
    },
    {
        "problem": "Find the number of ordered pairs (a, b) of positive integers such that a + b = 100 and gcd(a, b) = 5.",
        "gold_answer": 8,
    },
    {
        "problem": "Let S be the sum of all positive integers n such that n^2 + 12n - 2007 is a perfect square. Find the remainder when S is divided by 1000.",
        "gold_answer": 464,
    },
]
 
val_examples = [
    {
        "problem": "How many three-digit positive integers have the property that the middle digit equals the sum of the first and last digits?",
        "gold_answer": 45,
    },
    {
        "problem": "Find the number of positive integers n <= 100 such that n and n+1 are both squarefree.",
        "gold_answer": 33,
    },
]
 
 
optimizer = vizpy.PromptGradOptimizer(
    metric=metric,
    config=vizpy.PromptGradConfig.dev(),
)
 
optimized = optimizer.optimize(
    module=module,
    train_examples=train_examples,
    val_examples=val_examples,
)
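
Because the metric compares predictions against gold_answer by exact match, a mislabeled example penalizes correct predictions. For problems small enough to enumerate, the labels can be checked by brute force before optimizing. The following is a minimal sketch using only the Python standard library; the helper names is_squarefree and brute_force_answers are hypothetical and not part of VizPy or DSPy, and only the enumerable problems from the lists above are covered.

```python
from math import gcd


def is_squarefree(n: int) -> bool:
    """Return True if no square d*d with d >= 2 divides n."""
    d = 2
    while d * d <= n:
        if n % (d * d) == 0:
            return False
        d += 1
    return True


def brute_force_answers() -> dict:
    """Recompute answers for the enumerable example problems by direct search."""
    return {
        # Remainder when 2^2007 is divided by 13 (three-argument pow is modular).
        "2^2007 mod 13": pow(2, 2007, 13),
        # Positive integers below 1000 divisible by neither 3 nor 7.
        "neither 3 nor 7": sum(1 for n in range(1, 1000) if n % 3 and n % 7),
        # Ordered pairs (a, b) of positive integers with a + b = 100, gcd(a, b) = 5.
        "gcd pairs": sum(1 for a in range(1, 100) if gcd(a, 100 - a) == 5),
        # Three-digit integers whose middle digit equals first digit + last digit.
        "middle digit": sum(
            1 for n in range(100, 1000) if (n // 10) % 10 == n // 100 + n % 10
        ),
        # n <= 100 such that n and n + 1 are both squarefree.
        "squarefree pairs": sum(
            1 for n in range(1, 101) if is_squarefree(n) and is_squarefree(n + 1)
        ),
    }


if __name__ == "__main__":
    for name, value in brute_force_answers().items():
        print(f"{name}: {value}")
```

Comparing the printed values with the gold_answer fields is a quick sanity check. It does not replace a worked solution for problems that cannot be enumerated.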

What the Optimizer Learns

AIME failures cluster around two patterns: incorrect reasoning steps, and correct reasoning followed by a final arithmetic slip. The optimizer identifies which error type dominates and typically adds instructions to verify the final computation, show intermediate results explicitly, and format the answer as a bare integer so that the metric's parser does not fail. The benchmark shows some of VizPy's largest performance gains precisely because the baseline model's chain-of-thought is inconsistently structured.
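
To illustrate why a bare-integer answer matters, the sketch below runs a standalone, behaviorally equivalent copy of the example's parse_aime_answer function against a few model outputs. The sample strings are invented for illustration and are not taken from real model runs.

```python
import re


def parse_aime_answer(answer_str):
    """Standalone copy of the example's parser, repeated so this sketch runs alone."""
    text = str(answer_str)
    match = re.search(r"###\s*(\d+)", text)
    if match:
        return int(match.group(1))
    match = re.search(r"(?:answer is|equals?|=)\s*(\d+)", text, re.IGNORECASE)
    if match:
        return int(match.group(1))
    numbers = re.findall(r"\b(\d{1,3})\b", text)
    if numbers:
        return int(numbers[-1])
    return None


# Hypothetical model outputs; the intended answer in every case is 204.
samples = [
    "204",                          # bare integer
    "### 204",                      # explicit marker
    "The answer is 204.",           # answer phrase
    "x = 12, so the total is 204",  # an intermediate equation comes first
    "204 (from step 3)",            # trailing step number
    "two hundred four",             # no digits at all
]

if __name__ == "__main__":
    for text in samples:
        print(f"{text!r} -> {parse_aime_answer(text)}")
```

The first three samples parse as 204. The fourth parses as 12 because the first "= N" match wins, the fifth parses as 3 because the fallback takes the last short number, and the sixth returns None, which the metric reports as a parse_error. A bare integer avoids all three failure modes.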
