GSM8K

Difficulty: Intermediate | Optimizer: PromptGradOptimizer

Multi-step arithmetic word problems in which the model must produce a final integer answer. Failures fall into two categories: reasoning errors (an arithmetic or logic mistake) and parsing failures (correct reasoning, but the answer is not in a format the metric can extract).
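To make the two failure categories concrete, here is a minimal sketch of a heuristic that separates them. The helper names `extract_integers` and `classify_failure` and the label strings are hypothetical and not part of vizpy or dspy. The sketch assumes the parser reads the last integer in the answer text, and it treats "the gold value appears somewhere else in the text" as a sign of a formatting problem, which can misfire when the gold value happens to match an intermediate result.

```python
import re


def extract_integers(text: str) -> list[int]:
    """Return every integer-like token in the text, in order of appearance."""
    return [int(tok.replace(",", "")) for tok in re.findall(r"-?\d[\d,]*", text)]


def classify_failure(answer_text: str, gold: int) -> str:
    """Hypothetical helper: label an answer as correct, a parsing failure, or wrong reasoning."""
    numbers = extract_integers(answer_text)
    if not numbers:
        # Nothing to extract at all: treated here as a formatting problem.
        return "parsing_failure"
    if numbers[-1] == gold:
        return "correct"
    if gold in numbers:
        # The right value is present, but not where a last-number parser looks.
        return "parsing_failure"
    return "wrong_reasoning"


print(classify_failure("18", 18))
print(classify_failure("She makes $18 per day by selling 9 eggs.", 18))
print(classify_failure("The answer is 20", 18))
```

Labels like these could be used to report parsing failures separately from reasoning errors when inspecting results by hand.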

Full Example

import re
import dspy
import vizpy
 
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))
 
 
class SolveGSM8K(dspy.Signature):
    """Solve the math word problem. Show your work step by step. End with the final integer answer."""
 
    question = dspy.InputField()
    answer = dspy.OutputField(desc="Final integer answer only")
 
 
module = dspy.ChainOfThought(SolveGSM8K)
 
 
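def parse_integer(answer_str: str):
    """Extract the final integer from a model answer, or None if there is none."""
    text = str(answer_str)
    # Prefer the GSM8K "#### <answer>" convention when it is present.
    match = re.search(r"####\s*(-?\d[\d,]*)", text)
    if match:
        return int(match.group(1).replace(",", ""))
    # Otherwise fall back to the last integer-like token in the text.
    numbers = re.findall(r"-?\d[\d,]*", text)
    if numbers:
        return int(numbers[-1].replace(",", ""))
    return None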
 
 
def metric(example, prediction):
    gold = example["gold_answer"]
    pred = parse_integer(getattr(prediction, "answer", ""))
 
    if pred is not None and pred == gold:
        return vizpy.Score(value=1.0, is_success=True, feedback=f"Correct: {gold}")
    return vizpy.Score(
        value=0.0,
        is_success=False,
        feedback=f"Expected {gold}, got {pred}",
        error_type="wrong_answer",
    )
 
 
train_examples = [
    {
        "question": "Janet's ducks lay 16 eggs per day. She eats three for breakfast and bakes muffins with four. She sells the remainder at $2 per egg. How much does she make daily?",
        "gold_answer": 18,
    },
    {
        "question": "A robe takes 2 bolts of blue fiber and half that much white fiber. How many bolts in total?",
        "gold_answer": 3,
    },
    {
        "question": "Josh buys a house for $80,000 and puts $50,000 in repairs, increasing its value by 150%. How much profit did he make?",
        "gold_answer": 70000,
    },
    {
        "question": "Kylar wants to buy 16 glasses. One glass costs $5, but every second glass costs 60% of the price. How much does he pay?",
        "gold_answer": 64,
    },
    {
        "question": "Toulouse has twice as many sheep as Charleston. Charleston has 4 times as many as Seattle. Seattle has 20 sheep. How many do all three have together?",
        "gold_answer": 260,
    },
]
 
val_examples = [
    {
        "question": "Eliza earns $10/hour for the first 40 hours and 1.2x for overtime. She worked 45 hours. What are her total earnings?",
        "gold_answer": 460,
    },
    {
        "question": "A program had 60 downloads in month 1, three times as many in month 2, then 30% fewer in month 3. What is the total?",
        "gold_answer": 366,
    },
    {
        "question": "Carlos plants a lemon tree for $90. It grows 7 lemons/year at $1.50 each, costing $3/year to maintain. How many years will it take before he starts earning money on the tree?",
        "gold_answer": 13,
    },
]
 
 
optimizer = vizpy.PromptGradOptimizer(
    metric=metric,
    config=vizpy.PromptGradConfig.dev(),
)
 
optimized = optimizer.optimize(
    module=module,
    train_examples=train_examples,
    val_examples=val_examples,
)

What the Optimizer Learns

The optimizer sees failures where the model reasons correctly but buries the answer in a sentence, which can cause the metric to extract the wrong number. It typically refines the instruction to enforce a specific output format, often adding a rule like "end with the final number only, no units or explanation".
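The effect of a buried answer can be seen by running the parser on its own. The sketch below repeats the `parse_integer` logic from the full example so it runs standalone; the two sample answers are made up for illustration and are not model outputs.

```python
import re


def parse_integer(answer_str: str):
    # Same logic as parse_integer in the full example.
    match = re.search(r"####\s*(-?\d[\d,]*)", str(answer_str))
    if match:
        return int(match.group(1).replace(",", ""))
    numbers = re.findall(r"-?\d[\d,]*", str(answer_str))
    if numbers:
        return int(numbers[-1].replace(",", ""))
    return None


# Correct reasoning, but the answer is buried mid-sentence.
buried = "Janet makes $18 every day after using 7 of her 16 eggs."
# The format the optimized instruction tends to enforce.
clean = "18"

print(parse_integer(buried))  # 16: the fallback picks the last number in the text
print(parse_integer(clean))   # 18
```

Because the fallback takes the last number in the text, a correct answer followed by any other number is scored as wrong, which is why a "final number only" rule tends to help on this task.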
