ARC-Challenge

Science multiple-choice questions

Difficulty: Beginner | Optimizer: ContraPromptOptimizer

Grade-school science questions with four labelled choices (A–D). The model must output a single letter. The main failure mode is the model emitting explanation text in the answer field instead of a bare letter, which breaks strict answer parsing.
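A forgiving fix is to extract the first standalone A–D letter with a regular expression rather than comparing the raw answer string. A minimal sketch (the full example below uses the same helper):

```python
import re

def extract_letter(text: str):
    # Word boundaries (\b) match a lone letter but skip letters inside words,
    # so "The answer is C" yields "C" instead of tripping on the A in "answer".
    match = re.search(r"\b([A-D])\b", str(text).upper())
    return match.group(1) if match else None

print(extract_letter("C"))                       # C
print(extract_letter("The answer is C) Solar"))  # C
print(extract_letter("none of the above"))       # None
```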


Full Example

import re
import dspy
import vizpy
 
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))
 
 
class SolveARC(dspy.Signature):
    """Answer the science question by choosing the correct option."""
 
    question = dspy.InputField()
    choices = dspy.InputField(desc="Answer choices formatted as A) ... B) ... C) ... D) ...")
    answer = dspy.OutputField(desc="The correct answer letter (A, B, C, or D)")
 
 
module = dspy.ChainOfThought(SolveARC)
 
 
def extract_letter(text: str):
    # Pull the first standalone A-D letter; word boundaries skip letters inside words.
    match = re.search(r"\b([A-D])\b", str(text).upper())
    return match.group(1) if match else None
 
 
def metric(example, prediction):
    gold = example["gold_answer"].upper()
    pred = extract_letter(getattr(prediction, "answer", ""))
 
    if pred and pred == gold:
        return vizpy.Score(value=1.0, is_success=True, feedback=f"Correct: {gold}")
    return vizpy.Score(
        value=0.0,
        is_success=False,
        feedback=f"Expected {gold}, got {pred}",
        error_type="wrong_answer",
    )
 
 
train_examples = [
    {
        "question": "Which of the following is a renewable energy source?",
        "choices": "A) Coal  B) Natural gas  C) Solar power  D) Petroleum",
        "gold_answer": "C",
    },
    {
        "question": "What is the chemical formula for water?",
        "choices": "A) CO2  B) H2O  C) NaCl  D) O2",
        "gold_answer": "B",
    },
    {
        "question": "Which planet is closest to the Sun?",
        "choices": "A) Venus  B) Earth  C) Mercury  D) Mars",
        "gold_answer": "C",
    },
    {
        "question": "Which layer of Earth's atmosphere contains the ozone layer?",
        "choices": "A) Troposphere  B) Stratosphere  C) Mesosphere  D) Thermosphere",
        "gold_answer": "B",
    },
    {
        "question": "What is the process by which plants make their own food?",
        "choices": "A) Respiration  B) Digestion  C) Photosynthesis  D) Fermentation",
        "gold_answer": "C",
    },
    {
        "question": "What force keeps planets in orbit around the Sun?",
        "choices": "A) Magnetism  B) Friction  C) Gravity  D) Electricity",
        "gold_answer": "C",
    },
]
 
val_examples = [
    {
        "question": "Which organ pumps blood through the body?",
        "choices": "A) Brain  B) Lungs  C) Heart  D) Liver",
        "gold_answer": "C",
    },
    {
        "question": "What is the largest organ in the human body?",
        "choices": "A) Heart  B) Brain  C) Liver  D) Skin",
        "gold_answer": "D",
    },
    {
        "question": "Which gas do plants absorb from the atmosphere?",
        "choices": "A) Oxygen  B) Nitrogen  C) Carbon dioxide  D) Hydrogen",
        "gold_answer": "C",
    },
    {
        "question": "Which of these animals is an invertebrate?",
        "choices": "A) Snake  B) Frog  C) Octopus  D) Eagle",
        "gold_answer": "C",
    },
]
 
 
optimizer = vizpy.ContraPromptOptimizer(
    metric=metric,
    config=vizpy.ContraPromptConfig.dev(),
)
 
optimized = optimizer.optimize(
    module=module,
    train_examples=train_examples,
    val_examples=val_examples,
)

What the Optimizer Learns

ContraPromptOptimizer works well here because the correct and incorrect answers are close in structure — the model often has the right reasoning but picks an adjacent choice. Contrastive examples make the distinction between similar choices explicit, and the optimizer learns to reinforce instructions that focus on the single best answer rather than hedging across options.
