HotPotQA

Difficulty: Intermediate | Optimizer: PromptGradOptimizer

Questions that require combining facts from multiple sentences in a provided context. A single passage never contains the full answer — the model must chain two or more inferences. Scored by token F1 rather than exact match, which makes the metric feedback more informative for the optimizer.

Full Example

import re
import string
import dspy
import vizpy
 
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))
 
 
class AnswerQuestion(dspy.Signature):
    """Answer the question based on the provided context."""
 
    question = dspy.InputField()
    context = dspy.InputField()
    answer = dspy.OutputField(desc="Short factual answer")
 
 
module = dspy.ChainOfThought(AnswerQuestion)
 
 
def normalize(s: str) -> str:
    s = s.lower()
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    s = "".join(ch for ch in s if ch not in string.punctuation)
    return " ".join(s.split()).strip()
 
 
def token_f1(pred: str, gold: str) -> float:
    pred_tokens = normalize(pred).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    common = set(pred_tokens) & set(gold_tokens)
    if not common:
        return 0.0
    precision = len(common) / len(pred_tokens)
    recall = len(common) / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
 
 
def metric(example, prediction):
    gold = example["gold_answer"]
    pred = getattr(prediction, "answer", str(prediction))
    f1 = token_f1(str(pred), str(gold))
    threshold = 0.6
 
    if f1 >= threshold:
        return vizpy.Score(value=f1, is_success=True, feedback=f"F1={f1:.2f}")
    return vizpy.Score(
        value=f1,
        is_success=False,
        feedback=f"F1={f1:.2f} too low. Expected: '{gold}', got: '{pred}'",
        error_type="low_f1",
    )
 
 
train_examples = [
    {
        "question": "What is the capital of the country where the Eiffel Tower is located?",
        "context": "The Eiffel Tower is on the Champ de Mars in Paris, France. Paris is the capital and most populous city of France.",
        "gold_answer": "Paris",
    },
    {
        "question": "Who directed the movie that won Best Picture at the 2020 Academy Awards?",
        "context": "Parasite won Best Picture at the 2020 Academy Awards. Parasite is a 2019 South Korean film directed by Bong Joon-ho.",
        "gold_answer": "Bong Joon-ho",
    },
    {
        "question": "What university did the founder of Amazon attend?",
        "context": "Amazon was founded by Jeff Bezos in 1994. Jeff Bezos graduated from Princeton University in 1986 with degrees in electrical engineering and computer science.",
        "gold_answer": "Princeton University",
    },
    {
        "question": "What is the official language of the country that hosted the 2016 Summer Olympics?",
        "context": "The 2016 Summer Olympics were held in Rio de Janeiro, Brazil. Portuguese is the official language of Brazil.",
        "gold_answer": "Portuguese",
    },
    {
        "question": "Who painted the ceiling of the building where the Pope resides?",
        "context": "The Pope resides in Vatican City. The Sistine Chapel is in Vatican City. Its ceiling was painted by Michelangelo between 1508 and 1512.",
        "gold_answer": "Michelangelo",
    },
]
 
val_examples = [
    {
        "question": "What year was the company that makes the iPhone founded?",
        "context": "The iPhone is made by Apple Inc. Apple Inc. was founded on April 1, 1976, by Steve Jobs, Steve Wozniak, and Ronald Wayne.",
        "gold_answer": "1976",
    },
    {
        "question": "What is the currency of the country where Toyota is headquartered?",
        "context": "Toyota is headquartered in Toyota City, Aichi, Japan. The official currency of Japan is the Japanese yen.",
        "gold_answer": "Japanese yen",
    },
    {
        "question": "What instrument did the composer of The Four Seasons play?",
        "context": "The Four Seasons is a group of violin concertos by Antonio Vivaldi. Vivaldi was an Italian Baroque composer and virtuoso violinist.",
        "gold_answer": "violin",
    },
]
 
 
optimizer = vizpy.PromptGradOptimizer(
    metric=metric,
    config=vizpy.PromptGradConfig.dev(),
)
 
optimized = optimizer.optimize(
    module=module,
    train_examples=train_examples,
    val_examples=val_examples,
)

What the Optimizer Learns

The partial-credit F1 score means the optimizer sees how wrong an answer is, not just whether it failed. Low-F1 failures (completely off) and near-miss failures (right entity, wrong form) produce different feedback, and the optimizer uses this signal to tighten the instruction around answer specificity and entity extraction.

HotPotQA

Full Example

What the Optimizer Learns

On this page