HotPotQA
Multi-hop question answering
Difficulty: Intermediate | Optimizer: PromptGradOptimizer
Questions that require combining facts from multiple passages in a provided context. No single passage contains the full answer; the model must chain two or more inferences. Answers are scored by token-level F1 rather than exact match, so the optimizer receives graded, partial-credit feedback instead of a binary pass/fail signal.
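For reference, token F1 is usually defined as below. This is the common SQuAD-style formulation, given here as an assumption about how the score is computed rather than a statement of this example's exact implementation. The symbols P (the multiset of predicted answer tokens) and G (the multiset of gold answer tokens) are notation introduced for this sketch.

```latex
\[
\text{precision} = \frac{|P \cap G|}{|P|}, \qquad
\text{recall} = \frac{|P \cap G|}{|G|}, \qquad
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
\]
```

Here the intersection counts each shared token at most as many times as it appears in both answers, and F1 is taken to be 0 when the two answers share no tokens. Exact match, by contrast, would score every answer as either 0 or 1.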
Full Example
What the Optimizer Learns
The partial-credit F1 score means the optimizer sees how wrong an answer is, not just whether it failed. Low-F1 failures (completely off) and near-miss failures (right entity, wrong form) produce different feedback, and the optimizer uses this signal to tighten the instruction around answer specificity and entity extraction.
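The difference between a near miss and a complete miss can be illustrated with a minimal sketch of token F1. The `normalize` and `token_f1` helpers below are hypothetical names written for this illustration, not the library's API, and the normalization (lowercasing, dropping punctuation and English articles) is an assumption borrowed from common SQuAD-style evaluation. The example answers are made up.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace.

    Assumption: SQuAD-style normalization; the real metric may differ.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall (hypothetical helper)."""
    pred_tokens = normalize(prediction)
    gold_tokens = normalize(gold)
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection: a shared token counts once per matched occurrence.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


gold = "Barack Obama"
for prediction in ["Barack Obama", "President Barack Obama", "George Bush"]:
    print(f"{prediction!r}: F1 = {token_f1(prediction, gold):.2f}")
```

Under these assumptions, the answer with an extra title keeps full recall but loses precision and scores about 0.8, while the wrong entity scores 0. Exact match would score both as failures, which is the distinction the optimizer would otherwise not see.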