BoolQ
Binary yes/no question answering
Difficulty: Beginner | Optimizer: PromptGradOptimizer
Given a passage and a question, the model must answer true or false. The failure
mode is subtle: borderline phrasing causes the model to hedge or output unparseable
text instead of a clean boolean.
Full Example
What the Optimizer Learns
The metric uses typed error_type values (false_negative, false_positive, and
unparseable), which lets the optimizer distinguish between wrong answers and
formatting failures. It tends to add an instruction that enforces clean true/false
output and clarifies how to handle hedged phrasing in the passage.
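To make the three error types concrete, here is a minimal sketch of how such a metric might classify a single prediction. It is an illustration under stated assumptions, not the library's actual interface: the function names parse_bool and boolq_metric, the returned dict shape, and the choice to accept "yes"/"no" alongside "true"/"false" are all hypothetical.

```python
from typing import Optional

# Assumption: "yes"/"no" are accepted as synonyms for "true"/"false".
_TRUE = {"true", "yes"}
_FALSE = {"false", "no"}


def parse_bool(text: str) -> Optional[bool]:
    """Return True/False for a clean boolean answer, or None if the text is not one."""
    # Drop surrounding whitespace, then surrounding punctuation and quotes.
    token = text.strip().strip(".!\"'").lower()
    if token in _TRUE:
        return True
    if token in _FALSE:
        return False
    return None


def boolq_metric(prediction: str, gold: bool) -> dict:
    """Score one prediction and attach a typed error_type (hypothetical result shape)."""
    parsed = parse_bool(prediction)
    if parsed is None:
        # Hedged or free-form output: a formatting failure, not a wrong answer.
        return {"score": 0.0, "error_type": "unparseable"}
    if parsed == gold:
        return {"score": 1.0, "error_type": None}
    # The answer parsed cleanly but disagrees with the label.
    error_type = "false_positive" if parsed else "false_negative"
    return {"score": 0.0, "error_type": error_type}
```

Keeping unparseable separate from the two wrong-answer types matters because the fixes differ: a formatting failure calls for a stricter output instruction, while false positives and false negatives point at how the model reads the passage.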