Vizpy

Big-Bench Hard

27 challenging reasoning subtasks

Difficulty: Advanced | Optimizer: PromptGradOptimizer

BBH (Big-Bench Hard) is a collection of 27 reasoning subtasks, including boolean logic, object tracking, word sorting, and date arithmetic. Each example includes a task_description field, so the same signature handles every subtask. The optimizer must learn instructions that generalize across task types rather than overfitting to one.
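
As a rough sketch of that shared example shape, the snippet below flattens records from several subtasks into dicts that each carry a task_description. The `RAW_SUBTASKS` layout, the `input`/`target` key names, and the `flatten_subtasks` helper are assumptions made for illustration; they are not part of Vizpy or DSPy, so adapt them to however your BBH data is stored.

```python
# Hypothetical raw data: one description plus a list of records per subtask.
# The "input"/"target" key names are an assumption about the storage format.
RAW_SUBTASKS = {
    "boolean_expressions": {
        "description": "Evaluate a boolean expression with AND, OR, NOT operators.",
        "records": [{"input": "not ( True and False ) is", "target": "True"}],
    },
    "word_sorting": {
        "description": "Sort a list of words alphabetically.",
        "records": [
            {
                "input": "Sort the following words alphabetically: List: pear fig kiwi",
                "target": "fig kiwi pear",
            }
        ],
    },
}


def flatten_subtasks(raw):
    """Merge per-subtask records into one list with a shared dict shape."""
    examples = []
    for subtask in raw.values():
        for record in subtask["records"]:
            examples.append(
                {
                    # The description travels with every example, so one
                    # signature can serve all subtasks.
                    "task_description": subtask["description"],
                    "question": record["input"],
                    "answer": record["target"],
                }
            )
    return examples


examples = flatten_subtasks(RAW_SUBTASKS)
```

Mixing subtasks in one list, rather than optimizing each separately, is what pushes the learned instructions toward rules that hold across task types.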


Full Example

import dspy
import vizpy
 
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))
 
 
class SolveBBH(dspy.Signature):
    """Solve the reasoning task. Think step by step, then give the final answer."""
 
    task_description = dspy.InputField(desc="Description of the reasoning task type")
    question = dspy.InputField()
    answer = dspy.OutputField(desc="The final answer only")
 
 
module = dspy.ChainOfThought(SolveBBH)
 
 
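def normalize(s: str) -> str:
    """Lowercase, drop common answer prefixes, and trim wrapping punctuation."""
    s = s.strip().lower()
    for prefix in ["the answer is ", "answer: ", "final answer: "]:
        if s.startswith(prefix):
            s = s[len(prefix):]
    # Strip quotes, brackets, spaces, and periods from both ends,
    # then collapse runs of inner whitespace to single spaces.
    return " ".join(s.strip("\"'()[] .").split())


def metric(example, prediction):
    gold = normalize(example["answer"])
    pred = normalize(str(getattr(prediction, "answer", "")))
    # Accept an exact match, or the gold answer contained in the prediction.
    # The length guard keeps short golds such as "no" or "7" from matching
    # inside longer text.
    is_correct = (pred == gold) or (gold in pred and len(gold) > 2)

    return vizpy.Score(
        value=1.0 if is_correct else 0.0,
        is_success=is_correct,
        # The feedback string is what the optimizer sees for failed examples.
        feedback=f"Expected: {gold}, Got: {pred}",
        error_type="incorrect_answer" if not is_correct else None,
    )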
 
 
train_examples = [
    {
        "task_description": "Evaluate a boolean expression with AND, OR, NOT operators.",
        "question": "not ( ( not not True ) ) is",
        "answer": "False",
    },
    {
        "task_description": "Sort a list of words alphabetically.",
        "question": "Sort the following words alphabetically: List: banana apple cherry",
        "answer": "apple banana cherry",
    },
    {
        "task_description": "Count the total number of objects described in the question.",
        "question": "I have a chair, two tables, a lamp, and three books. How many objects do I have?",
        "answer": "7",
    },
    {
        "task_description": "Determine the final position after a series of navigation instructions.",
        "question": "If you follow these instructions, do you return to the starting point? Turn left. Take 3 steps. Turn right. Take 3 steps. Turn right. Take 3 steps. Turn left. Take 3 steps.",
        "answer": "No",
    },
    {
        "task_description": "Solve a multi-step arithmetic expression.",
        "question": "((-3 + 5) * (2 - 8)) =",
        "answer": "-12",
    },
]
 
val_examples = [
    {
        "task_description": "Evaluate a boolean expression with AND, OR, NOT operators.",
        "question": "not not ( True and False ) is",
        "answer": "False",
    },
    {
        "task_description": "Sort a list of words alphabetically.",
        "question": "Sort the following words alphabetically: List: dog cat bird ant",
        "answer": "ant bird cat dog",
    },
    {
        "task_description": "Count the total number of objects described in the question.",
        "question": "I have two cats, three dogs, and four fish. How many pets do I have?",
        "answer": "9",
    },
    {
        "task_description": "Solve a multi-step arithmetic expression.",
        "question": "((8 - 3) * (4 + 1)) =",
        "answer": "25",
    },
]
 
 
optimizer = vizpy.PromptGradOptimizer(
    metric=metric,
    config=vizpy.PromptGradConfig.dev(),
)
 
optimized = optimizer.optimize(
    module=module,
    train_examples=train_examples,
    val_examples=val_examples,
)
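
Before running the optimizer, it can be worth checking what the metric's matching rule accepts, since that needs no model calls. The sketch below copies normalize from the example above and wraps the comparison in an `is_match` helper; `is_match` is a hypothetical name introduced here, not a Vizpy function.

```python
def normalize(s: str) -> str:
    # Same logic as normalize in the full example.
    s = s.strip().lower()
    for prefix in ["the answer is ", "answer: ", "final answer: "]:
        if s.startswith(prefix):
            s = s[len(prefix):]
    return " ".join(s.strip("\"'()[] .").split())


def is_match(gold_raw: str, pred_raw: str) -> bool:
    # Hypothetical helper: the same comparison the metric performs.
    gold = normalize(gold_raw)
    pred = normalize(pred_raw)
    return (pred == gold) or (gold in pred and len(gold) > 2)


checks = {
    "wrapped option": is_match("(A)", "The answer is (A)."),
    "extra whitespace": is_match("apple banana cherry", "Answer: apple  banana cherry"),
    "short gold with extra text": is_match("No", "No, you do not."),
    "substring of a wrong number": is_match("-12", "-123"),
}

for name, result in checks.items():
    print(f"{name}: {result}")
```

The first two cases pass because normalization removes the prefix, brackets, and extra spaces. The third fails: golds of two characters or fewer must match exactly, which is why a learned rule about answering with no surrounding text pays off. The fourth passes even though -123 is wrong, so the substring fallback is lenient for numeric answers; tightening it (for example, comparing whole tokens) may be worth considering if that matters for your subtasks.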

What the Optimizer Learns

Because BBH is so diverse, errors stem from different failure modes in different subtasks. The optimizer accumulates rules that address the most common cross-task patterns. Typically these are: give the answer in the exact expected format (no surrounding text), sort and count literally rather than approximating, and apply logical operators strictly.
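
Since failure modes differ by subtask, one way to see where a module is weakest is to break correctness down by task_description. The following is a minimal sketch: `accuracy_by_task` is a hypothetical helper that takes plain (example, is_correct) pairs, so it assumes nothing about Vizpy's result objects, and the `results` list holds placeholder outcomes rather than measured ones.

```python
from collections import defaultdict


def accuracy_by_task(results):
    """Return {task_description: accuracy} from (example, is_correct) pairs."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for example, is_correct in results:
        task = example["task_description"]
        totals[task] += 1
        hits[task] += int(is_correct)
    return {task: hits[task] / totals[task] for task in totals}


# Placeholder outcomes for illustration only; not real evaluation results.
results = [
    ({"task_description": "Sort a list of words alphabetically."}, True),
    ({"task_description": "Sort a list of words alphabetically."}, False),
    ({"task_description": "Solve a multi-step arithmetic expression."}, True),
]

report = accuracy_by_task(results)

# List the weakest subtasks first.
for task, acc in sorted(report.items(), key=lambda kv: kv[1]):
    print(f"{acc:.2f}  {task}")
```

Comparing this breakdown before and after optimization gives a quick check on whether the learned rules helped across subtasks or only on one.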
