# Examples

Runnable examples for VizPy prompt optimizers — from real product problems to benchmarks.
These examples are designed around one question: when does the optimizer actually matter?
Each real-world example has a specific, non-obvious failure mode — the kind where you'd spend a day rewriting your prompt and still not fix it, because the issue isn't word choice; it's that the model has the wrong mental model of the task. The optimizer finds and articulates that mental model for you.
## Real-World Use Cases
### Email Urgency Classification

GPT over-triggers on urgency words — 'blocking' and 'ASAP' both become CRITICAL. The optimizer learns that workflow blockers ≠ production outages.

### Recipe Difficulty Rating

The model rates by ingredient count. A 5-ingredient beef Wellington shouldn't be Easy. The optimizer learns to read technique, not just ingredient lists.

### Meeting Action Items

'Someone should fix that' is not an action item. 'I'll handle it by Friday' is. The optimizer learns the linguistic signals of genuine commitment.

### Commit Log → Changelog

'Fixed null check in session middleware' is not a changelog entry. The optimizer learns to translate technical cause into user impact.
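To make the shape of these tasks concrete, here is a minimal sketch of how the email-urgency example might be wired up: a handful of labeled examples plus a metric the optimizer can score candidate prompts against. The data and the name `urgency_metric` are illustrative assumptions, not part of the actual VizPy API.

```python
# Hypothetical data + metric for the email-urgency task.
# Both emails below contain urgency words; only one is a real outage --
# exactly the bias the optimizer is meant to surface and correct.

EXAMPLES = [
    {"email": "Prod API is down, customers can't log in. Need eyes ASAP!",
     "label": "CRITICAL"},
    {"email": "Your review is blocking my branch, ping me when you're free.",
     "label": "NORMAL"},
]

def urgency_metric(predicted: str, expected: str) -> float:
    """Exact-match accuracy on the urgency label: 1.0 or 0.0."""
    return 1.0 if predicted.strip().upper() == expected.strip().upper() else 0.0
```

An exact-match metric like this is enough for the optimizer to contrast the misclassified workflow-blocker emails against the correctly flagged outages.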
## Benchmarks

Standard research benchmarks — useful for measuring optimizer performance with ground-truth labels and comparing across runs.
- **BoolQ**: Binary yes/no question answering. A good starting point.
- **GSM8K**: Grade-school math word problems. Tests multi-step reasoning.
- **BBH**: 23 challenging reasoning subtasks from BIG-Bench Hard.
- **HotpotQA**: Multi-hop QA requiring reasoning over multiple documents.
- **ARC-Challenge**: Science multiple-choice questions from the AI2 Reasoning Challenge.
- **AIME**: Competition mathematics. The hardest benchmark in this set.
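Because these benchmarks ship ground-truth labels, their metrics are usually exact match after some normalization of the model's free-form output. The two sketches below (illustrative names, not VizPy functions) show the typical normalization for BoolQ-style and GSM8K-style answers:

```python
import re

def boolq_metric(predicted: str, expected: str) -> float:
    """Normalize free-form output to yes/no, then exact-match."""
    def norm(s: str) -> str:
        return "yes" if re.search(r"\b(yes|true)\b", s.lower()) else "no"
    return 1.0 if norm(predicted) == norm(expected) else 0.0

def gsm8k_metric(predicted: str, expected: str) -> float:
    """Compare the last number in the model's output to the gold answer."""
    nums = re.findall(r"-?\d+(?:\.\d+)?", predicted.replace(",", ""))
    return 1.0 if nums and float(nums[-1]) == float(expected) else 0.0
```

The same pattern — normalize, then exact-match — extends to ARC-Challenge (letter choices) and AIME (integer answers).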
## Which Optimizer for Which Task?
| Task type | Recommended optimizer | Why |
|---|---|---|
| Classification with a systematic bias | ContraPromptOptimizer | Contrastive mining finds what separates correct from incorrect |
| Open-ended generation quality | PromptGradOptimizer | Batch gradient analysis handles rubric-based metrics better |
| Extraction with subtle ownership/attribution | PromptGradOptimizer | Accumulates rules across many failure examples |
| Translation between registers (technical→user) | ContraPromptOptimizer | Clear contrastive pairs exist between good and bad output |
Both optimizers accept the same interface — you can swap them without changing your metric or examples.
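A structural protocol is one way to picture that shared interface. The sketch below is an assumption about its shape — the class names, the `optimize` method, and its parameters are placeholders, so check the VizPy docs for the real signature:

```python
from typing import Callable, Protocol, Sequence

class PromptOptimizer(Protocol):
    """Hypothetical shared shape of both optimizers (not the real API)."""
    def optimize(
        self,
        prompt: str,
        examples: Sequence[dict],
        metric: Callable[[str, str], float],
    ) -> str: ...

def run(optimizer: PromptOptimizer, prompt: str,
        examples: Sequence[dict],
        metric: Callable[[str, str], float]) -> str:
    # Swapping one optimizer for the other is a one-line change at the
    # call site; the metric and examples are untouched.
    return optimizer.optimize(prompt, examples, metric)

class EchoOptimizer:
    """Stand-in used only to show the protocol is purely structural."""
    def optimize(self, prompt, examples, metric):
        return prompt + " (optimized)"
```

Under this shape, choosing an optimizer from the table above is the only decision that changes between runs; everything else in your harness stays fixed.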