# Examples

Runnable examples for VizPy prompt optimizers — from real product problems to benchmarks.
These examples are designed around one question: when does the optimizer actually matter?
Each real-world example has a specific, non-obvious failure mode — the kind where you'd spend a day rewriting your prompt and still not fix it, because the issue isn't word choice; it's that the model has the wrong mental model of the task. The optimizer finds and articulates that mental model for you.
## Real-World Use Cases
### Email Urgency Classification

GPT over-triggers on urgency words — 'blocking' and 'ASAP' both become CRITICAL. The optimizer learns that workflow blockers ≠ production outages.

### Recipe Difficulty Rating

The model rates by ingredient count. A 5-ingredient beef Wellington shouldn't be Easy. The optimizer learns to read technique, not just ingredient lists.

### Meeting Action Items

'Someone should fix that' is not an action item. 'I'll handle it by Friday' is. The optimizer learns the linguistic signals of genuine commitment.

### Commit Log → Changelog

'Fixed null check in session middleware' is not a changelog entry. The optimizer learns to translate technical cause into user impact.
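To make the shape of these tasks concrete, here is a minimal sketch of how the email-urgency example might be wired up: a handful of labeled examples plus a metric the optimizer can score candidate prompts against. The data and the name `urgency_metric` are illustrative assumptions, not part of the actual VizPy API.

```python
# Hypothetical data + metric for the email-urgency task.
# Both emails below contain urgency words; only one is a real outage --
# exactly the bias the optimizer is meant to surface and correct.

EXAMPLES = [
    {"email": "Prod API is down, customers can't log in. Need eyes ASAP!",
     "label": "CRITICAL"},
    {"email": "Your review is blocking my branch, ping me when you're free.",
     "label": "NORMAL"},
]

def urgency_metric(predicted: str, expected: str) -> float:
    """Exact-match accuracy on the urgency label: 1.0 or 0.0."""
    return 1.0 if predicted.strip().upper() == expected.strip().upper() else 0.0
```

An exact-match metric like this is enough for the optimizer to contrast the misclassified workflow-blocker emails against the correctly flagged outages.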
## Benchmarks

Standard research benchmarks — useful for measuring optimizer performance with ground-truth labels and comparing across runs.
- **BoolQ**: Binary yes/no question answering. A good starting point.
- **GSM8K**: Grade-school math word problems. Tests multi-step reasoning.
- **BBH**: 23 challenging reasoning subtasks from BIG-Bench Hard.
- **HotpotQA**: Multi-hop QA requiring reasoning over multiple documents.
- **ARC-Challenge**: Science multiple-choice questions from the AI2 Reasoning Challenge.
- **AIME**: Competition mathematics. The hardest benchmark in this set.
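Because these benchmarks ship ground-truth labels, their metrics are usually exact match after some normalization of the model's free-form output. The two sketches below (illustrative names, not VizPy functions) show the typical normalization for BoolQ-style and GSM8K-style answers:

```python
import re

def boolq_metric(predicted: str, expected: str) -> float:
    """Normalize free-form output to yes/no, then exact-match."""
    def norm(s: str) -> str:
        return "yes" if re.search(r"\b(yes|true)\b", s.lower()) else "no"
    return 1.0 if norm(predicted) == norm(expected) else 0.0

def gsm8k_metric(predicted: str, expected: str) -> float:
    """Compare the last number in the model's output to the gold answer."""
    nums = re.findall(r"-?\d+(?:\.\d+)?", predicted.replace(",", ""))
    return 1.0 if nums and float(nums[-1]) == float(expected) else 0.0
```

The same pattern — normalize, then exact-match — extends to ARC-Challenge (letter choices) and AIME (integer answers).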
## Which Optimizer for Which Task?
| Task type | Recommended optimizer | Why |
|---|---|---|
| Classification with a systematic bias | ContraPromptOptimizer | Contrastive mining finds what separates correct from incorrect |
| Open-ended generation quality | PromptGradOptimizer | Batch gradient analysis handles rubric-based metrics better |
| Extraction with subtle ownership/attribution | PromptGradOptimizer | Accumulates rules across many failure examples |
| Translation between registers (technical→user) | ContraPromptOptimizer | Clear contrastive pairs exist between good and bad output |
Both optimizers accept the same interface — you can swap them without changing your metric or examples.
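A structural protocol is one way to picture that shared interface. The sketch below is an assumption about its shape — the class names, the `optimize` method, and its parameters are placeholders, so check the VizPy docs for the real signature:

```python
from typing import Callable, Protocol, Sequence

class PromptOptimizer(Protocol):
    """Hypothetical shared shape of both optimizers (not the real API)."""
    def optimize(
        self,
        prompt: str,
        examples: Sequence[dict],
        metric: Callable[[str, str], float],
    ) -> str: ...

def run(optimizer: PromptOptimizer, prompt: str,
        examples: Sequence[dict],
        metric: Callable[[str, str], float]) -> str:
    # Swapping one optimizer for the other is a one-line change at the
    # call site; the metric and examples are untouched.
    return optimizer.optimize(prompt, examples, metric)

class EchoOptimizer:
    """Stand-in used only to show the protocol is purely structural."""
    def optimize(self, prompt, examples, metric):
        return prompt + " (optimized)"
```

Under this shape, choosing an optimizer from the table above is the only decision that changes between runs; everything else in your harness stays fixed.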