Email Urgency Classification
Fix a systematic model bias in one optimize() call
GPT-4o-mini has a surface-level bias: any email containing "blocking", "urgent", or "ASAP" gets classified as CRITICAL. The word triggers the label, regardless of what the email is actually about.
The distinction the model needs to learn: CRITICAL means customers are affected right now — production outages, data loss, security incidents. HIGH means one team's velocity is blocked — CI failures, staging issues, sprint blockers. The urgency language is the same. The impact radius is completely different.
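To make the boundary concrete, here is a small set of hypothetical labelled examples (the emails are illustrative, not taken from the actual dataset). Both urgent-sounding emails use the same language; only the impact radius differs:

```python
# Hypothetical labelled examples illustrating the CRITICAL vs HIGH boundary.
# Note that both urgent emails contain trigger words like "ASAP"/"blocking".
examples = [
    {
        "email": "Payment API is down, customers cannot check out. Need eyes ASAP.",
        "label": "CRITICAL",  # customer-facing outage: customers affected right now
    },
    {
        "email": "Staging credentials expired and CI is blocking the sprint. ASAP please.",
        "label": "HIGH",      # internal velocity blocked, no customer impact
    },
    {
        "email": "Team lunch moved to Thursday.",
        "label": "LOW",
    },
]
```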
This example demonstrates PromptGradOptimizer learning that boundary from labelled examples and targeted metric feedback.
Optimizer: PromptGradOptimizer
Difficulty: Intermediate
The Failure Mode
The model sees "ASAP", "blocking", "failing" and pattern-matches to CRITICAL. It doesn't reason about who is affected or what breaks if this isn't resolved in the next hour.
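The failure mode behaves like a keyword lookup. A minimal sketch (a deliberately naive stand-in for the unoptimized model, not how GPT-4o-mini is actually implemented):

```python
# A naive classifier that reproduces the surface-level bias:
# an urgency keyword alone triggers CRITICAL, impact is never considered.
URGENCY_WORDS = {"blocking", "urgent", "asap"}

def keyword_classifier(email: str) -> str:
    words = {w.strip(".,!?") for w in email.lower().split()}
    if words & URGENCY_WORDS:
        return "CRITICAL"  # the word triggers the label
    return "LOW"

# Misclassified: nothing customer-facing is broken, but "blocking" drives the label.
print(keyword_classifier("Staging deploy is blocking our sprint"))  # CRITICAL
```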
Full Example
What the Optimizer Learns
PromptGradOptimizer runs in epochs. Each epoch:
- Samples a batch of training examples, runs them through the current module
- Collects failures and their metric feedback
- Asks: "what single instruction change would have prevented most of these failures?"
- Validates the proposed rule on held-out examples — discards it if it introduces new errors
- Accumulates validated rules into the module's instructions
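The epoch loop above can be sketched as follows. This is a self-contained toy, not PromptGradOptimizer's real API: `run_module`, `metric`, and `propose_rule` are stand-ins for the module call, the user's metric, and the optimizer's rule-proposing LLM call.

```python
import random

def run_module(instructions, email):
    # Toy "model": keyword-biased until a validated rule mentions impact radius.
    if "impact radius" in instructions:
        if "customers" in email:
            return "CRITICAL"
        return "HIGH" if "blocking" in email else "LOW"
    urgent = any(w in email for w in ("blocking", "urgent", "ASAP"))
    return "CRITICAL" if urgent else "LOW"

def metric(example, prediction):
    if prediction == example["label"]:
        return 1.0, ""
    return 0.0, "CRITICAL = customer-facing impact; HIGH = internal velocity blocked."

def propose_rule(failures):
    # Stand-in for the optimizer's LLM call: turn metric feedback into a rule.
    return "The impact radius determines the level. " + failures[0][2]

def optimize(instructions, trainset, valset, epochs=3, batch_size=4):
    def val_score(instr):
        return sum(metric(ex, run_module(instr, ex["email"]))[0]
                   for ex in valset) / len(valset)

    for _ in range(epochs):
        # 1. Sample a batch and run it through the current module
        batch = random.sample(trainset, min(batch_size, len(trainset)))
        # 2. Collect failures and their metric feedback
        failures = []
        for ex in batch:
            pred = run_module(instructions, ex["email"])
            score, feedback = metric(ex, pred)
            if score < 1.0:
                failures.append((ex, pred, feedback))
        if not failures:
            continue
        # 3. Propose a single instruction change that addresses the failures
        candidate = instructions + "\n" + propose_rule(failures)
        # 4. Validate on held-out examples; discard rules that regress
        if val_score(candidate) >= val_score(instructions):
            instructions = candidate  # 5. Accumulate the validated rule
    return instructions
```

Because each candidate rule must match or beat the current instructions on the held-out set before it is kept, the accumulated instructions can only improve or stay flat across epochs.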
For this task, the accumulated rule ends up being something like:
"CRITICAL = customer-facing impact (outage, data loss, security breach). HIGH = internal team velocity blocked. Apply this distinction even when the email uses urgent language — the impact radius determines the level, not the tone."
Results
| | Val Score |
|---|---|
| Baseline (GPT-4o-mini, no optimization) | 0.67 |
| After PromptGradOptimizer | 1.00 |
The staging credentials email that was misclassified as CRITICAL routes correctly to HIGH. The payment system outage stays CRITICAL. The team lunch stays LOW.
Why the Feedback Wording Matters
The feedback in this metric is deliberately phrased as a rule, not an observation. The optimizer uses this text when generating candidate rules, and feedback that states the boundary explicitly tends to produce more precise, transferable instructions.
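A hedged sketch of what rule-phrased feedback looks like in practice (the actual metric from the example is not reproduced here; the function name and message wording are illustrative):

```python
# Illustrative metric: on failure, return feedback that states the decision
# boundary as a rule, rather than an observation about this one email.
def urgency_metric(example, prediction):
    if prediction == example["label"]:
        return 1.0, None
    feedback = (
        f"Predicted {prediction}, expected {example['label']}. "
        "Rule: CRITICAL requires customer-facing impact (outage, data loss, "
        "security breach); HIGH means an internal team's velocity is blocked. "
        "Urgent wording alone never justifies CRITICAL."
    )
    return 0.0, feedback
```

An observation-style message ("this staging email is not CRITICAL") only describes one failure; the rule-style message above gives the optimizer a boundary it can lift directly into the instructions.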