Email Urgency Classification

Fix a systematic model bias in one optimize() call

GPT-4o-mini has a surface-level bias: any email containing "blocking", "urgent", or "ASAP" gets classified as CRITICAL. The word triggers the label, regardless of what the email is actually about.

The distinction the model needs to learn: CRITICAL means customers are affected right now — production outages, data loss, security incidents. HIGH means one team's velocity is blocked — CI failures, staging issues, sprint blockers. The urgency language is the same. The impact radius is completely different.

This example demonstrates PromptGradOptimizer learning that boundary from labelled examples and targeted metric feedback.

Optimizer: PromptGradOptimizer
Difficulty: Intermediate


The Failure Mode

test_email = "Blocking our sprint. API tests failing, can't merge. Need help ASAP."
 
baseline = module(email=test_email)
print(baseline.urgency)  # CRITICAL — wrong. This is HIGH.

The model sees "ASAP", "blocking", "failing" and pattern-matches to CRITICAL. It doesn't reason about who is affected or what breaks if this isn't resolved in the next hour.


Full Example

import dspy
import vizpy
 
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))
 
 
class EmailUrgency(dspy.Signature):
    """Classify the urgency level of an email.
    Return ONLY the urgency level."""
 
    email: str = dspy.InputField(desc="Email body text")
    urgency: str = dspy.OutputField(desc="One of: CRITICAL, HIGH, MEDIUM, LOW")
 
 
class EmailClassifier(dspy.Module):
    def __init__(self):
        super().__init__()
        self.predict = dspy.ChainOfThought(EmailUrgency)
 
    def forward(self, email: str):
        return self.predict(email=email)
 
 
module = EmailClassifier()
 
 
# Labels are stored separately — the optimizer never sees them directly
_train_data = [
    ("ALERT: Production database is down. All customer transactions "
     "failing. Revenue impact $50K/hour.", "CRITICAL"),
    ("SECURITY INCIDENT: Unauthorized access on admin panel. "
     "Credentials compromised.", "CRITICAL"),
    ("Data corruption in backup pipeline. Lost 48 hours of "
     "customer records.", "CRITICAL"),
    ("Blocking our sprint. API tests failing, can't merge. "
     "Need help ASAP.", "HIGH"),
    ("Client demo tomorrow, login page 500 on staging. "
     "Blocking sales team.", "HIGH"),
    ("Urgent: Deploy pipeline broken. No one can ship code. "
     "All teams blocked.", "HIGH"),
    ("Can we schedule a meeting for Q3 roadmap?", "MEDIUM"),
    ("Design mockups ready for review. Take a look when you "
     "get a chance.", "MEDIUM"),
    ("Typo on settings page. Not urgent but should be fixed.", "MEDIUM"),
    ("FYI: Updated office Wi-Fi password.", "LOW"),
    ("Recording from last week's all-hands. No action needed.", "LOW"),
    ("Happy birthday Sarah! Cake at 3pm.", "LOW"),
]
 
_val_data = [
    ("URGENT blocking us. Staging credentials expired, QA can't "
     "run tests. Rotate keys ASAP.", "HIGH"),
    ("CRITICAL: Payment system returning errors. Customers can't "
     "buy. P0.", "CRITICAL"),
    ("Heads up, team lunch Friday moved to Italian place.", "LOW"),
]
 
_expected = {}
 
train_examples = []
for email, urgency in _train_data:
    train_examples.append({"email": email})
    _expected[email] = urgency
 
val_examples = []
for email, urgency in _val_data:
    val_examples.append({"email": email})
    _expected[email] = urgency
 
 
def metric(example, prediction):
    expected = _expected[example["email"]]
    actual = prediction.urgency.strip().upper()
    is_correct = expected == actual
 
    feedback = ""
    if not is_correct:
        if expected == "HIGH" and actual == "CRITICAL":
            feedback = (
                "CRITICAL is ONLY for outages, data loss, or security breaches affecting customers. "
                "A CI failure or staging issue that blocks one team = HIGH, regardless of urgency language."
            )
        elif expected == "CRITICAL" and actual == "HIGH":
            feedback = (
                "If customers cannot transact, data is lost, or credentials are compromised, "
                "that is always CRITICAL — even if the email tone is calm."
            )
        else:
            feedback = f"Expected {expected}, got {actual}."
 
    return vizpy.Score(
        value=1.0 if is_correct else 0.0,
        is_success=is_correct,
        feedback=feedback,
    )
 
 
# Baseline — before optimization
test_email = (
    "URGENT blocking us. Staging credentials expired, QA can't "
    "run tests. Rotate keys ASAP."
)
baseline = module(email=test_email)
print(f"Before: {baseline.urgency}")  # CRITICAL (wrong)
 
 
# Optimize
config = vizpy.PromptGradConfig.dev()
optimizer = vizpy.PromptGradOptimizer(metric=metric, config=config)
optimized = optimizer.optimize(
    module=module,
    train_examples=train_examples,
    val_examples=val_examples,
)
 
 
# After optimization
result = optimized(email=test_email)
print(f"After: {result.urgency}")  # HIGH (correct)

What the Optimizer Learns

PromptGradOptimizer runs in epochs. Each epoch:

  1. Samples a batch of training examples, runs them through the current module
  2. Collects failures and their metric feedback
  3. Asks: "what single instruction change would have prevented most of these failures?"
  4. Validates the proposed rule on held-out examples — discards it if it introduces new errors
  5. Accumulates validated rules into the module's instructions
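The epoch loop above can be sketched in plain Python. This is an illustrative outline only, not the real Vizpy implementation: `evaluate` and `propose_rule` are hypothetical stand-ins for the optimizer's internals, and scores are plain dicts mirroring the `value`/`is_success` fields of `vizpy.Score`.

```python
# Hypothetical sketch of one PromptGradOptimizer epoch.
# evaluate(instructions, example) -> {"value": float, "is_success": bool}
# propose_rule(failures) -> str    (both are illustrative, not Vizpy API)
def run_epoch(instructions, batch, val_set, evaluate, propose_rule):
    # Steps 1-2: run the batch with current instructions; keep failures.
    scored = [(ex, evaluate(instructions, ex)) for ex in batch]
    failures = [(ex, s) for ex, s in scored if not s["is_success"]]
    if not failures:
        return instructions  # nothing to learn this epoch

    # Step 3: propose one instruction change targeting these failures.
    rule = propose_rule(failures)
    candidate = instructions + [rule]

    # Step 4: validate on held-out examples; discard the rule if it regresses.
    old_score = sum(evaluate(instructions, ex)["value"] for ex in val_set)
    new_score = sum(evaluate(candidate, ex)["value"] for ex in val_set)

    # Step 5: accumulate only validated rules.
    return candidate if new_score >= old_score else instructions
```

The key design point is step 4: a rule only survives if it does not introduce new errors on held-out data, which is why accumulated instructions tend to be monotonic improvements rather than oscillations.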

For this task, the accumulated rule ends up being something like:

"CRITICAL = customer-facing impact (outage, data loss, security breach). HIGH = internal team velocity blocked. Apply this distinction even when the email uses urgent language — the impact radius determines the level, not the tone."


Results

                                          Val Score
Baseline (GPT-4o-mini, no optimization)   0.67
After PromptGradOptimizer                 1.00

The staging credentials email that was misclassified as CRITICAL now routes correctly to HIGH. The payment system outage stays CRITICAL. The team lunch stays LOW.


Why the Feedback Wording Matters

The feedback in this metric is deliberately phrased as a rule, not an observation:

# Good — states the rule directly
feedback = (
    "CRITICAL is ONLY for outages, data loss, or security breaches affecting customers. "
    "A CI failure or staging issue that blocks one team = HIGH, regardless of urgency language."
)
 
# Less effective — describes the error without explaining the distinction
feedback = "This should be HIGH, not CRITICAL."

The optimizer uses this text when generating candidate rules. Feedback that states the boundary explicitly tends to produce more precise, transferable instructions.
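To see why the wording carries through, it helps to picture how failure feedback might be folded into the optimizer's rule-proposal step. The sketch below is illustrative only: `build_rule_prompt` is a hypothetical helper, not part of the Vizpy API.

```python
# Illustrative only: one way per-failure feedback could be assembled
# into a rule-proposal prompt. Not the real Vizpy implementation.
def build_rule_prompt(failures):
    lines = [
        "These predictions failed. Propose ONE instruction change "
        "that would have prevented most of these failures:"
    ]
    for email, predicted, feedback in failures:
        lines.append(f"- Email: {email}")
        lines.append(f"  Predicted: {predicted}")
        lines.append(f"  Feedback: {feedback}")
    return "\n".join(lines)
```

If the feedback already states the boundary ("CI failure that blocks one team = HIGH"), the rule-proposing model mostly has to restate it; if the feedback only says "wrong label", the model has to reverse-engineer the distinction from examples, which is far less reliable.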
