Migration from GEPA

GEPA is DSPy's evolutionary optimizer for prompt instructions. It generates multiple instruction variants, evaluates them on your training set, uses an LLM to reflect on failures, and selects the best-performing variants — repeating this across several generations.
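
The generate, evaluate, select loop described above can be sketched in a few lines of plain Python. This is a deliberately simplified illustration, not GEPA's implementation: it keeps only a single best candidate, and `toy_propose`, `toy_score`, `HINTS`, and `evolve_instructions` are hypothetical stand-ins for LLM reflection and training-set evaluation.

```python
import random

# Hypothetical pool of edits. In GEPA, an LLM proposes edits by reflecting on failures.
HINTS = [
    "Write for end users.",
    "Avoid implementation terms.",
    "Keep it to one sentence.",
]


def toy_propose(instruction: str, rng: random.Random) -> str:
    """Stand-in for LLM reflection: append one randomly chosen hint."""
    return f"{instruction} {rng.choice(HINTS)}"


def toy_score(instruction: str) -> float:
    """Stand-in for training-set evaluation: fraction of hints present."""
    return sum(hint in instruction for hint in HINTS) / len(HINTS)


def evolve_instructions(seed, score, propose, generations=3, population=4, rng_seed=0):
    """Toy evolutionary search over instruction strings."""
    rng = random.Random(rng_seed)
    best, best_score = seed, score(seed)
    for _ in range(generations):
        # Generate several variants of the current best instruction.
        candidates = [propose(best, rng) for _ in range(population)]
        # Evaluate each variant and keep it only if it beats the incumbent.
        for candidate in candidates:
            candidate_score = score(candidate)
            if candidate_score > best_score:
                best, best_score = candidate, candidate_score
    return best, best_score


best, best_score = evolve_instructions("Summarize the commit.", toy_score, toy_propose)
print(best_score, best)
```

Selection here works on whole instruction strings: a variant either replaces the incumbent or is discarded.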

GEPA is good at broad search: finding a substantially better starting instruction when you don't know what the right prompt looks like. It explores a wide space and can make large jumps in performance early on.

Where GEPA plateaus: targeted refinement. Once it has found a reasonable instruction, the evolutionary search has no mechanism for accumulating specific rules from failure analysis. Each generation selects among whole instruction variants rather than building up an explicit set of learned rules.

Vizpy's PromptGradOptimizer is designed for exactly this stage.

Two Migration Paths

Path 1: Replace GEPA entirely

If you're not getting meaningful gains from GEPA after the first few generations, replace it with PromptGradOptimizer. It starts from your module's existing instructions and accumulates targeted improvements from batch failure analysis.
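
To make "accumulates targeted improvements from batch failure analysis" concrete, here is a minimal sketch of the idea in plain Python. It does not show how PromptGradOptimizer works internally. `Failure`, `RULE_FOR_ERROR`, and `accumulate_rules` are hypothetical names, and the fixed error-to-rule mapping stands in for whatever analysis the optimizer actually performs.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Failure:
    error_type: str
    feedback: str


# Hypothetical mapping from a recurring error type to a rule worth adding.
RULE_FOR_ERROR = {
    "jargon_leak": "Replace implementation terms with what the user sees.",
    "missing_impact": "Lead with what changed for the user.",
}


def accumulate_rules(instruction: str, failures: list, min_count: int = 2) -> str:
    """Append a rule for every error type that recurs in the batch."""
    counts = Counter(failure.error_type for failure in failures)
    new_rules = [
        RULE_FOR_ERROR[error_type]
        for error_type, count in counts.most_common()
        if count >= min_count
        and error_type in RULE_FOR_ERROR
        and RULE_FOR_ERROR[error_type] not in instruction  # skip rules already present
    ]
    if not new_rules:
        return instruction
    return instruction + "\n" + "\n".join(f"- {rule}" for rule in new_rules)


batch = [
    Failure("jargon_leak", "Leaked: middleware"),
    Failure("jargon_leak", "Leaked: webhook"),
    Failure("missing_impact", "No user impact stated"),
]
base = "Translate a commit into a user-facing changelog entry."
updated = accumulate_rules(base, batch)
print(updated)
```

The existing instruction is kept intact and only extended, so rules learned from earlier batches survive later ones.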

Path 2: Use GEPA as initialization, PromptGrad for refinement

This is the recommended path if GEPA has already found a reasonable base instruction. PromptGradOptimizer accepts base_prompt_source="gepa" — it runs GEPA internally to get a strong starting point, then applies gradient-based refinement on top.
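
Conceptually, Path 2 is a composition of two stages. The sketch below shows that shape with toy stand-ins. `two_stage` and the `toy_*` functions are hypothetical, and falling back to the base instruction when refinement scores lower is an assumption of this sketch, not documented Vizpy behavior.

```python
def two_stage(seed, broad_search, refine, score):
    """Run a broad search first, then refine its result."""
    base = broad_search(seed)
    refined = refine(base)
    # Assumption: keep the base instruction if refinement makes things worse.
    return refined if score(refined) >= score(base) else base


def toy_broad_search(instruction: str) -> str:
    """Stand-in for GEPA: a large jump to a different register."""
    return "Describe the change from the user's perspective."


def toy_refine(instruction: str) -> str:
    """Stand-in for PromptGrad: add one specific vocabulary rule."""
    return instruction + " Say 'during checkout' instead of 'session middleware'."


def toy_score(instruction: str) -> float:
    """Stand-in metric: reward the register shift and the specific rule."""
    return float("user" in instruction) + float("during checkout" in instruction)


final = two_stage("Summarize the commit.", toy_broad_search, toy_refine, toy_score)
print(final)
```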

API Changes

Before (GEPA):

optimizer = dspy.GEPA(metric=metric)
optimized = optimizer.compile(module, trainset=examples)

After (Vizpy, Path 1 — full replacement):

optimizer = vizpy.PromptGradOptimizer(metric=metric)
optimized = optimizer.optimize(module, examples)

After (Vizpy, Path 2 — GEPA base + PromptGrad refinement):

optimizer = vizpy.PromptGradOptimizer(
    metric=metric,
    base_prompt_source="gepa",
)
optimized = optimizer.optimize(module, examples)
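
The visible call-site difference is the method name and how the training data is passed: `compile(module, trainset=...)` versus `optimize(module, examples)`. During a gradual migration, a small shim can hide that difference. The sketch below uses stub classes so it runs without either library. `run_optimizer`, `LegacyStyleOptimizer`, and `NewStyleOptimizer` are hypothetical names, and only the two call shapes are taken from the snippets above.

```python
class LegacyStyleOptimizer:
    """Stub with the GEPA-style call shape: compile(module, trainset=...)."""

    def compile(self, module, trainset):
        return ("compiled", module, len(trainset))


class NewStyleOptimizer:
    """Stub with the Vizpy-style call shape: optimize(module, examples)."""

    def optimize(self, module, examples):
        return ("optimized", module, len(examples))


def run_optimizer(optimizer, module, examples):
    """Dispatch to whichever entry point the optimizer exposes."""
    if hasattr(optimizer, "optimize"):
        return optimizer.optimize(module, examples)
    return optimizer.compile(module, trainset=examples)


examples = [{"commit": "fix: a"}, {"commit": "fix: b"}]
print(run_optimizer(LegacyStyleOptimizer(), "module", examples))
print(run_optimizer(NewStyleOptimizer(), "module", examples))
```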

Full Example: Changelog Generation

This task benefits from the two-stage approach. GEPA is effective at discovering that user-facing language is needed. PromptGrad then accumulates specific rules about vocabulary substitution (e.g. "session middleware" → "during checkout").

import dspy
import vizpy
 
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))
 
 
class ChangelogEntry(dspy.Signature):
    """Translate a commit into a user-facing changelog entry."""
    commit: str = dspy.InputField()
    context: str = dspy.InputField(desc="Affected feature area")
    entry: str = dspy.OutputField(desc="One sentence from the user's perspective")
 
 
module = dspy.ChainOfThought(ChangelogEntry)
 
 
train_data = [
    {
        "commit": "fix: handle null user_id in session middleware",
        "context": "Checkout — session expiry during purchase",
        "jargon": ["null", "middleware", "session middleware", "user_id"],
        "impact": "checkout crash",
    },
    {
        "commit": "feat: add retry logic to payment webhook handler",
        "context": "Subscriptions — failed payment notifications",
        "jargon": ["retry logic", "webhook", "handler"],
        "impact": "subscription recovery",
    },
    {
        "commit": "fix: prevent race condition in concurrent file uploads",
        "context": "Document upload — simultaneous file uploads",
        "jargon": ["race condition", "concurrent"],
        "impact": "multiple file upload",
    },
    {
        "commit": "fix: sanitize HTML in user-submitted comments",
        "context": "Forums — special characters in comments",
        "jargon": ["sanitize", "HTML"],
        "impact": "comment display",
    },
]
 
 
def metric(example, prediction):
    entry = prediction.entry.lower()
    leaked = [j for j in example["jargon"] if j.lower() in entry]
    impact_words = example["impact"].lower().split()
    conveys = sum(1 for w in impact_words if w in entry) >= len(impact_words) * 0.5
 
    if not leaked and conveys:
        return vizpy.Score(value=1.0, is_success=True, feedback="")
 
    feedback_parts = []
    if leaked:
        feedback_parts.append(
            f"Technical jargon leaked through: {', '.join(leaked)}. "
            f"Rewrite without implementation terms."
        )
    if not conveys:
        feedback_parts.append(
            f"Doesn't convey the user impact ('{example['impact']}'). "
            f"Lead with what changed for the user, not what the code does."
        )
 
    return vizpy.Score(
        value=0.3 if conveys else 0.0,
        is_success=False,
        feedback=" | ".join(feedback_parts),
        error_type="jargon_leak" if leaked else "missing_impact",
    )
 
 
# Two-stage optimization: GEPA finds the register shift, PromptGrad refines vocabulary rules
optimizer = vizpy.PromptGradOptimizer(
    metric=metric,
    base_prompt_source="gepa",
)
optimized = optimizer.optimize(module, train_data)
 
result = optimized(
    commit="fix: resolve deadlock in database connection pool",
    context="High-traffic plans — intermittent request failures",
)
print(result.entry)
# Example output (actual wording will vary from run to run):
# "Fixed intermittent request failures that affected high-traffic accounts."
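
Before spending optimizer budget, it can help to check the metric against hand-written predictions with no LLM calls. The sketch below repeats the scoring logic from the example with a local `Score` dataclass standing in for `vizpy.Score`. Only the four fields used above are mirrored, the feedback strings are shortened, and `SimpleNamespace` plays the role of a prediction object.

```python
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass
class Score:
    """Local stand-in for vizpy.Score (assumed fields, mirrored from the example)."""
    value: float
    is_success: bool
    feedback: str = ""
    error_type: str = ""


def metric(example, prediction):
    entry = prediction.entry.lower()
    # Jargon terms that appear verbatim (substring match) in the entry.
    leaked = [j for j in example["jargon"] if j.lower() in entry]
    impact_words = example["impact"].lower().split()
    # At least half of the impact words must appear in the entry.
    conveys = sum(w in entry for w in impact_words) >= len(impact_words) * 0.5
    if not leaked and conveys:
        return Score(value=1.0, is_success=True)
    return Score(
        value=0.3 if conveys else 0.0,
        is_success=False,
        feedback="jargon leaked" if leaked else "impact missing",
        error_type="jargon_leak" if leaked else "missing_impact",
    )


example = {
    "jargon": ["null", "middleware", "session middleware", "user_id"],
    "impact": "checkout crash",
}
good = SimpleNamespace(entry="Fixed a crash that could occur during checkout.")
leaky = SimpleNamespace(entry="Fixed a checkout crash caused by a null value in middleware.")
vague = SimpleNamespace(entry="Improved overall stability.")

for name, prediction in [("good", good), ("leaky", leaky), ("vague", vague)]:
    print(name, metric(example, prediction))
```

Both checks use substring matching, so a jargon term embedded in a longer word also counts as a leak.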

When the Two-Stage Approach Wins

The reason to stack GEPA + PromptGrad rather than using either alone:

GEPA explores broadly and can discover that the entire register needs to shift (e.g. "write as user impact, not code description"). It makes the big jump.
PromptGrad then accumulates targeted rules from failure patterns: vocabulary substitutions, exception cases, and edge conditions that the broad GEPA instruction doesn't handle.

The result is instructions that are globally correct (GEPA) and locally precise (PromptGrad). You get the exploration benefit of evolutionary search and the precision benefit of gradient-based refinement.

From the research backing these optimizers: on HotPotQA, this two-stage architecture raised normalized performance from +80% for the GEPA baseline to +126% combined, so the PromptGrad refinement stage contributed the additional 46 points.
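
The split between the two stages can be worked out directly from those two figures. The arithmetic below assumes both numbers are gains on the same normalized scale, so that they can be subtracted. The variable names are illustrative.

```python
gepa_gain = 80.0       # normalized gain reported for the GEPA baseline, in points
combined_gain = 126.0  # normalized gain reported for GEPA + PromptGrad, in points

# Increment attributable to the refinement stage.
promptgrad_increment = combined_gain - gepa_gain

# Share of the combined gain that comes from the refinement stage.
promptgrad_share = promptgrad_increment / combined_gain

print(f"PromptGrad increment: {promptgrad_increment:.0f} points")
print(f"Share of combined gain: {promptgrad_share:.1%}")
```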

When to Use GEPA Alone vs. Full Replacement

Situation	Recommendation
GEPA shows no meaningful gains after the first few generations	Replace with `PromptGradOptimizer`
GEPA found a strong base instruction but has since plateaued	Use `base_prompt_source="gepa"`
You want interpretable, accumulated rules	`PromptGradOptimizer` either way
You have < 20 training examples	`ContraPromptOptimizer` is faster and more sample-efficient
Task has clear contrastive pairs (right vs. wrong label)	`ContraPromptOptimizer` instead of GEPA path
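
The table can also be read as a small decision function. The sketch below is one possible encoding. `recommend_optimizer` is a hypothetical helper, the order in which rows take precedence is an assumption, and the final fallback to GEPA alone is inferred from the section title, since the table has no such row.

```python
def recommend_optimizer(
    n_train: int,
    has_contrastive_pairs: bool,
    gepa_plateaued: bool,
    gepa_found_strong_base: bool,
) -> str:
    """Return a recommendation string following the rows of the table."""
    # Small datasets and contrastive tasks are checked first (assumed precedence).
    if n_train < 20 or has_contrastive_pairs:
        return "ContraPromptOptimizer"
    if gepa_plateaued and gepa_found_strong_base:
        return 'PromptGradOptimizer(base_prompt_source="gepa")'
    if gepa_plateaued:
        return "PromptGradOptimizer"
    # Assumed fallback: GEPA is still improving, so keep using it alone.
    return "GEPA"


print(recommend_optimizer(12, False, True, True))
print(recommend_optimizer(200, False, True, True))
print(recommend_optimizer(200, False, True, False))
```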
