Migration from DSPy

Migrating from DSPy optimizers (MIPROv2, BootstrapFewShot, COPRO) to Vizpy

If you're using DSPy's built-in optimizers (MIPROv2, BootstrapFewShot, COPRO), migrating to Vizpy takes a few small code changes: a new metric return type, a renamed optimizer call, and plain dicts in place of dspy.Example. The conceptual difference is deeper and worth understanding.

What Changes — and Why

DSPy's built-in optimizers are demonstration-based: they search for good few-shot examples to prepend to your prompt. They work by running your module on a training set, finding high-scoring traces, and using those as demonstrations.

Vizpy's optimizers are failure-based: they analyze why predictions are wrong, extract correction rules from that analysis, and synthesize those rules into instructions. The optimizer reads your metric feedback, not just your metric score.

This distinction matters most when:

  • The model is making a systematic error (always does X when it should do Y)
  • The error requires understanding context, not just seeing more examples
  • You want to know what rule was learned, not just whether the score went up
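
The contrast between the two approaches can be sketched in a few lines. This is a toy illustration only: Trace, pick_demonstrations, and collect_failure_feedback are hypothetical names invented here, and neither DSPy nor Vizpy is claimed to work this way internally.

```python
from dataclasses import dataclass


@dataclass
class Trace:
    """One evaluated prediction (hypothetical stand-in, not a DSPy or Vizpy type)."""
    inputs: str
    output: str
    score: float
    feedback: str = ""


def pick_demonstrations(traces, k=2):
    """Demonstration-based: keep the k highest-scoring traces as few-shot examples."""
    best = sorted(traces, key=lambda t: t.score, reverse=True)[:k]
    return [(t.inputs, t.output) for t in best]


def collect_failure_feedback(traces, threshold=1.0):
    """Failure-based: keep the feedback text of every trace scoring below the threshold."""
    return [t.feedback for t in traces if t.score < threshold and t.feedback]


traces = [
    Trace("fix typo in README", "docs", 1.0),
    Trace("add dark mode toggle", "feat", 1.0),
    Trace("rename helper, no behaviour change", "fix", 0.0,
          "Classified as 'fix', should be 'refactor'."),
]

demos = pick_demonstrations(traces)          # only the successes survive
failures = collect_failure_feedback(traces)  # only the explanations of failures survive
```

The first function never looks at the failing trace, while the second looks at nothing else. That is why the feedback text matters so much more in a failure-based setup.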

The Metric Change

This is the change that needs the most care. DSPy metrics return a bool or float; Vizpy metrics return a vizpy.Score. As the Before and After snippets show, the Vizpy metric also drops the trace parameter.

Before:

def metric(example, prediction, trace=None) -> bool:
    return prediction.answer.lower() == example["answer"].lower()

After:

import vizpy

def metric(example, prediction) -> vizpy.Score:
    correct = prediction.answer.lower() == example["answer"].lower()
    return vizpy.Score(
        value=1.0 if correct else 0.0,
        is_success=correct,
        # Feedback is only needed on failures; leave it empty on success.
        feedback=f"Expected '{example['answer']}', got '{prediction.answer}'" if not correct else "",
    )

The feedback field is optional — you'll get a working optimizer without it. But feedback is how the optimizer understands why a prediction failed, which determines the quality of the rules it generates. Richer feedback → more precise rules.
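
If you have several existing DSPy-style metrics, a small adapter can be a convenient first step. The sketch below is an assumption-laden illustration: Score is a local stand-in that mirrors only the value, is_success, and feedback fields used in this guide (swap in vizpy.Score in real code), and wrap_legacy_metric, describe_failure, and success_threshold are hypothetical names, not part of either library.

```python
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass
class Score:
    """Local stand-in mirroring the vizpy.Score fields shown in this guide."""
    value: float
    is_success: bool
    feedback: str = ""


def wrap_legacy_metric(legacy_metric, describe_failure=None, success_threshold=1.0):
    """Adapt a bool/float metric so it returns a Score-shaped result."""
    def metric(example, prediction):
        value = float(legacy_metric(example, prediction))  # True -> 1.0, False -> 0.0
        ok = value >= success_threshold
        feedback = ""
        if not ok and describe_failure is not None:
            feedback = describe_failure(example, prediction)
        return Score(value=value, is_success=ok, feedback=feedback)
    return metric


# An unchanged DSPy-style metric, including the unused trace parameter.
def legacy(example, prediction, trace=None):
    return prediction.answer.lower() == example["answer"].lower()


metric = wrap_legacy_metric(
    legacy,
    describe_failure=lambda ex, pred: f"Expected '{ex['answer']}', got '{pred.answer}'",
)

wrong = metric({"answer": "Paris"}, SimpleNamespace(answer="London"))
right = metric({"answer": "Paris"}, SimpleNamespace(answer="paris"))
```

Without describe_failure the adapter still produces a valid score but leaves feedback empty, which corresponds to the "working optimizer without it" case described above.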


API Changes

| DSPy | Vizpy | Notes |
|---|---|---|
| optimizer.compile(module, trainset=examples) | optimizer.optimize(module, examples) | compile → optimize; the training set is passed positionally |
| dspy.MIPROv2(metric=metric) | vizpy.ContraPromptOptimizer(metric=metric) | or PromptGradOptimizer |
| dspy.BootstrapFewShot(metric=metric) | vizpy.ContraPromptOptimizer(metric=metric) | |
| Metric returns bool or float | Metric returns vizpy.Score | Add feedback for best results |

Full Migration Example

Here's a before/after for a real task: classifying commit messages by type.

The DSPy version finds good few-shot examples. The Vizpy version learns the rule that separates feat from fix from refactor — which is harder to convey with examples alone when the commits are ambiguous.

Before (DSPy MIPROv2):

import dspy
 
class ClassifyCommit(dspy.Signature):
    """Classify a git commit message by type."""
    message: str = dspy.InputField()
    commit_type: str = dspy.OutputField(desc="One of: feat, fix, refactor, docs, test, chore")
 
module = dspy.Predict(ClassifyCommit)
 
def metric(example, prediction, trace=None):
    return prediction.commit_type.lower() == example["commit_type"].lower()
 
examples = [dspy.Example(**ex).with_inputs("message") for ex in train_data]
 
optimizer = dspy.MIPROv2(metric=metric, auto="light")
optimized = optimizer.compile(module, trainset=examples)

After (Vizpy ContraPromptOptimizer):

import dspy
import vizpy
 
class ClassifyCommit(dspy.Signature):
    """Classify a git commit message by type."""
    message: str = dspy.InputField()
    commit_type: str = dspy.OutputField(desc="One of: feat, fix, refactor, docs, test, chore")
 
module = dspy.Predict(ClassifyCommit)
 
# The feedback explains the key distinction the model keeps missing
COMMIT_TYPE_RULES = {
    "feat": "feat = new user-visible behaviour that didn't exist before",
    "fix": "fix = corrects behaviour that was broken; user experienced a bug",
    "refactor": "refactor = code restructure with no behaviour change; user sees nothing",
    "docs": "docs = only documentation files changed",
    "test": "test = only test files changed",
    "chore": "chore = tooling, deps, CI — nothing that touches runtime behaviour",
}
 
def metric(example, prediction):
    expected = example["commit_type"].lower()
    actual = prediction.commit_type.strip().lower()
    is_correct = expected == actual
 
    feedback = ""
    if not is_correct:
        feedback = (
            f"Classified as '{actual}', should be '{expected}'. "
            f"Rule for '{expected}': {COMMIT_TYPE_RULES.get(expected, '')} "
            f"Rule for '{actual}': {COMMIT_TYPE_RULES.get(actual, '')}"
        )
 
    return vizpy.Score(
        value=1.0 if is_correct else 0.0,
        is_success=is_correct,
        feedback=feedback,
        error_type=f"{actual}_as_{expected}" if not is_correct else "",
    )
 
optimizer = vizpy.ContraPromptOptimizer(metric=metric)
optimized = optimizer.optimize(module, train_data)

The key insight: by providing the rule for both the wrong label and the right label in the feedback, the optimizer can extract a precise decision boundary — not just "use feat more often" but "feat = new user-visible behaviour; fix = corrects broken behaviour."
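
Before optimizing, it can help to check which confusions actually dominate, so you know which rules the feedback needs to cover. The sketch below tallies the error_type strings produced by a metric like the one above. Score is again a local stand-in for vizpy.Score, confusion_counts is a hypothetical helper, and how Vizpy itself consumes error_type is not described in this guide.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Score:
    """Local stand-in mirroring the vizpy.Score fields used in the example above."""
    value: float
    is_success: bool
    feedback: str = ""
    error_type: str = ""


def confusion_counts(scores):
    """Count failures by error_type, ignoring successes and unlabeled failures."""
    return Counter(s.error_type for s in scores if not s.is_success and s.error_type)


# Error types follow the "{actual}_as_{expected}" pattern from the metric above.
scores = [
    Score(1.0, True),
    Score(0.0, False, error_type="fix_as_refactor"),
    Score(0.0, False, error_type="fix_as_refactor"),
    Score(0.0, False, error_type="feat_as_fix"),
]

counts = confusion_counts(scores)
top_confusion = counts.most_common(1)[0]
```

If one pair dominates, that is the boundary whose two rules deserve the most careful wording in COMMIT_TYPE_RULES.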


Migration Checklist

Update your metric signature

Change return type from bool/float to vizpy.Score. Add feedback that explains why a prediction is wrong in concrete, rule-shaped language.

Remove DSPy Example wrapping

DSPy requires dspy.Example(**ex).with_inputs("field"). Vizpy takes plain dicts — pass train_data directly.
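
Since nothing wraps the records any more, a quick sanity check before calling the optimizer can catch malformed rows early. This is a minimal sketch: check_records is a hypothetical helper written for this guide, and the required field names are taken from the commit classification example.

```python
def check_records(records, required_fields):
    """Return a list of problems; an empty list means every record looks usable."""
    problems = []
    for i, record in enumerate(records):
        if not isinstance(record, dict):
            problems.append(f"record {i}: expected dict, got {type(record).__name__}")
            continue
        missing = [field for field in required_fields if field not in record]
        if missing:
            problems.append(f"record {i}: missing {', '.join(missing)}")
    return problems


train_data = [
    {"message": "add dark mode toggle", "commit_type": "feat"},
    {"message": "bump eslint to v9"},             # label was never filled in
    ("fix crash on empty input", "fix"),          # a tuple, not a dict
]

problems = check_records(train_data, required_fields=("message", "commit_type"))
```

An empty result means every record is a dict carrying the fields that the signature and the metric read.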

Change the optimizer call

optimizer.compile(module, trainset=examples) → optimizer.optimize(module, examples)

Choose your optimizer

Use ContraPromptOptimizer if you're replacing MIPROv2 or BootstrapFewShot. Use PromptGradOptimizer if you have a larger dataset (50+ examples) and want batch-level gradient analysis.


Choosing Between the Two Optimizers

If you're not sure which to use, start with ContraPromptOptimizer. It's faster, more interpretable, and works well for most classification and extraction tasks.

Use PromptGradOptimizer when:

  • Your metric is a continuous score (rubric-based evaluation, semantic similarity)
  • You have 50+ training examples
  • The failure mode is distributed across many examples rather than appearing as clear contrastive pairs
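
To make the first condition concrete, here is a minimal sketch of a rubric-based metric that returns a continuous score. Everything in it is illustrative: Score is a local stand-in for vizpy.Score, the rubric items, the summary and ticket_id fields, and the pass_mark threshold are assumptions made up for this example.

```python
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass
class Score:
    """Local stand-in mirroring the vizpy.Score fields shown in this guide."""
    value: float
    is_success: bool
    feedback: str = ""


# Hypothetical rubric: each entry pairs a human-readable rule with a check.
RUBRIC = [
    ("summary is at most 20 words", lambda ex, pred: len(pred.summary.split()) <= 20),
    ("summary mentions the ticket id", lambda ex, pred: ex["ticket_id"] in pred.summary),
    ("summary ends with a period", lambda ex, pred: pred.summary.rstrip().endswith(".")),
]


def rubric_metric(example, prediction, pass_mark=1.0):
    """Score is the fraction of rubric checks passed; feedback names the failed ones."""
    failed = [rule for rule, check in RUBRIC if not check(example, prediction)]
    value = 1.0 - len(failed) / len(RUBRIC)
    feedback = "Failed: " + "; ".join(failed) if failed else ""
    return Score(value=value, is_success=value >= pass_mark, feedback=feedback)


result = rubric_metric({"ticket_id": "BUG-12"}, SimpleNamespace(summary="Fixes login crash"))
```

Here the prediction passes one of three checks, so the score lands between 0 and 1 and the feedback names the two rules that were missed.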

See the examples section for side-by-side comparisons across different task types.
