Vizpy

Introduction

Better prompts in one API call

Vizpy automatically optimizes your LLM prompts by learning from failures. One API call, dramatically better results.

Quickstart

pip install vizpy dspy-ai
export VIZPY_API_KEY="..."
export OPENAI_API_KEY="sk-..."   # or your provider's key — see Supported Models below
import dspy
import vizpy
 
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # use any model you have access to
 
# 1. Define your task
class Sentiment(dspy.Signature):
    """Classify the sentiment of a product review."""
    review: str = dspy.InputField()
    label: str = dspy.OutputField(desc="One of: POSITIVE, NEGATIVE, NEUTRAL")
 
module = dspy.ChainOfThought(Sentiment)
 
# 2. Define a metric — feedback is what the optimizer learns from
def metric(example, pred):
    expected = example["label"]
    actual = pred.label.strip().upper()
    return vizpy.Score(
        value=1.0 if expected == actual else 0.0,
        is_success=expected == actual,
        feedback="" if expected == actual else f"Expected {expected}, got {actual}.",
    )
 
# 3. A handful of labelled examples
train = [
    {"review": "Broke after one week.", "label": "NEGATIVE"},
    {"review": "Exceeded my expectations, very happy.", "label": "POSITIVE"},
    {"review": "Works as described, nothing special.", "label": "NEUTRAL"},
    {"review": "Stopped working on day two.", "label": "NEGATIVE"},
    {"review": "Solid build quality, does exactly what it promises.", "label": "POSITIVE"},
]
 
# 4. Optimize
optimizer = vizpy.ContraPromptOptimizer(metric=metric)
optimized = optimizer.optimize(module, train_examples=train)
 
# 5. Use the result — same interface, better instructions
print(optimized(review="Feels cheap and the buttons stick.").label)  # NEGATIVE

Supported Models

Vizpy only optimizes the prompt — it never calls your model directly. You configure the model through DSPy, and Vizpy works with whatever you point it at.

Type               Examples
Hosted APIs        OpenAI, Anthropic, Mistral, Google Gemini, Cohere
Self-hosted        Ollama, vLLM, LM Studio, any OpenAI-compatible endpoint
Custom endpoints   Internal proxies, fine-tuned models behind an API
# Hosted API
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))
dspy.configure(lm=dspy.LM("anthropic/claude-haiku-4-5-20251001"))
 
# Self-hosted via Ollama
dspy.configure(lm=dspy.LM("ollama/llama3", api_base="http://localhost:11434"))
 
# Any OpenAI-compatible endpoint (vLLM, LM Studio, internal proxy, etc.)
dspy.configure(lm=dspy.LM("openai/your-model", api_base="http://your-host/v1", api_key="..."))

The optimizer runs on Vizpy's servers using your VIZPY_API_KEY. Your model and your data stay wherever you host them.

The Problem

Prompts fail for specific, fixable reasons — the model has the wrong mental model of your task. You can't fix this by rewriting words. You need to know what the model thinks it's supposed to do, and correct that.

Vizpy finds that mismatch. It runs your examples, extracts the rule that explains each failure, validates that the rule actually helps, and synthesizes everything into precise instructions you can read.

Quick Example

GPT-4o-mini misclassifies workflow blockers as CRITICAL because it pattern-matches on "ASAP" and "blocking" instead of reasoning about impact. CRITICAL should be reserved for production outages and security incidents — not a broken CI pipeline.

import dspy
import vizpy
 
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))
 
class EmailUrgency(dspy.Signature):
    """Classify the urgency level of an email."""
    email: str = dspy.InputField()
    urgency: str = dspy.OutputField(desc="One of: CRITICAL, HIGH, MEDIUM, LOW")
 
module = dspy.ChainOfThought(EmailUrgency)
 
# Before optimization
test = "Blocking our sprint. Tests failing, can't merge. Need help ASAP."
print(module(email=test).urgency)   # CRITICAL  ← wrong, this is HIGH
 
# Metric feedback tells the optimizer *why* it's wrong
def metric(example, prediction):
    expected, actual = example["gold_urgency"], prediction.urgency.strip().upper()
    is_correct = expected == actual
    feedback = (
        "CRITICAL = customer-facing outage/breach. Workflow blockers = HIGH."
        if not is_correct and expected == "HIGH" and actual == "CRITICAL" else
        f"Expected {expected}, got {actual}." if not is_correct else ""
    )
    return vizpy.Score(value=1.0 if is_correct else 0.0, is_success=is_correct, feedback=feedback)
 
optimizer = vizpy.PromptGradOptimizer(metric=metric)
# train_examples / val_examples: lists of {"email": ..., "gold_urgency": ...} dicts
optimized = optimizer.optimize(module, train_examples, val_examples)
 
# After optimization
print(optimized(email=test).urgency)  # HIGH  ← correct

What the optimizer learned:

"CRITICAL = customer-facing impact (outage, data loss, security breach). HIGH = internal team velocity blocked (CI, staging, sprint). This distinction applies even when the email uses urgent language — impact radius determines level, not tone."

That rule is injected into the module's instructions. You can read it, audit it, and edit it if it's wrong.

See the full example with training data →

Key Features

One API Call

Pass your module, examples, and metric. Get back an optimized module with better instructions.

Learns from Failures

Extracts the rule that explains each failure, rather than just collecting examples of getting it right.

Validates Every Rule

Each candidate rule is tested on held-out examples before being applied. Regressions are rejected.

Interpretable

Learned rules are plain English. You can read exactly what changed and why.

Works with DSPy

Pass any dspy.Module, get back an optimized dspy.Module. Your signature and structure are unchanged.

Two Optimizers

ContraPromptOptimizer for classification tasks. PromptGradOptimizer for generation and rubric-based metrics.

How It Works

Solve with retries

Each training example is run through your module. On failure, the module retries using your metric's feedback field as a hint. This generates contrastive pairs — the wrong attempt and the corrected one.
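The retry loop can be sketched in plain Python (a minimal sketch; `run_module`, `Pair`, and the retry budget are illustrative stand-ins, not Vizpy's actual internals):

```python
from dataclasses import dataclass

@dataclass
class Pair:
    example: dict
    failed_output: object
    corrected_output: object
    feedback: str

def solve_with_retries(run_module, metric, example, max_retries=2):
    """Run one example; on failure, retry with the metric's feedback as a hint.

    Returns a contrastive Pair when a retry succeeds where the first
    attempt failed, else None.
    """
    first = run_module(example, hint=None)
    score = metric(example, first)
    if score.is_success:
        return None  # nothing to learn from examples the module already gets right
    for _ in range(max_retries):
        retry = run_module(example, hint=score.feedback)
        if metric(example, retry).is_success:
            return Pair(example, first, retry, score.feedback)
    return None  # feedback wasn't enough to correct this one
```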

Mine the signal

The optimizer selects pairs where the gap between failure and success is largest. These are the cases that most clearly reveal where the model's understanding breaks down.
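In outline, pair selection is a ranking by score gap (a sketch; the tuple shape and the `top_k` default are assumptions):

```python
def mine_contrastive_pairs(scored_pairs, top_k=8):
    """Keep the pairs where the corrected attempt improves most over the
    failed one; these localize where the model's understanding breaks down.

    scored_pairs: iterable of (pair, failed_score, corrected_score) tuples.
    """
    ranked = sorted(scored_pairs, key=lambda t: t[2] - t[1], reverse=True)
    return [pair for pair, _, _ in ranked[:top_k]]
```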

Extract rules

An LLM analyzes the pairs and generates candidate rules: "When X happens, do Y instead of Z." The feedback from your metric shapes these rules directly — more specific feedback produces more precise rules.
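The analysis prompt for one pair might look like the following sketch (the template and argument names are illustrative; Vizpy does not publish its actual prompt):

```python
def rule_extraction_prompt(example, failed, corrected, feedback):
    """Build an analysis prompt for one contrastive pair (illustrative template)."""
    return (
        "A model answered the same input twice.\n"
        f"Input: {example}\n"
        f"Failed answer: {failed}\n"
        f"Corrected answer: {corrected}\n"
        f"Grader feedback: {feedback}\n"
        "State one general rule of the form 'When X happens, do Y instead of Z' "
        "that explains why the correction succeeds where the failure did not."
    )
```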

Validate

Each candidate rule is tested independently on held-out examples. A rule is only accepted if it improves the score without causing regressions elsewhere.
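That accept/reject check can be sketched as follows (the names and the exact no-regression policy are assumptions):

```python
def validate_rule(rule, baseline_scores, eval_with_rule, val_examples):
    """Accept a candidate rule only if it raises the mean held-out score
    and never regresses an example the baseline already got right.

    eval_with_rule(rule, example) -> float in [0, 1]; names are illustrative.
    """
    new_scores = [eval_with_rule(rule, ex) for ex in val_examples]
    regressed = any(
        new < old for new, old in zip(new_scores, baseline_scores) if old == 1.0
    )
    improved = sum(new_scores) / len(new_scores) > sum(baseline_scores) / len(baseline_scores)
    return improved and not regressed
```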

Synthesize and inject

Accepted rules are merged into clear instructions and injected into your module's prompt. The original signature and structure are preserved — only the instructions change.
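Instruction synthesis amounts to appending the accepted rules to the original task description (a sketch; the exact template is an assumption):

```python
def synthesize_instructions(base_instructions, accepted_rules):
    """Merge accepted rules into the module's instruction text.

    Only the instructions change; the signature's fields stay as declared.
    The formatting here is illustrative, not Vizpy's exact template.
    """
    bullets = "\n".join(f"- {rule}" for rule in accepted_rules)
    return f"{base_instructions}\n\nRules learned from failures:\n{bullets}"
```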

Iterate

The loop repeats with the updated module. Each round builds on the last. Early stopping triggers when no further improvement is found.
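The outer loop with early stopping might look like this (a sketch; `step`, `score_fn`, and the patience policy are illustrative):

```python
def optimize_loop(step, score_fn, module, max_rounds=5, patience=1):
    """Repeat the mine/extract/validate/inject cycle, keeping the best
    module and stopping early once rounds stop improving the score.
    """
    best, best_score, stale = module, score_fn(module), 0
    for _ in range(max_rounds):
        candidate = step(best)           # one full optimization round
        candidate_score = score_fn(candidate)
        if candidate_score > best_score:
            best, best_score, stale = candidate, candidate_score, 0
        else:
            stale += 1
            if stale > patience:
                break                    # no further improvement found
    return best
```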

Pricing

Simple, predictable pricing. One credit = one optimize() call.

Plan         Price   Credits/Month   Best For
Free         $0      10              Trying it out
Pro          $20     200             Indie devs
Enterprise   $200    1,000           Scale

View full pricing →
