Choosing an Optimizer
When to Use ContraPrompt vs. PromptGrad
When to Use ContraPrompt
ContraPrompt is the right choice when model performance on the task is high-variance — the model sometimes gets things right and sometimes wrong on similar inputs. The algorithm mines its learning signal from the contrast between failure and success, so it needs that variance to exist.
This makes it naturally suited to tasks with diverse, categorizable failure modes: retrieval-augmented QA (HotPotQA), multi-label domain classification (GDPR-Bench), and expert-level reasoning (GPQA) all have distinct error types per example, and ContraPrompt's grouped synthesis writes a targeted rule for each failure category. Empirically it is the strongest optimizer on all three.
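The grouped-synthesis idea can be sketched as follows. This is a minimal illustration, not ContraPrompt's actual API: the example records, the `synthesize_rules` helper, and the rule template are all hypothetical stand-ins for what would really be an LLM-driven synthesis step.

```python
# Hypothetical sketch of ContraPrompt-style grouped synthesis:
# failures are bucketed by error category, and one targeted rule is
# written per bucket from the contrast between a failed attempt and
# a successful retry. All names here are illustrative.
from collections import defaultdict

def synthesize_rules(examples):
    """examples: dicts with 'category', 'failure', and 'success' keys."""
    groups = defaultdict(list)
    for ex in examples:
        # Only failure/success contrast pairs carry a learning signal.
        if ex["failure"] and ex["success"]:
            groups[ex["category"]].append(ex)
    rules = {}
    for category, pairs in groups.items():
        # In the real optimizer an LLM would write the rule; a template
        # built from the contrast pairs stands in for it here.
        rules[category] = (
            f"For '{category}' inputs, avoid the pattern seen in "
            f"{len(pairs)} failed attempt(s); follow the successful retries."
        )
    return rules

examples = [
    {"category": "multi-hop retrieval", "failure": "wrong doc", "success": "right doc"},
    {"category": "label confusion", "failure": "picked one label", "success": "picked all"},
    {"category": "multi-hop retrieval", "failure": "stopped early", "success": "kept going"},
]
rules = synthesize_rules(examples)
# One rule per distinct failure category (two categories here).
```

The grouping step is what makes the rules targeted: each rule only has to explain one kind of mistake, rather than averaging over every failure in the batch.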
If the task has multiple qualitatively different ways to fail — and the model can recover from at least some of them — ContraPrompt will extract richer and more specific rules than PromptGrad can.
When to Use PromptGrad
PromptGrad is the right choice when model failures are low-variance and systematic: the model fails in the same predictable ways across the training set, regardless of retries. It does not require even a single successful retry; consistent failure patterns within a batch are enough to compute a useful gradient.
This makes it structurally better for tasks where the model is near its capability ceiling and self-correction rarely helps. It is also the safer choice when the baseline is already high: its strict per-rule validation acts as a strong regularizer — on MATH-500 it accepted 2 rules out of 54 candidates and still improved, then rolled back when a later epoch added 14 rules and performance dropped. The discipline is the feature.
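The strict validation and epoch-level rollback described above can be sketched like this. Everything here is an assumption about the mechanism, not PromptGrad's real interface: `evaluate`, `validate_rules`, and `run_epoch` are hypothetical names, and the toy evaluator simply rewards rules tagged "good" so the accept/reject behavior is visible.

```python
# Hypothetical sketch of PromptGrad-style per-rule validation with
# epoch-level rollback. evaluate() stands in for a held-out accuracy
# measurement; all names are illustrative.

def validate_rules(candidates, accepted, evaluate):
    """Accept a candidate rule only if it strictly improves the score."""
    score = evaluate(accepted)
    for rule in candidates:
        trial = accepted + [rule]
        trial_score = evaluate(trial)
        if trial_score > score:          # strict: must improve, not just tie
            accepted, score = trial, trial_score
    return accepted, score

def run_epoch(rules, candidates, train_eval, holdout_eval):
    """Roll back the whole epoch if held-out performance regressed."""
    before = holdout_eval(rules)
    new_rules, _ = validate_rules(candidates, list(rules), train_eval)
    if holdout_eval(new_rules) < before:
        return rules                     # rollback: keep the old rule set
    return new_rules

# Toy evaluator: rules containing "good" help, others hurt.
def evaluate(rules):
    return (0.75
            + 0.05 * sum(1 for r in rules if "good" in r)
            - 0.05 * sum(1 for r in rules if "bad" in r))

accepted, score = validate_rules(
    ["good rule A", "bad rule", "good rule B"], [], evaluate)
# Only the two improving rules are accepted; the regressive one is rejected.
```

The strict `>` comparison is the regularizer: a rule that merely ties the current score is discarded, which is how a run can end with 2 accepted rules out of 54 candidates.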
As a rule of thumb: run a quick retry probe before choosing — if the model's retry success rate is near zero, use PromptGrad; if there is meaningful improvement on retry, use ContraPrompt.
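The retry probe itself is cheap to implement. A minimal sketch, assuming a stateless `model` callable and an `is_correct` grader for your task (both stand-ins, not part of either optimizer):

```python
# Hypothetical retry-probe sketch: among inputs the model first gets
# wrong, measure how often a retry succeeds.
import itertools

def retry_success_rate(model, is_correct, inputs, retries=3):
    failed = recovered = 0
    for x in inputs:
        if is_correct(model(x), x):
            continue                     # first attempt succeeded; not informative
        failed += 1
        # any() stops at the first successful retry.
        if any(is_correct(model(x), x) for _ in range(retries)):
            recovered += 1
    return recovered / failed if failed else 0.0

# Stub model that alternates wrong/right, simulating high variance.
calls = itertools.count()
def flaky_model(x):
    return x if next(calls) % 2 else "wrong"

is_correct = lambda out, x: out == x
rate = retry_success_rate(flaky_model, is_correct, [1, 2], retries=1)
# → 1.0 here: every first-attempt failure recovers on retry.
```

A rate near zero says the failures are systematic (PromptGrad territory); a meaningfully positive rate says the contrast signal ContraPrompt needs is present.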
Decision Rules
Failures are high-variance (model sometimes gets it right on retry) → ContraPrompt
Failures are low-variance and systematic (model fails the same way regardless of retries) → PromptGrad
Baseline is already high (>75%) → PromptGrad (its strict rule validation prevents regression)
Task has multiple distinct error categories → ContraPrompt (grouped synthesis targets each category separately)
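The decision rules above can be condensed into one helper. Two assumptions are mine, not the source's: the 0.05 threshold as a stand-in for "near zero" retry success, and the precedence that a high baseline outranks high variance when the rules conflict.

```python
# Hypothetical condensation of the decision rules. Thresholds and
# conflict precedence are illustrative assumptions.

def choose_optimizer(retry_success_rate, baseline_accuracy):
    if retry_success_rate < 0.05:    # near-zero recovery: systematic failures
        return "PromptGrad"
    if baseline_accuracy > 0.75:     # strong baseline: prioritize regression safety
        return "PromptGrad"
    # High-variance failures, ideally spanning distinct error categories,
    # give ContraPrompt the contrast signal its grouped synthesis needs.
    return "ContraPrompt"
```

For example, a task with a 30% retry success rate and a 50% baseline routes to ContraPrompt, while the same retry rate on an 80% baseline routes to PromptGrad.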