Regression-check walkthrough

regression_check is the second experiment shape, added in v0.2. Use it when you have a known-good baseline plus a candidate change and want a verdict on whether the change introduces regressions. The failure_rescue shape (the v0.1 default) is the inverse — known-bad failures plus a proposed rescue.

When to use which shape

  • Hot-fix for a class of failed traces; want to verify the fix rescues failures without regressing healthy ones → failure_rescue

  • Prompt change, model swap, or new tool definition; want to verify production-quality traces stay green → regression_check

  • Continuous PR gate where the proposed change isn’t tied to a known failure class → regression_check

  • Exploratory A/B between two prompt variants without a target verdict → deferred to a future shape

The shapes are not interchangeable. Picking regression_check when you actually have failures to rescue silently skips the failure-cohort guards and produces a verdict that doesn’t answer your real question.

Minimal config

experiment_shape: regression_check
source:
  adapter: langfuse
target:
  runner: "python:my_agent.replay:run"
selection:
  baseline_cohort:
    limit: 40
    filter: "score-above:0.8,since:7d"
  # NOTE: no failure_cohort. The config-loader rejects it for
  # regression_check (named-field error) — the cohort is meaningless
  # when the baseline itself is what's under test.
change:
  system_prompt: "Updated system prompt under test..."
scorer:
  adapter: inspect_ai
  score_fn: "python:my_pkg.scorers:faithfulness"
  judge_provider: anthropic
  judge_model_id: claude-haiku-4-5
  rubric_id: faithfulness-v1
  rubric_text: "Score 0-1 by faithfulness to the original output."
decision: {}
reporting:
  profile: default
timeouts:
  replay_seconds: 60.0
  score_seconds: 30.0
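
Before handing the file to the CLI you can sanity-check the named-field constraint yourself. A minimal pre-flight sketch, assuming PyYAML is installed and the file is saved as whatifd.config.yaml; the real enforcement lives in whatifd's config-loader, this just fails earlier, inside your own harness:

import yaml

with open("whatifd.config.yaml") as fh:
    cfg = yaml.safe_load(fh)

# regression_check has no failure cohort; the config-loader rejects one with
# a named-field error, so catch the mismatch before invoking the CLI.
assert cfg["experiment_shape"] == "regression_check", cfg["experiment_shape"]
assert "failure_cohort" not in cfg.get("selection", {}), (
    "failure_cohort is only valid for failure_rescue configs"
)
assert "baseline_cohort" in cfg.get("selection", {}), (
    "regression_check still needs a baseline_cohort selection"
)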

Run it the same way:

export LANGFUSE_HOST="https://cloud.langfuse.com"
export LANGFUSE_PUBLIC_KEY="pk-lf-..."
export LANGFUSE_SECRET_KEY="sk-lf-..."
export ANTHROPIC_API_KEY="sk-ant-..."

whatifd fork --config whatifd.config.yaml

What the report looks like

The verdict surface and exit codes are identical to failure_rescue:

  • Exit 0: Ship (no significant regressions detected)

  • Exit 1: Don't Ship (regression signal exceeds policy thresholds)

  • Exit 2: Inconclusive (floor failure, setup failure, or CI not available)
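
If you gate a pipeline on these codes programmatically rather than in shell, the mapping is mechanical. A sketch, assuming the CLI invocation shown above and that the required environment variables are already exported:

import subprocess
import sys

# Exit codes: 0 = Ship, 1 = Don't Ship, 2 = Inconclusive (floor/setup failure).
result = subprocess.run(["whatifd", "fork", "--config", "whatifd.config.yaml"])
if result.returncode == 0:
    print("Ship: no significant regressions detected")
elif result.returncode == 1:
    print("Don't Ship: regression signal exceeds policy thresholds")
else:
    print("Inconclusive: treat as a blocked gate, not a pass")
sys.exit(result.returncode)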

Two report-shape differences worth knowing:

  1. cohort_results contains only the baseline cohort. Walkthrough renderers and downstream tooling that loop over cohort_results see one entry instead of two.

  2. methodology.cohorts declares ("baseline",) instead of ("failure", "baseline") — preserving cardinal #10’s “claims must match the design.” A consumer that asserts on the cohort list will see the regression-check shape declared up-front.

The decision_findings surface still contains regression evidence, just framed against the baseline-on-baseline diff rather than the failure-rescue paired comparison.
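
If downstream tooling asserts on these surfaces, the checks are short. A sketch, assuming the report has been persisted as JSON under the same field names described above; the path, serialization, and exact nesting are assumptions, so adapt to however you store the report in your pipeline:

import json

with open("report.json") as fh:
    report = json.load(fh)

# regression_check declares a single cohort up-front (cardinal #10:
# claims must match the design).
assert tuple(report["methodology"]["cohorts"]) == ("baseline",)

# Only the baseline cohort appears; renderers that loop over
# cohort_results see one entry instead of two.
assert len(report["cohort_results"]) == 1

for cohort in report["cohort_results"]:
    print(cohort)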

How the guard chain differs

The verdict-policy guard chain branches on experiment_shape:

  • replay_validity: runs for both shapes

  • baseline_coverage: runs for both shapes

  • ci_availability: runs for both shapes

  • cache_staleness: runs for both shapes

  • primary_endpoint (cardinal #10): runs for both shapes

  • practical_delta (failure-side movement): runs for failure_rescue, skipped for regression_check

  • improvement_observation (failure-cohort rescue evidence): runs for failure_rescue, skipped for regression_check

The skipped guards are about failure-cohort movement; they don’t apply when there’s no failure cohort. The floor still applies — a regression_check run with stale cache or unavailable CI bounds still produces Inconclusive regardless of policy.
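
Conceptually the branch is just a smaller guard list, with the floor evaluated either way. A sketch of that shape-dependent selection, not whatifd's actual internals, using the guard names from the list above:

# Guards shared by both shapes; the floor (cache staleness, CI availability)
# still forces Inconclusive regardless of policy.
COMMON_GUARDS = (
    "replay_validity",
    "baseline_coverage",
    "ci_availability",
    "cache_staleness",
    "primary_endpoint",
)

# Failure-cohort guards only make sense when a failure cohort exists.
FAILURE_ONLY_GUARDS = ("practical_delta", "improvement_observation")


def guards_for(shape: str) -> tuple[str, ...]:
    if shape == "failure_rescue":
        return COMMON_GUARDS + FAILURE_ONLY_GUARDS
    if shape == "regression_check":
        return COMMON_GUARDS
    raise ValueError(f"unknown experiment_shape: {shape}")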

Cardinal-#10 implications

The primary_endpoint guard adapts to the shape: for regression_check the primary endpoint is the baseline cohort’s metric stability (the “no regression” claim), whereas for failure_rescue it’s the paired delta between failure-original and failure-replayed. The methodology disclosure in the report makes the active endpoint explicit; consumers don’t have to infer it from the shape.

Common pitfalls

  • Cohort filter pulls regressed traces. If your baseline filter accidentally selects traces that have already regressed, the verdict comes back “stable” because the baseline itself already encodes the regression. Use score-above: filters that exceed the historical baseline median, plus since: bounds tight enough to exclude the candidate-change window.

  • Sample size too small. Cohort limit: 40 is a reasonable default; limit: 10 produces wide CIs and tends toward Inconclusive even on real changes. Cardinal #10 enforces “CI must be available”; small samples hit the ci_availability floor failure first (see the sketch after this list for the scaling intuition).

  • Wrong shape silently skips guards. Picking regression_check when you actually have a known-bad cohort appears to work but skips the failure-side checks. The verdict will be Ship as long as the baseline holds, even if the failures didn’t change. Match the shape to the question.
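
The sample-size intuition is plain square-root scaling: the half-width of a normal-approximation confidence interval shrinks with 1/sqrt(n), so going from limit: 10 to limit: 40 halves it. A back-of-the-envelope sketch, not whatifd's actual interval estimator; the score standard deviation of 0.15 is an invented illustration value:

import math

SCORE_SD = 0.15  # assumed spread of per-trace scores, for illustration only
Z_95 = 1.96      # two-sided 95% normal quantile

for n in (10, 40):
    half_width = Z_95 * SCORE_SD / math.sqrt(n)
    print(f"n={n:>2}: 95% CI half-width on the mean score ~ {half_width:.3f}")

# n=10 gives roughly 0.093, n=40 roughly 0.046: the wider interval is far
# more likely to straddle the decision threshold and land Inconclusive.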

Programmatic equivalent

If you’d rather drive this programmatically (e.g., for a custom test harness):

from whatifd.pipeline import run_pipeline
from whatifd.types.policy import TrustFloor, DecisionPolicy
from whatifd.types.manifest import RunManifest

manifest = RunManifest(
    # ... ordinary RunManifest fields ...
    experiment_shape="regression_check",
)

report = run_pipeline(
    trace_source=...,
    delta_fn=...,
    floor=TrustFloor(),
    policy=DecisionPolicy(),
    runtime=manifest,
    methodology=...,
    cache_summary=...,
)

experiment_shape is a RunManifest field; the CLI threads it from the config YAML into the manifest, but constructing the manifest yourself works the same way.