# Regression-check walkthrough
`regression_check` is the second experiment shape, added in v0.2. Use it when you have a known-good baseline plus a candidate change and want a verdict on whether the change introduces regressions. The `failure_rescue` shape (the v0.1 default) is the inverse: known-bad failures plus a proposed rescue.
## When to use which shape
| Situation | Shape |
|---|---|
| Hot-fix for a class of failed traces; want to verify the fix rescues failures without regressing healthy ones | `failure_rescue` |
| Prompt change, model swap, or new tool definition; want to verify production-quality traces stay green | `regression_check` |
| Continuous PR gate where the proposed change isn't tied to a known failure class | `regression_check` |
| Exploratory A/B between two prompt variants without a target verdict | (deferred to a future shape) |
The shapes are not interchangeable. Picking `regression_check` when you actually have failures to rescue silently skips the failure-cohort guards and produces a verdict that doesn't answer your real question.
## Minimal config
```yaml
experiment_shape: regression_check

source:
  adapter: langfuse

target:
  runner: "python:my_agent.replay:run"

selection:
  baseline_cohort:
    limit: 40
    filter: "score-above:0.8,since:7d"
  # NOTE: no failure_cohort. The config-loader rejects it for
  # regression_check (named-field error) — the cohort is meaningless
  # when the baseline itself is what's under test.

change:
  system_prompt: "Updated system prompt under test..."

scorer:
  adapter: inspect_ai
  score_fn: "python:my_pkg.scorers:faithfulness"
  judge_provider: anthropic
  judge_model_id: claude-haiku-4-5
  rubric_id: faithfulness-v1
  rubric_text: "Score 0-1 by faithfulness to the original output."

decision: {}

reporting:
  profile: default

timeouts:
  replay_seconds: 60.0
  score_seconds: 30.0
```
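The NOTE above describes a named-field rejection in the config loader. A minimal sketch of that behavior — the function name and error message here are hypothetical, only the rule itself comes from the walkthrough:

```python
def validate_selection(experiment_shape: str, selection: dict) -> None:
    """Reject a failure_cohort under regression_check (sketch; names hypothetical)."""
    if experiment_shape == "regression_check" and "failure_cohort" in selection:
        raise ValueError(
            "selection.failure_cohort is not allowed for "
            "experiment_shape=regression_check"
        )
```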
Run it the same way:
```bash
export LANGFUSE_HOST="https://cloud.langfuse.com"
export LANGFUSE_PUBLIC_KEY="pk-lf-..."
export LANGFUSE_SECRET_KEY="sk-lf-..."
export ANTHROPIC_API_KEY="sk-ant-..."

whatifd fork --config whatifd.config.yaml
```
## What the report looks like
The verdict surface and exit codes are identical to `failure_rescue`:

| Exit code | Verdict | Meaning |
|---|---|---|
| 0 | Ship | no significant regressions detected |
| 1 | Don't Ship | regression signal exceeds policy thresholds |
| 2 | Inconclusive | floor failure, setup failure, or CI not available |
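If you wrap the CLI in a custom harness, the exit code is the whole contract. A sketch — the exit-code-to-verdict mapping is the one documented above, while the wrapper function itself is hypothetical:

```python
import subprocess

# Exit-code-to-verdict mapping, as documented above.
VERDICTS = {0: "Ship", 1: "Don't Ship", 2: "Inconclusive"}

def gate(config_path: str) -> str:
    """Run the fork command and translate its exit code into a verdict label."""
    proc = subprocess.run(["whatifd", "fork", "--config", config_path])
    return VERDICTS.get(proc.returncode, f"unknown exit code {proc.returncode}")
```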
Two report-shape differences worth knowing:

- `cohort_results` contains only the baseline cohort. Walkthrough renderers and downstream tooling that loop over `cohort_results` see one entry instead of two.
- `methodology.cohorts` declares `("baseline",)` instead of `("failure", "baseline")`, preserving cardinal #10's "claims must match the design." A consumer that asserts on the cohort list will see the regression-check shape declared up-front.

The `decision_findings` surface still contains regression evidence, just framed against the baseline-on-baseline diff rather than the failure-rescue paired comparison.
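A consumer that asserts on the cohort list can derive its expectation from the shape rather than hard-coding it. A sketch — the helper name is invented; the tuples mirror the `methodology.cohorts` values described above:

```python
def expected_cohorts(experiment_shape: str) -> tuple:
    """Cohorts that methodology.cohorts should declare for a given shape."""
    if experiment_shape == "regression_check":
        return ("baseline",)
    return ("failure", "baseline")
```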
## How the guard chain differs
The verdict-policy guard chain branches on `experiment_shape`:

| Guard | `failure_rescue` | `regression_check` |
|---|---|---|
| | runs | runs |
| | runs | runs |
| | runs | runs |
| | runs | runs |
| | runs | runs |
| | runs | skipped |
| | runs | skipped |
The skipped guards are about failure-cohort movement; they don't apply when there is no failure cohort. The floor still applies: a `regression_check` run with a stale cache or unavailable CI bounds still produces Inconclusive regardless of policy.
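The branch can be pictured as a filter over the guard list. A sketch with an invented `Guard` record and flag; the real guard names are not reproduced here:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Guard:
    name: str
    needs_failure_cohort: bool  # hypothetical flag marking failure-cohort guards

def active_guards(guards, experiment_shape):
    """Skip failure-cohort guards when the shape carries no failure cohort."""
    if experiment_shape == "regression_check":
        return [g for g in guards if not g.needs_failure_cohort]
    return list(guards)
```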
## Cardinal-#10 implications
The `primary_endpoint` guard adapts to the shape: for `regression_check` the primary endpoint is the baseline cohort's metric stability (the "no regression" claim), whereas for `failure_rescue` it's the paired delta between failure-original and failure-replayed. The methodology disclosure in the report makes the active endpoint explicit; consumers don't have to infer it from the shape.
## Common pitfalls
- **Cohort filter pulls regressed traces.** If your baseline filter accidentally selects traces that already regressed, the verdict says "stable" because the trace baseline already encoded the regression. Use `score-above:` filters that exceed the historical baseline median, plus `since:` bounds tight enough to exclude the candidate-change window.
- **Sample size too small.** Cohort `limit: 40` is a reasonable default; `limit: 10` produces wide CIs and tends toward Inconclusive even on real changes. Cardinal #10 enforces "CI must be available", so small samples hit the `ci_availability` floor failure first.
- **Wrong shape silently skips guards.** Picking `regression_check` when you actually have a known-bad cohort appears to work but skips the failure-side checks. The verdict will be Ship as long as the baseline holds, even if the failures didn't change. Match the shape to the question.
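The first pitfall is mechanical enough to guard against in code. A sketch that derives the `score-above:` threshold from historical baseline scores — the filter grammar is the one used in the config above, but the helper itself is hypothetical:

```python
from statistics import median

def baseline_filter(historical_scores, since_days=7):
    """Build a selection filter whose threshold sits at the historical median.

    Tighten the threshold upward if you need it to strictly exceed the median,
    as the pitfall above recommends.
    """
    threshold = median(historical_scores)
    return f"score-above:{threshold:.2f},since:{since_days}d"
```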
## Programmatic equivalent
If you’d rather drive this programmatically (e.g., for a custom test harness):
```python
from whatifd.pipeline import run_pipeline
from whatifd.types.policy import TrustFloor, DecisionPolicy
from whatifd.types.manifest import RunManifest

manifest = RunManifest(
    # ... ordinary RunManifest fields ...
    experiment_shape="regression_check",
)

report = run_pipeline(
    trace_source=...,
    delta_fn=...,
    floor=TrustFloor(),
    policy=DecisionPolicy(),
    runtime=manifest,
    methodology=...,
    cache_summary=...,
)
```
`experiment_shape` is a `RunManifest` field; the CLI threads it from the config YAML into the manifest, but constructing the manifest yourself works the same way.