The workflow¶
whatifd runs a six-step loop that turns a proposed LLM change into a defensible decision.

The six steps¶
1. Observe production behavior¶
Inputs come from your tracer of choice: alerts, incidents, failed traces from Langfuse / LangSmith / Phoenix, plus an optional baseline of recently-successful traces.
In v0.2, baseline coverage is required for the failure_rescue experiment shape — selection.baseline_cohort and selection.failure_cohort are both required at config-load. There is no opt-out toggle; the dual-cohort requirement is what makes the failure-rescue verdict defensible. If you only have a baseline (no known failures), use experiment_shape: regression_check instead — that shape requires only selection.baseline_cohort (and rejects failure_cohort at config-load).
2. Define the experiment¶
whatifd fork \
--source langfuse \
--target "python:my_agent.replay:run" \
--change system_prompt=prompts/v3.txt \
--tool-cache use-original \
--score inspect_ai:faithfulness
A change can be a new system prompt (v0.1), a model swap (v0.2), or a tool / parameter override (v0.3). Selection policy and scorer are pluggable.
3. Fork and replay¶
whatifd selects matching traces from your source, applies the change, and replays each through your - -target`. Original tool outputs are reused from the cache so destructive side effects don’t re-fire.
A replay validation report is produced inline: how many traces were replayable, how many were skipped, and why.
4. Score and diff¶
Each trace produces an original and a replayed output. The scorer (Inspect AI by default) computes per-trace deltas. Aggregate stats include bootstrap confidence intervals and per-cohort breakdowns.
5. Evidence-backed verdict¶
The report has five mandatory sections: Verdict, Stats, Replay validity, Baseline integrity, Evidence. The Evidence section includes representative improvements and regressions with the judge’s rationale-numbers without rationale are not trustworthy enough to ship from.
6. Decide and ship¶
Markdown report for humans. JSON report for machines. Exit codes that reflect your declared decision policy:
Exit code |
Meaning |
|---|---|
|
Passed configured policy. |
|
Failed configured policy. |
|
Inconclusive (setup / replay / scoring failure). |
whatifd enforces your declared policy. It does not certify “safety.”
Path Z-the same loop, in CI¶
The CLI is the wedge; CI integration is the moat. The same six steps wired into a GitHub Action become a pre-merge regression gate for LLM behavior. See Path Z.
What replaces the manual workflow¶
Before whatifd, the loop was: copy traces → paste into a playground → manual scoring → eyeball the diff. Five fragmented tools, ~30 minutes per hypothesis, low decision quality.
After whatifd: one CLI command, ~5 minutes per hypothesis, evidence-backed verdict. The same loop, made into a single coherent workflow.