The workflow

whatifd runs a six-step loop that turns a proposed LLM change into a defensible decision.

[Figure: whatifd workflow]

The six steps

1. Observe production behavior

Inputs come from your tracer of choice: alerts, incidents, failed traces from Langfuse / LangSmith / Phoenix, plus an optional baseline of recently successful traces.

In v0.2, the failure_rescue experiment shape requires baseline coverage: selection.baseline_cohort and selection.failure_cohort are both required at config load. There is no opt-out toggle; the dual-cohort requirement is what makes the failure-rescue verdict defensible. If you only have a baseline (no known failures), use experiment_shape: regression_check instead. That shape requires only selection.baseline_cohort and rejects failure_cohort at config load.
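The cohort rules for the two shapes can be written down as a small check. This is a minimal sketch of the rules as stated here, not whatifd's actual config loader: the function name `validate_selection`, the string-valued cohorts, and the error messages are all hypothetical.

```python
from typing import Optional


def validate_selection(
    experiment_shape: str,
    baseline_cohort: Optional[str],
    failure_cohort: Optional[str],
) -> None:
    """Raise ValueError if the cohorts do not fit the experiment shape.

    Hypothetical helper: it mirrors the rules described in the text,
    not whatifd's real loader.
    """
    if experiment_shape == "failure_rescue":
        # Both cohorts are mandatory; there is no opt-out.
        missing = [
            name
            for name, value in (
                ("selection.baseline_cohort", baseline_cohort),
                ("selection.failure_cohort", failure_cohort),
            )
            if not value
        ]
        if missing:
            raise ValueError(f"failure_rescue requires {', '.join(missing)}")
    elif experiment_shape == "regression_check":
        # Baseline only; a failure cohort is rejected outright.
        if not baseline_cohort:
            raise ValueError("regression_check requires selection.baseline_cohort")
        if failure_cohort:
            raise ValueError("regression_check rejects selection.failure_cohort")
    else:
        # Assumption: any other shape is out of scope for this sketch.
        raise ValueError(f"unknown experiment_shape: {experiment_shape!r}")
```

Failing at load time rather than at scoring time means a misconfigured experiment never spends replay budget before being rejected.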

2. Define the experiment

whatifd fork \
    --source langfuse \
    --target "python:my_agent.replay:run" \
    --change system_prompt=prompts/v3.txt \
    --tool-cache use-original \
    --score inspect_ai:faithfulness

A change can be a new system prompt (v0.1), a model swap (v0.2), or a tool / parameter override (v0.3). Selection policy and scorer are pluggable.
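The `--target` value in the command above reads like a `python:` prefix followed by a `module:attribute` reference. The source does not spell out the format, so the following is only a sketch of how such a string could be resolved to a callable under that assumption; `resolve_target` is a hypothetical name, not part of whatifd.

```python
import importlib
from typing import Any, Callable


def resolve_target(spec: str) -> Callable[..., Any]:
    """Resolve a 'python:<module>:<attribute>' string to a callable.

    Assumed format, inferred from the example command only.
    """
    scheme, _, rest = spec.partition(":")
    module_name, sep, attr = rest.partition(":")
    if scheme != "python" or not sep or not module_name or not attr:
        raise ValueError(f"expected 'python:<module>:<attribute>', got {spec!r}")
    module = importlib.import_module(module_name)
    target = getattr(module, attr)
    if not callable(target):
        raise TypeError(f"{spec!r} does not point at a callable")
    return target
```

Under this reading, `python:my_agent.replay:run` would import `my_agent.replay` and return its `run` attribute.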

3. Fork and replay

whatifd selects matching traces from your source, applies the change, and replays each through your `--target`. Original tool outputs are reused from the cache so destructive side effects don’t re-fire.
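To illustrate why cached tool outputs keep side effects from re-firing, here is a minimal sketch of a replay cache keyed on tool name and arguments. The class `ToolCallCache`, the record layout (`tool`, `args`, `output`), and the raise-on-miss behavior are assumptions for illustration; whatifd's internal cache may work differently.

```python
import json
from typing import Any, Dict, Iterable, Mapping


class ToolCallCache:
    """Serve recorded tool outputs instead of invoking the real tool."""

    def __init__(self, recorded_calls: Iterable[Mapping[str, Any]]) -> None:
        # Each record is assumed to look like:
        # {"tool": "send_email", "args": {...}, "output": ...}
        self._outputs: Dict[str, Any] = {
            self._key(call["tool"], call["args"]): call["output"]
            for call in recorded_calls
        }

    @staticmethod
    def _key(tool: str, args: Mapping[str, Any]) -> str:
        # Sort keys so argument order does not change the cache key.
        return f"{tool}:{json.dumps(args, sort_keys=True)}"

    def call(self, tool: str, args: Mapping[str, Any]) -> Any:
        key = self._key(tool, args)
        if key not in self._outputs:
            # The real tool is never called; a miss surfaces as an error
            # so the caller can mark the trace as not replayable.
            raise KeyError(f"no recorded output for {tool} with args {dict(args)}")
        return self._outputs[key]
```

Because the real tool is never invoked, a replayed `send_email` call returns the recorded result without sending anything.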

A replay validation report is produced inline: how many traces were replayable, how many were skipped, and why.
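The counts in such a report amount to a tally over per-trace outcomes. A minimal sketch, assuming each outcome carries an optional skip reason; `ReplayResult`, `summarize_replay`, and the reason strings are hypothetical names, not whatifd's report schema.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Any, Dict, Iterable, Optional


@dataclass
class ReplayResult:
    trace_id: str
    skip_reason: Optional[str] = None  # None means the trace was replayed


def summarize_replay(results: Iterable[ReplayResult]) -> Dict[str, Any]:
    """Count replayed and skipped traces, grouping skips by reason."""
    results = list(results)
    reasons = Counter(r.skip_reason for r in results if r.skip_reason is not None)
    skipped = sum(reasons.values())
    return {
        "total": len(results),
        "replayed": len(results) - skipped,
        "skipped": skipped,
        "skip_reasons": dict(reasons),
    }
```

Reporting the reasons alongside the counts matters because a verdict computed on a small replayable subset may not represent the cohort.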

4. Score and diff

Each trace produces an original and a replayed output. The scorer (Inspect AI by default) computes per-trace deltas. Aggregate stats include bootstrap confidence intervals and per-cohort breakdowns.
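As an illustration of a bootstrap confidence interval over per-trace deltas, here is a minimal percentile-bootstrap sketch using only the standard library. It assumes a delta is the replayed score minus the original score; the function name, the resample count, and the choice of the percentile method are illustrative and may differ from what whatifd computes.

```python
import random
from statistics import mean
from typing import Sequence, Tuple


def bootstrap_mean_ci(
    deltas: Sequence[float],
    n_resamples: int = 2000,
    confidence: float = 0.95,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile-bootstrap confidence interval for the mean delta."""
    if not deltas:
        raise ValueError("need at least one per-trace delta")
    rng = random.Random(seed)  # seeded so the interval is reproducible
    n = len(deltas)
    # Resample with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(deltas, k=n)) for _ in range(n_resamples))
    alpha = (1.0 - confidence) / 2.0
    lower = means[int(alpha * n_resamples)]
    upper = means[min(n_resamples - 1, int((1.0 - alpha) * n_resamples))]
    return lower, upper


# Hypothetical deltas: replayed score minus original score, one per trace.
deltas = [0.10, 0.20, -0.05, 0.30, 0.15, 0.00, 0.25, 0.10]
low, high = bootstrap_mean_ci(deltas)
```

An interval that straddles zero suggests the change has not clearly helped or hurt on this sample, which is the kind of case a decision policy needs to handle explicitly.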

5. Evidence-backed verdict

The report has five mandatory sections: Verdict, Stats, Replay validity, Baseline integrity, Evidence. The Evidence section includes representative improvements and regressions with the judge’s rationale; numbers without rationale are not trustworthy enough to ship from.

6. Decide and ship

Markdown report for humans. JSON report for machines. Exit codes that reflect your declared decision policy:

Exit code	Meaning
`0`	Passed configured policy.
`1`	Failed configured policy.
`2`	Inconclusive (setup / replay / scoring failure).

whatifd enforces your declared policy. It does not certify “safety.”
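A calling script can branch on these codes directly. The sketch below takes the three values from the table above; the names `WhatifdExit` and `should_block_merge`, and the default of treating an inconclusive run as blocking, are assumptions about how a team might wire the codes into a gate, not behavior defined by whatifd.

```python
from enum import IntEnum


class WhatifdExit(IntEnum):
    PASSED = 0        # passed configured policy
    FAILED = 1        # failed configured policy
    INCONCLUSIVE = 2  # setup / replay / scoring failure


def should_block_merge(returncode: int, block_on_inconclusive: bool = True) -> bool:
    """Decide whether a gate should block, given the process exit code.

    Raises ValueError for codes outside the documented table.
    """
    outcome = WhatifdExit(returncode)
    if outcome is WhatifdExit.PASSED:
        return False
    if outcome is WhatifdExit.INCONCLUSIVE:
        # Assumption: whether to block here is the caller's policy choice.
        return block_on_inconclusive
    return True
```

Keeping inconclusive separate from failed lets a pipeline retry infrastructure problems without recording them as regressions.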

Path Z: the same loop, in CI

The CLI is the wedge; CI integration is the moat. The same six steps wired into a GitHub Action become a pre-merge regression gate for LLM behavior. See Path Z.

What replaces the manual workflow

Before whatifd, the loop was: copy traces → paste into a playground → manual scoring → eyeball the diff. Five fragmented tools, ~30 minutes per hypothesis, low decision quality.

After whatifd: one CLI command, ~5 minutes per hypothesis, evidence-backed verdict. The same loop, made into a single coherent workflow.