Config reference
whatifd fork is config-file driven. The default path is whatifd.config.yaml in the current directory; override it with --config <path>. YAML and JSON are both accepted. The schema is strict (Pydantic v2 extra="forbid"): unknown fields fail at load time with a hint message rather than being silently absorbed.
Full example
source:
  adapter: langfuse # or "stub" for offline use
target:
  runner: "python:my_agent.replay:run"
selection:
  failure_cohort:
    limit: 20
    filter: "score-below:0.6,since:24h" # optional; adapter-interpreted
  baseline_cohort:
    limit: 20
    filter: "score-above:0.8,since:24h"
change:
  system_prompt: "You are a senior on-call SRE..."
  # model: "claude-haiku-4-5" # optional model swap
scorer:
  adapter: inspect_ai # or "stub"
  cache_mode: auto # auto | on | off | read_only | refresh
decision:
  require_baseline: true
  max_baseline_regression_ratio: 0.10
  min_failure_improvement_ratio: 0.50
  practical_delta_epsilon: 0.05
  # max_ci_width: 0.5 # optional cap on CI width
reporting:
  profile: default # default | review | minimal | forensic
timeouts:
  replay_seconds: 60.0
  score_seconds: 30.0
The five required top-level sections are source, target, selection, change, scorer. The decision, reporting, and timeouts sections are required as keys but every field has a default, so an empty decision: {} block is enough.
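As a rough illustration of the top-level contract described above, the sketch below emulates it with plain dictionaries and the standard library only. It is not whatifd's loader (which relies on Pydantic); the name check_top_level is hypothetical, and treating experiment_shape as an optional top-level key is an assumption.

```python
# Stdlib-only emulation of a strict top-level key check (illustrative only).
REQUIRED_SECTIONS = ("source", "target", "selection", "change", "scorer")
# Required as keys, but an empty block such as `decision: {}` is enough.
DEFAULTED_SECTIONS = ("decision", "reporting", "timeouts")
# Assumption: experiment_shape may be omitted at the top level.
OPTIONAL_KEYS = ("experiment_shape",)


def check_top_level(config: dict) -> list:
    """Return human-readable problems; an empty list means the keys look fine."""
    known = set(REQUIRED_SECTIONS) | set(DEFAULTED_SECTIONS) | set(OPTIONAL_KEYS)
    problems = []
    # Every section must be present as a key, even if its body is empty.
    for key in REQUIRED_SECTIONS + DEFAULTED_SECTIONS:
        if key not in config:
            problems.append(f"missing section: {key}")
    # Mirror extra="forbid": anything not in the known set is rejected.
    for key in config:
        if key not in known:
            problems.append(f"unknown field: {key}")
    return problems
```

A misspelled section name shows up twice in the result: once as the missing section and once as the unknown field.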
experiment_shape
Top-level field (v0.2). Controls which experiment shape the verdict pipeline runs.
| Value | When to use | Selection requirements |
|---|---|---|
| failure_rescue | You have a known-bad set of traces plus a proposed fix; verdict says whether the fix rescues failures without regressing the baseline. | Requires both failure_cohort and baseline_cohort. |
| regression_check | You have a known-good baseline plus a candidate change; verdict says whether the change introduces regressions. | Requires only baseline_cohort. |
The verdict-policy guard chain branches on this field: regression_check skips the failure-cohort-specific guards (practical_delta, improvement_observation) and runs the lean primary_endpoint + ci_availability chain. Floor + decision policy still apply.
Unknown values (e.g., exploratory_ab) fail at config-load with a Pydantic ValidationError naming the field.
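The relationship between experiment_shape and the selection block (failure_rescue needs both cohorts, regression_check only the baseline cohort) can be sketched as a small lookup. This is a minimal sketch, not whatifd's validator: missing_cohorts is a hypothetical helper, and it raises ValueError as a stand-in for the Pydantic ValidationError that the real loader produces.

```python
# Which cohorts each experiment shape needs, per this page.
REQUIRED_COHORTS = {
    "failure_rescue": ("failure_cohort", "baseline_cohort"),
    "regression_check": ("baseline_cohort",),
}


def missing_cohorts(experiment_shape: str, selection: dict) -> list:
    """Return the cohort blocks that the given shape requires but selection lacks."""
    if experiment_shape not in REQUIRED_COHORTS:
        # Stand-in for the load-time validation failure on unknown values.
        raise ValueError(f"experiment_shape: unknown value {experiment_shape!r}")
    return [
        name
        for name in REQUIRED_COHORTS[experiment_shape]
        if name not in selection
    ]
```

For example, a selection block containing only baseline_cohort passes for regression_check but reports failure_cohort as missing for failure_rescue.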
source
Trace source adapter reference.
| Field | Type | Default | Notes |
|---|---|---|---|
| adapter | string | required | Adapter name. This page uses langfuse, stub, and phoenix. |
Phoenix / OpenInference example
source:
  adapter: phoenix
  spans_provider: "python:my_pkg.phoenix_wiring:get_spans" # required
spans_provider is a python:<module>:<attr> reference to a Callable[[], Iterable[dict]] that yields OpenInference-shaped span dicts. The adapter is tracer-neutral: the callable can wrap arize-phoenix-client, read a JSONL dump, or pull from any OTLP destination. See Arize Phoenix / OpenInference for wiring examples.
target
The user-supplied runner.
| Field | Type | Default | Notes |
|---|---|---|---|
| runner | string | required | Runner reference, e.g. python:my_agent.replay:run. |
selection
Per-cohort selection limits. Required sub-blocks depend on experiment_shape: failure_rescue (the v0.1 default) requires both failure_cohort and baseline_cohort, because the failure-rescue verdict needs paired evidence under cardinal-#10. regression_check (added in v0.2) requires only baseline_cohort; the failure cohort is meaningless when the baseline itself is what’s under test.
| Path | Type | Default | Notes |
|---|---|---|---|
| failure_cohort.limit | int ≥ 1 | required | Max traces in the failure cohort. |
| failure_cohort.filter | string | none | Adapter-interpreted selector (e.g., Langfuse score-below:0.6,since:24h). |
| baseline_cohort.limit | int ≥ 1 | required | Max traces in the baseline cohort. |
| baseline_cohort.filter | string | none | Adapter-interpreted selector. |
change
The proposed change. Mirrors whatifd.contract.ReplayConfig keys.
| Field | Type | Default | Notes |
|---|---|---|---|
| system_prompt | string | none | The proposed system-prompt change. |
| model | string | none | Optional model swap (e.g., claude-haiku-4-5). |
whatifd supports system_prompt and model. Other dimensions (tool list, temperature) remain on the roadmap.
scorer
Scorer adapter reference.
| Field | Type | Default | Notes |
|---|---|---|---|
| adapter | enum | required | e.g., inspect_ai or stub. |
| score_fn | string or null |  |  |
| judge_provider | string or null |  | LLM judge provider (e.g., …). |
| judge_model_id | string or null |  | Judge model identifier (e.g., …). |
|  | string or null |  | Optional snapshot pin (e.g., …). |
| rubric_id | string or null |  | Human-named rubric identifier. Required when adapter: inspect_ai. |
| rubric_text | string or null |  | Literal rubric text; hashed into cache keys so a rubric edit invalidates entries. Required when adapter: inspect_ai. |
|  | dict[str, JSON-primitive] |  | Optional knobs (temperature, max_tokens, …) passed through to the scorer. Values must be JSON primitives. |
| cache_mode | enum |  | One of auto, on, off, read_only, refresh. |
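To see why hashing the rubric text into the cache key invalidates entries on a rubric edit, consider the sketch below. It is an illustration under stated assumptions, not whatifd's key format: the choice of SHA-256, the other key components (trace_id, judge_model_id), and both function names are hypothetical.

```python
import hashlib


def rubric_fingerprint(rubric_text: str) -> str:
    """Hash the literal rubric text; any edit changes the digest."""
    return hashlib.sha256(rubric_text.encode("utf-8")).hexdigest()


def cache_key(trace_id: str, judge_model_id: str, rubric_text: str) -> str:
    """Combine hypothetical key parts with the rubric fingerprint."""
    parts = (trace_id, judge_model_id, rubric_fingerprint(rubric_text))
    # A separator that cannot appear in the hex digest keeps parts distinct.
    joined = "\x1f".join(parts)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()
```

Two calls with identical inputs produce the same key, while changing even one character of the rubric text produces a different key, so previously cached scores are no longer looked up.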
When adapter: inspect_ai, the five required fields (score_fn, judge_provider, judge_model_id, rubric_id, rubric_text) are enforced by a Pydantic cross-field validator at config-load time; misconfigured runs fail with a named-field error before any adapter machinery starts up. When adapter: stub, the inspect_ai-specific fields are silently accepted (so a config block can be retargeted from stub to inspect_ai with one keystroke during development).
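The cross-field rule in the previous paragraph can be approximated in a few lines. This is a stdlib sketch of the logic only; whatifd enforces it with a Pydantic validator, and missing_scorer_fields is a hypothetical name.

```python
# Fields that must be set when the scorer adapter is inspect_ai.
INSPECT_AI_REQUIRED = (
    "score_fn",
    "judge_provider",
    "judge_model_id",
    "rubric_id",
    "rubric_text",
)


def missing_scorer_fields(scorer: dict) -> list:
    """Return required fields that are absent or null for the chosen adapter."""
    if scorer.get("adapter") != "inspect_ai":
        # For other adapters (e.g., stub) these fields are accepted but not required.
        return []
    return [name for name in INSPECT_AI_REQUIRED if scorer.get(name) is None]
```

A stub block yields an empty list whether or not the judge fields are filled in, which is what allows the same block to be retargeted to inspect_ai later.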
decision
Above-floor policy thresholds. These mirror whatifd.types.policy.DecisionPolicy.
| Field | Type | Default | Notes |
|---|---|---|---|
| require_baseline | bool |  | Floor refuses runs without a baseline cohort. |
| max_baseline_regression_ratio | float | 0.10 | If more than 10% of baseline traces regress, verdict is … |
| min_failure_improvement_ratio | float | 0.50 | If less than 50% of failure-cohort traces improve, the change isn’t rescuing → … |
| practical_delta_epsilon | float ≥ 0 |  | Minimum effect size to call a delta “practical.” Below this is “no meaningful change.” |
| max_ci_width | float ≥ 0 or null |  | Optional cap on CI width. The lever for accepting wider CIs without flipping … |
The trust floor (which cannot be overridden by config — see cardinal #2) sits structurally below this. Floor failures produce Inconclusive regardless of policy.
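The two ratio thresholds amount to simple arithmetic over per-cohort counts. The sketch below works through that arithmetic using the 0.10 and 0.50 values from the full example. It is only an illustration: check_ratios and Thresholds are hypothetical names, the strict comparisons follow the "more than" and "less than" wording on this page, both totals are assumed to be at least 1, and the sketch does not model the trust floor, the CI checks, or the verdict labels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # Values taken from the full example on this page.
    max_baseline_regression_ratio: float = 0.10
    min_failure_improvement_ratio: float = 0.50


def check_ratios(
    baseline_regressed: int,
    baseline_total: int,
    failure_improved: int,
    failure_total: int,
    thresholds: Thresholds = Thresholds(),
) -> dict:
    """Compare observed ratios with the policy thresholds (totals must be >= 1)."""
    regression_ratio = baseline_regressed / baseline_total
    improvement_ratio = failure_improved / failure_total
    return {
        "regression_ratio": regression_ratio,
        "improvement_ratio": improvement_ratio,
        # Strictly more than the cap counts as too many regressions.
        "baseline_regression_exceeded": (
            regression_ratio > thresholds.max_baseline_regression_ratio
        ),
        # Strictly less than the minimum counts as insufficient rescue.
        "failure_improvement_insufficient": (
            improvement_ratio < thresholds.min_failure_improvement_ratio
        ),
    }
```

With 3 of 20 baseline traces regressing, the ratio is 0.15, above the 0.10 cap; with 12 of 20 failure traces improving, the ratio is 0.60, which clears the 0.50 minimum.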
reporting
| Field | Type | Default | Notes |
|---|---|---|---|
| profile | enum |  | One of default, review, minimal, forensic. |
|  | block | none | Required when … |
|  | string | required (in block) | Operator ID. |
|  | ISO 8601 string | required (in block) | Date or datetime. |
|  | string | required (in block) | Free-text justification recorded in the manifest. |
timeouts
| Field | Type | Default | Notes |
|---|---|---|---|
| replay_seconds | float > 0 |  | Per-trace replay wall-clock budget. Exceeded → … |
| score_seconds | float > 0 |  | Per-trace scoring budget. |
Validation hints
Configs are loaded by whatifd.config.load_config(...). On validation failure, format_validation_errors translates Pydantic’s stack-trace-flavored messages into multi-line, per-error output with suggestions for the most common typos (e.g., forensic_ackn0wledgment → “did you mean forensic_acknowledgment?”). When no hint applies, the raw Pydantic message is shown instead, so operators see useful output either way.
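A "did you mean" suggestion of this kind can be produced with the standard library's difflib. The sketch below is one possible approach and is not claimed to be how format_validation_errors works internally; suggest_field, format_hint, and the 0.8 similarity cutoff are all assumptions for illustration.

```python
import difflib
from typing import Optional


def suggest_field(unknown: str, known_fields: list) -> Optional[str]:
    """Return the closest known field name, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(unknown, known_fields, n=1, cutoff=0.8)
    return matches[0] if matches else None


def format_hint(unknown: str, known_fields: list) -> str:
    """Build a one-line message, adding a suggestion only when one exists."""
    suggestion = suggest_field(unknown, known_fields)
    if suggestion is None:
        # No close match: fall back to a plain message.
        return f"unknown field: {unknown}"
    return f"unknown field: {unknown} (did you mean {suggestion}?)"
```

A near-miss such as forensic_ackn0wledgment differs from forensic_acknowledgment by a single character and is matched, while an unrelated key gets the plain message with no suggestion.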