Runner contract
A trace from your tracer contains the inputs, outputs, spans, tool calls, metadata, scores, and prompts of a single agent run. It is not executable. To replay an agent with a modified config, whatifd needs an executable boundary that you supply.
That boundary is the runner contract.
The contract
# my_agent/replay.py
from whatifd.contract import TraceInput, ReplayConfig, ToolCache, ReplayOutput


def run(
    trace_input: TraceInput,
    config: ReplayConfig,
    tool_cache: ToolCache,
) -> ReplayOutput:
    """Re-execute the agent for a single trace with the proposed config change."""
    # build_agent is your own agent factory, not part of whatifd;
    # import it from wherever your agent code lives.
    agent = build_agent(
        system_prompt=config.system_prompt,
        tool_cache=tool_cache,
    )
    text = agent.run(trace_input.user_message)
    return ReplayOutput(text=text)
Then point `--target` at it:
whatifd fork --target "python:my_agent.replay:run" ...
What the runner is responsible for
Only one thing: produce a fresh ReplayOutput for a given input + modified config. That’s it.
The runner receives:
`TraceInput` - the original user input recovered from the trace.
`ReplayConfig` - the proposed change applied verbatim.
`ToolCache` - cached tool outputs from the original trace; look these up before calling tools live, so replay doesn’t re-fire side effects.
The runner returns:
`ReplayOutput` - the agent’s final text response and any per-tool span data the agent collected.
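The cache-first lookup described for `ToolCache` can be sketched as follows. This section does not show the real `ToolCache` API, so the snippet uses a hypothetical stand-in (`DictToolCache`) with an assumed `get` method and an assumed keying scheme; treat it as an illustration of the pattern rather than whatifd's interface. Whether a miss may fall back to a live call depends on the cache policy in use.

```python
import json
from typing import Any, Callable, Optional


class DictToolCache:
    """Hypothetical stand-in for whatifd's ToolCache; the real API may differ."""

    def __init__(self, entries: dict[str, Any]) -> None:
        self._entries = entries

    @staticmethod
    def key(tool_name: str, args: dict[str, Any]) -> str:
        # Assumed keying scheme: tool name plus canonically serialized args.
        return f"{tool_name}:{json.dumps(args, sort_keys=True)}"

    def get(self, tool_name: str, args: dict[str, Any]) -> Optional[Any]:
        return self._entries.get(self.key(tool_name, args))


def call_tool(
    cache: DictToolCache,
    tool_name: str,
    args: dict[str, Any],
    live_call: Callable[..., Any],
) -> Any:
    """Return the cached output if present; otherwise call the live tool."""
    cached = cache.get(tool_name, args)
    if cached is not None:
        return cached  # no side effects are re-fired
    return live_call(**args)
```

Wrapping every tool in a helper like `call_tool` keeps the replay path free of side effects whenever the original trace already recorded the tool's output.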
What whatifd is responsible for
Everything else:
Pulling the original trace from your tracer.
Constructing the cohort labels (failure vs baseline).
Owning the original output from the trace.
Constructing the comparison unit (ScoreCase).
Running the scorer against original vs replayed.
Computing per-trace deltas, aggregate stats, bootstrap CIs.
Selecting the representative evidence cases.
Producing the verdict and the report.
This separation is deliberate. The user runner only does the part that requires knowledge of your agent. Everything that’s the same across projects stays inside whatifd.
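To make the "per-trace deltas, aggregate stats, bootstrap CIs" step concrete, here is a minimal sketch of a percentile bootstrap over per-trace score deltas. It is not whatifd's implementation: the function name, the resample count, and the toy scores are all assumptions chosen for illustration.

```python
import random
from statistics import mean


def bootstrap_ci(deltas, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-trace score deltas."""
    if not deltas:
        raise ValueError("need at least one delta")
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    n = len(deltas)
    # Resample the deltas with replacement and record each resample's mean.
    means = sorted(mean(rng.choices(deltas, k=n)) for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi


# Per-trace delta = replayed score minus original score (toy numbers).
original = [0.2, 0.4, 0.1, 0.5, 0.3]
replayed = [0.6, 0.5, 0.4, 0.7, 0.3]
deltas = [r - o for o, r in zip(original, replayed)]
low, high = bootstrap_ci(deltas)
print(f"mean delta={mean(deltas):.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

An interval that excludes zero suggests the config change moved scores in a consistent direction; with only a handful of traces the interval will be wide.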
Non-Python runners: the exec: lane
python:<module>:<attr> is the default, but your agent doesn’t have to be in Python. The exec: lane (whatifd-exec/1) runs your replay entry point as a child process that speaks a small line-buffered NDJSON protocol over stdin/stdout — so any language satisfies the same runner contract in ~50 lines, no SDK:
target:
  runner: "exec:node ./replay-agent.js"  # any argv; POSIX-only in v1
The child handshakes (hello), then for each trace receives a replay_request and returns a replay_response whose output mirrors ReplayOutput exactly — validated through the same schema as the Python lane, so Sensitive[T] enforcement is identical. Cache keying stays in whatifd: when the child needs a cached tool output it sends a tool_lookup and the parent answers from the canonical cache (the guest never re-implements keying). Validate a runner before a full run with whatifd exec-check:
whatifd exec-check "exec:node ./replay-agent.js"
A zero-dependency Node reference runner ships in the repo at examples/exec_agent_node/. The full wire contract is in the exec: runner spec.
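The shape of a line-buffered NDJSON child can be sketched as below. Only the message names (`hello`, `replay_request`, `replay_response`) and the protocol label `whatifd-exec/1` come from this page; the field names (`type`, `protocol`, `output`, `trace_id`), the handshake payload, and the assumption that the child speaks first are illustrative guesses, and `tool_lookup` is omitted. The exec: runner spec remains the authoritative wire contract.

```python
import io
import json


def serve(stdin, stdout, replay):
    """Minimal NDJSON loop: one JSON object per line in, one per line out."""

    def send(message):
        stdout.write(json.dumps(message) + "\n")
        stdout.flush()  # flush after every message so the parent never stalls

    # Assumed handshake shape; the real hello payload is defined by the spec.
    send({"type": "hello", "protocol": "whatifd-exec/1"})
    for line in stdin:
        line = line.strip()
        if not line:
            continue
        request = json.loads(line)
        if request.get("type") == "replay_request":
            text = replay(request)
            # "output" is assumed to mirror ReplayOutput(text=...).
            send({"type": "replay_response", "output": {"text": text}})


# Drive the loop with in-memory streams instead of a real parent process.
incoming = io.StringIO(
    json.dumps({"type": "replay_request", "trace_id": "t1"}) + "\n"
)
outgoing = io.StringIO()
serve(incoming, outgoing, replay=lambda req: f"replayed {req['trace_id']}")
messages = [json.loads(raw) for raw in outgoing.getvalue().splitlines()]
```

A real runner would pass `sys.stdin` and `sys.stdout` to `serve` and should be validated with `whatifd exec-check` before a full run.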
The score case
Internally, the unit handed to scorers is:
# whatifd-internal; users never construct this themselves.
ScoreCase(
    trace_id: str,
    cohort: Literal["failure", "baseline"],
    input: TraceInput,
    original_output: TraceOutput,  # owned by whatifd from the trace
    replayed_output: ReplayOutput,  # produced by user runner
    metadata: dict,
)
You don’t construct ScoreCase yourself. It’s exposed in the public API only so that custom scorer plugins (a v0.3+ surface) have a clear type to consume.
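As a rough illustration of how a scorer might consume such a unit, the sketch below mirrors the listed fields in a stand-in dataclass and computes a per-case delta. Everything here is hypothetical: `ScoreCaseSketch`, `Output`, and `keyword_scorer` are invented names, the input is simplified to a string, and the actual scorer plugin interface is not described in this section.

```python
from dataclasses import dataclass, field
from typing import Literal


@dataclass
class Output:
    """Stand-in for TraceOutput / ReplayOutput; only `text` is assumed here."""
    text: str


@dataclass
class ScoreCaseSketch:
    """Illustrative mirror of the ScoreCase fields; not the whatifd class."""
    trace_id: str
    cohort: Literal["failure", "baseline"]
    input: str  # simplified; the real field is a TraceInput
    original_output: Output
    replayed_output: Output
    metadata: dict = field(default_factory=dict)


def keyword_scorer(case: ScoreCaseSketch, keyword: str) -> float:
    """Toy scorer: replayed score minus original score, each 0.0 or 1.0."""
    original = float(keyword in case.original_output.text)
    replayed = float(keyword in case.replayed_output.text)
    return replayed - original


case = ScoreCaseSketch(
    trace_id="t1",
    cohort="failure",
    input="Where is my refund?",
    original_output=Output(text="I cannot help with that."),
    replayed_output=Output(text="Refund issued to your card."),
)
delta = keyword_scorer(case, "Refund")
```

A positive delta means the replayed output satisfied the check where the original did not.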
Replay strategies considered
| Option | Trade-off | Decision |
|---|---|---|
| A. User-supplied target (chosen) | Most general. Works for raw SDKs, LangChain, LangGraph, OpenAI Assistants. | v0.1 ships this. |
| B. Framework-specific replay | Easier for | Rejected: too narrow. |
| C. LLM-call replay only | Easier still, but breaks the agent-replay pitch. | Rejected: undersells the product. |
Reference adapters
| Adapter | Status |
|---|---|
| Raw SDK / minimal agent (Anthropic) | v0.1: complete |
| LangChain | v0.1.1: stub + docs |
| LangGraph | v0.1.1: stub + docs |
Reference adapters are deliberately small (~30 lines each for the simplest case). They prove the contract without fighting framework abstractions.
Cache policies
| Policy | When to use | Notes |
|---|---|---|
| | Most workloads. Tools that return stable data given the same args. | Strict: if the cache misses, the trace is recorded as a replay failure, surfacing in the report’s Replay validity section. |
| | Tools that return time-sensitive data (current weather, current price). Per-tool allowlist. | v0.3, opt-in. |
| | Tests; experimentation with deliberately swapped tool outputs. | v0.3+. |
The strict default is intentional: the report’s Replay validity section makes cache misses visible rather than papering over them.
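The strict behaviour can be pictured with a small sketch in which a cache miss is recorded against the trace instead of being silently replaced by a live call. The names `CacheMiss`, `strict_lookup`, and `replay_all`, and the plain-dict cache, are hypothetical and only illustrate the bookkeeping; they are not whatifd's API.

```python
class CacheMiss(Exception):
    """Hypothetical error raised when a strict lookup finds no cached output."""


def strict_lookup(cache: dict, key: str):
    """Return the cached value or raise; never falls back to a live call."""
    if key not in cache:
        raise CacheMiss(key)
    return cache[key]


def replay_all(traces, cache):
    """Replay each trace; a cache miss marks the trace as a replay failure."""
    results, failures = {}, {}
    for trace_id, tool_key in traces:
        try:
            results[trace_id] = strict_lookup(cache, tool_key)
        except CacheMiss as miss:
            # Recorded rather than hidden, so a report can surface it.
            failures[trace_id] = f"cache miss: {miss}"
    return results, failures


cache = {"search:cats": "3 results"}
results, failures = replay_all(
    [("t1", "search:cats"), ("t2", "search:dogs")], cache
)
```

Keeping failures in their own mapping is what lets a report list them separately instead of folding them into the scored results.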