Runner contract¶
A trace from your tracer contains the inputs, outputs, spans, tool calls, metadata, scores, and prompts of a single agent run. It is not executable. To replay an agent with a modified config, whatif needs an executable boundary that you supply.
That boundary is the runner contract.
The contract¶
# my_agent/replay.py
from whatif.contract import TraceInput, ReplayConfig, ToolCache, ReplayOutput

def run(
    trace_input: TraceInput,
    config: ReplayConfig,
    tool_cache: ToolCache,
) -> ReplayOutput:
    """Re-execute the agent for a single trace with the proposed config change."""
    agent = build_agent(
        system_prompt=config.system_prompt,
        tool_cache=tool_cache,
    )
    text = agent.run(trace_input.user_message)
    return ReplayOutput(text=text)
Then point `--target` at it:
whatif fork --target "python:my_agent.replay:run" ...
What the runner is responsible for¶
Only one thing: produce a fresh ReplayOutput for a given input + modified config. That’s it.
The runner receives:
`TraceInput` - the original user input recovered from the trace.
`ReplayConfig` - the proposed change, applied verbatim.
`ToolCache` - cached tool outputs from the original trace; look these up before calling tools live (sketched below), so replay doesn't re-fire side effects.
The runner returns:
`ReplayOutput` - the agent's final text response and any per-tool span data the agent collected.
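As an illustration of the ToolCache point above, a runner might wrap each live tool in a cache-first lookup. The `get`-style lookup method shown here is an assumption made for this sketch, not a documented whatif API.

# Sketch only: cache-first tool execution inside a runner.
# ToolCache.get and its (tool name, args) signature are assumptions,
# not documented whatif APIs.
from whatif.contract import ToolCache

def cached_tool(tool_cache: ToolCache, name: str, live_fn):
    def call(**args):
        hit = tool_cache.get(name, args)  # assumed lookup; None on a miss
        if hit is not None:
            return hit                    # replay the recorded tool output
        return live_fn(**args)            # live fallback, subject to the cache policy
    return call

Tools wrapped this way can then be handed to build_agent, so replay reuses recorded outputs instead of re-firing side effects.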
What whatif is responsible for¶
Everything else:
Pulling the original trace from your tracer.
Constructing the cohort labels (failure vs baseline).
Owning the original output from the trace.
Constructing the comparison unit (ScoreCase).
Running the scorer against original vs. replayed output.
Computing per-trace deltas, aggregate stats, bootstrap CIs.
Selecting the representative evidence cases.
Producing the verdict and the report.
This separation is deliberate. The user runner only does the part that requires knowledge of your agent. Everything that’s the same across projects stays inside whatif.
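To make the statistics item above concrete, the per-trace deltas and bootstrap confidence intervals whatif computes have roughly the shape sketched here. This is an illustration of the idea only, not whatif's implementation.

# Illustrative only: the rough shape of whatif's delta + bootstrap CI math.
import random

def per_trace_deltas(original_scores, replayed_scores):
    # One delta per trace: replayed score minus original score.
    return [r - o for o, r in zip(original_scores, replayed_scores)]

def bootstrap_ci(deltas, n_resamples=1000, alpha=0.05):
    # Percentile bootstrap over the mean delta.
    means = []
    for _ in range(n_resamples):
        sample = [random.choice(deltas) for _ in deltas]
        means.append(sum(sample) / len(sample))
    means.sort()
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi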
The score case¶
Internally, the unit handed to scorers is:
# whatif-internal; users never construct this themselves.
ScoreCase(
    trace_id: str,
    cohort: Literal["failure", "baseline"],
    input: TraceInput,
    original_output: TraceOutput,   # owned by whatif, from the trace
    replayed_output: ReplayOutput,  # produced by the user runner
    metadata: dict,
)
You don’t construct ScoreCase yourself. It’s exposed in the public API only so that custom scorer plugins (v0.2+) have a clear type to consume.
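As an example of that consumer, a custom scorer plugin would take a ScoreCase and score both outputs. The callable shape, the metadata key, and the .text attribute access below are guesses at a plausible v0.2+ interface, not a committed API.

# Hypothetical scorer plugin. The plugin signature, the metadata key, and the
# .text attributes on the outputs are assumptions, not a finalized API.
def exact_match_scorer(case) -> dict:
    expected = case.metadata.get("expected_answer", "")  # assumed metadata key
    return {
        "original": float(expected in case.original_output.text),
        "replayed": float(expected in case.replayed_output.text),
    }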
Replay strategies considered¶
| Option | Trade-off | Decision |
|---|---|---|
| A. User-supplied target (chosen) | Most general. Works for raw SDKs, LangChain, LangGraph, OpenAI Assistants. | v0.1 ships this. |
| B. Framework-specific replay | Easier for users of a single framework. | Rejected: too narrow. |
| C. LLM-call replay only | Easier still, but breaks the agent-replay pitch. | Rejected: undersells the product. |
Reference adapters¶
| Adapter | Status |
|---|---|
| Raw SDK / minimal agent (Anthropic) | v0.1 - complete |
| LangChain | v0.1.1 - stub + docs |
| LangGraph | v0.1.1 - stub + docs |
Reference adapters are deliberately small (~30 lines each for the simplest case). They prove the contract without fighting framework abstractions.
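For a sense of that scale, a minimal single-turn raw-SDK adapter could look roughly like the sketch below. The Anthropic client calls are standard SDK usage, but the model choice and the no-tool wiring are illustrative; this is not the shipped reference adapter.

# Illustrative minimal adapter for a single-turn, no-tool Anthropic agent.
import anthropic
from whatif.contract import TraceInput, ReplayConfig, ToolCache, ReplayOutput

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def run(trace_input: TraceInput, config: ReplayConfig, tool_cache: ToolCache) -> ReplayOutput:
    """Replay one trace: proposed system prompt + original user message."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # illustrative model choice
        max_tokens=1024,
        system=config.system_prompt,       # the proposed change, applied verbatim
        messages=[{"role": "user", "content": trace_input.user_message}],
    )
    return ReplayOutput(text=response.content[0].text)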
Cache policies¶
| Policy | When to use | Notes |
|---|---|---|
| Strict (default) | Most workloads. Tools that return stable data given the same args. | If the cache misses, the trace is recorded as a replay failure, surfacing in the report's Replay validity section. |
| | Tools that return time-sensitive data (current weather, current price). Per-tool allowlist. | v0.3, opt-in. |
| | Tests; experimentation with deliberately swapped tool outputs. | v0.3+. |
The strict default is intentional: the report’s Replay validity section makes cache misses visible rather than papering over them.
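One way to picture the strict behaviour in runner code, again assuming a hypothetical get method and exception name:

# Sketch of the strict default: a cache miss stops the replay instead of
# triggering a live tool call. CacheMissError and ToolCache.get are
# hypothetical names, not documented whatif APIs.
class CacheMissError(RuntimeError):
    pass

def strict_lookup(tool_cache, name, args):
    hit = tool_cache.get(name, args)  # assumed lookup; None on a miss
    if hit is None:
        raise CacheMissError(f"no cached output for {name}({args!r})")
    return hit

However the miss is detected, the trace is counted under Replay validity rather than being silently scored against a live tool call.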