Runner contract¶
A trace from your tracer contains the inputs, outputs, spans, tool calls, metadata, scores, and prompts of a single agent run. It is not executable. To replay an agent with a modified config, whatif needs an executable boundary that you supply.
That boundary is the runner contract.
The contract¶
# my_agent/replay.py
from whatif.contract import TraceInput, ReplayConfig, ToolCache, ReplayOutput

def run(
    trace_input: TraceInput,
    config: ReplayConfig,
    tool_cache: ToolCache,
) -> ReplayOutput:
    """Re-execute the agent for a single trace with the proposed config change."""
    agent = build_agent(
        system_prompt=config.system_prompt,
        tool_cache=tool_cache,
    )
    text = agent.run(trace_input.user_message)
    return ReplayOutput(text=text)
Then point `--target` at it:
whatif fork --target "python:my_agent.replay:run" ...
What the runner is responsible for¶
Only one thing: produce a fresh ReplayOutput for a given input + modified config. That’s it.
The runner receives:
`TraceInput` - the original user input recovered from the trace.
`ReplayConfig` - the proposed change, applied verbatim.
`ToolCache` - cached tool outputs from the original trace; look these up before calling tools live (sketched below), so replay doesn't re-fire side effects.
The runner returns:
`ReplayOutput` - the agent's final text response and any per-tool span data the agent collected.
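As an illustration of the ToolCache point above, a runner might wrap each live tool in a cache-first lookup. The `get`-style lookup method shown here is an assumption made for this sketch, not a documented whatif API.

# Sketch only: cache-first tool execution inside a runner.
# ToolCache.get and its (tool name, args) signature are assumptions,
# not documented whatif APIs.
from whatif.contract import ToolCache

def cached_tool(tool_cache: ToolCache, name: str, live_fn):
    def call(**args):
        hit = tool_cache.get(name, args)  # assumed lookup; None on a miss
        if hit is not None:
            return hit                    # replay the recorded tool output
        return live_fn(**args)            # live fallback, subject to the cache policy
    return call

Tools wrapped this way can then be handed to build_agent, so replay reuses recorded outputs instead of re-firing side effects.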
What whatif is responsible for¶
Everything else:
Pulling the original trace from your tracer.
Constructing the cohort labels (failure vs baseline).
Owning the original output from the trace.
Constructing the comparison unit (ScoreCase).
Running the scorer against original vs. replayed output.
Computing per-trace deltas, aggregate stats, bootstrap CIs.
Selecting the representative evidence cases.
Producing the verdict and the report.
This separation is deliberate. The user runner only does the part that requires knowledge of your agent. Everything that’s the same across projects stays inside whatif.
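To make the statistics item above concrete, the per-trace deltas and bootstrap confidence intervals whatif computes have roughly the shape sketched here. This is an illustration of the idea only, not whatif's implementation.

# Illustrative only: the rough shape of whatif's delta + bootstrap CI math.
import random

def per_trace_deltas(original_scores, replayed_scores):
    # One delta per trace: replayed score minus original score.
    return [r - o for o, r in zip(original_scores, replayed_scores)]

def bootstrap_ci(deltas, n_resamples=1000, alpha=0.05):
    # Percentile bootstrap over the mean delta.
    means = []
    for _ in range(n_resamples):
        sample = [random.choice(deltas) for _ in deltas]
        means.append(sum(sample) / len(sample))
    means.sort()
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi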
The score case¶
Internally, the unit handed to scorers is:
# whatif-internal; users never construct this themselves.
ScoreCase(
    trace_id: str,
    cohort: Literal["failure", "baseline"],
    input: TraceInput,
    original_output: TraceOutput,   # owned by whatif, from the trace
    replayed_output: ReplayOutput,  # produced by the user runner
    metadata: dict,
)
You don’t construct ScoreCase yourself. It’s exposed in the public API only so that custom scorer plugins (v0.2+) have a clear type to consume.
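As an example of that consumer, a custom scorer plugin would take a ScoreCase and score both outputs. The callable shape, the metadata key, and the .text attribute access below are guesses at a plausible v0.2+ interface, not a committed API.

# Hypothetical scorer plugin. The plugin signature, the metadata key, and the
# .text attributes on the outputs are assumptions, not a finalized API.
def exact_match_scorer(case) -> dict:
    expected = case.metadata.get("expected_answer", "")  # assumed metadata key
    return {
        "original": float(expected in case.original_output.text),
        "replayed": float(expected in case.replayed_output.text),
    }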
Replay strategies considered¶
| Option | Trade-off | Decision |
|---|---|---|
| A. User-supplied target (chosen) | Most general. Works for raw SDKs, LangChain, LangGraph, OpenAI Assistants. | v0.1 ships this. |
| B. Framework-specific replay | Easier for users of a single framework. | Rejected: too narrow. |
| C. LLM-call replay only | Easier still, but breaks the agent-replay pitch. | Rejected: undersells the product. |
Reference adapters¶
| Adapter | Status |
|---|---|
| Raw SDK / minimal agent (Anthropic) | v0.1 - complete |
| LangChain | v0.1.1 - stub + docs |
| LangGraph | v0.1.1 - stub + docs |
Reference adapters are deliberately small (~30 lines each for the simplest case). They prove the contract without fighting framework abstractions.
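For a sense of that scale, a minimal single-turn raw-SDK adapter could look roughly like the sketch below. The Anthropic client calls are standard SDK usage, but the model choice and the no-tool wiring are illustrative; this is not the shipped reference adapter.

# Illustrative minimal adapter for a single-turn, no-tool Anthropic agent.
import anthropic
from whatif.contract import TraceInput, ReplayConfig, ToolCache, ReplayOutput

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def run(trace_input: TraceInput, config: ReplayConfig, tool_cache: ToolCache) -> ReplayOutput:
    """Replay one trace: proposed system prompt + original user message."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # illustrative model choice
        max_tokens=1024,
        system=config.system_prompt,       # the proposed change, applied verbatim
        messages=[{"role": "user", "content": trace_input.user_message}],
    )
    return ReplayOutput(text=response.content[0].text)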
Cache policies¶
| Policy | When to use | Notes |
|---|---|---|
| Strict (default) | Most workloads. Tools that return stable data given the same args. | If the cache misses, the trace is recorded as a replay failure, surfacing in the report's Replay validity section. |
| | Tools that return time-sensitive data (current weather, current price). Per-tool allowlist. | v0.3, opt-in. |
| | Tests; experimentation with deliberately swapped tool outputs. | v0.3+. |
The strict default is intentional: the report’s Replay validity section makes cache misses visible rather than papering over them.
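One way to picture the strict behaviour in runner code, again assuming a hypothetical get method and exception name:

# Sketch of the strict default: a cache miss stops the replay instead of
# triggering a live tool call. CacheMissError and ToolCache.get are
# hypothetical names, not documented whatif APIs.
class CacheMissError(RuntimeError):
    pass

def strict_lookup(tool_cache, name, args):
    hit = tool_cache.get(name, args)  # assumed lookup; None on a miss
    if hit is None:
        raise CacheMissError(f"no cached output for {name}({args!r})")
    return hit

However the miss is detected, the trace is counted under Replay validity rather than being silently scored against a live tool call.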