Runner contract

A trace from your tracer contains the inputs, outputs, spans, tool calls, metadata, scores, and prompts of a single agent run. It is not executable. To replay an agent with a modified config, whatif needs an executable boundary that you supply.

That boundary is the runner contract.

The contract

```python
# my_agent/replay.py
from whatif.contract import TraceInput, ReplayConfig, ToolCache, ReplayOutput

def run(
    trace_input: TraceInput,
    config: ReplayConfig,
    tool_cache: ToolCache,
) -> ReplayOutput:
    """Re-execute the agent for a single trace with the proposed config change."""
    agent = build_agent(
        system_prompt=config.system_prompt,
        tool_cache=tool_cache,
    )
    text = agent.run(trace_input.user_message)
    return ReplayOutput(text=text)
```

Then point `--target` at it:

```shell
whatif fork --target "python:my_agent.replay:run" ...
```
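The `python:module:function` target string is the only wiring between the CLI and your code. The real resolution logic inside whatif isn't shown here, but a plausible sketch (names are illustrative) is a plain importlib lookup:

```python
# Hypothetical sketch of how a "python:module:function" target string could
# be resolved into a callable; the actual whatif CLI may differ.
import importlib

def resolve_target(target: str):
    scheme, module_path, func_name = target.split(":")
    if scheme != "python":
        raise ValueError(f"unsupported target scheme: {scheme}")
    module = importlib.import_module(module_path)
    return getattr(module, func_name)
```

The practical consequence: your runner must be importable from wherever you invoke `whatif fork`.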

What the runner is responsible for

Only one thing: produce a fresh `ReplayOutput` for a given input + modified config. That’s it.

The runner receives:

  • `TraceInput` - the original user input recovered from the trace.

  • `ReplayConfig` - the proposed change, applied verbatim.

  • `ToolCache` - cached tool outputs from the original trace; look these up before calling tools live, so replay doesn’t re-fire side effects.

The runner returns:

  • `ReplayOutput` - the agent’s final text response and any per-tool span data the agent collected.
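The cache-first lookup described for `ToolCache` can be sketched as a small dispatch wrapper. The class below is a stand-in for `whatif.contract.ToolCache` (the real type's API is not shown in this document), but the shape of the pattern is the point: check the cache, and only fall through to a live call on a miss.

```python
import json

class ToolCache:
    """Stand-in for whatif.contract.ToolCache; keys are (tool_name, canonical args)."""
    def __init__(self, entries=None):
        self._entries = entries or {}

    @staticmethod
    def _key(tool_name, args):
        # Canonicalize args so dict ordering doesn't cause spurious misses.
        return (tool_name, json.dumps(args, sort_keys=True))

    def lookup(self, tool_name, args):
        return self._entries.get(self._key(tool_name, args))

def call_tool(cache, tool_name, args, live_fn):
    """Cache-first tool dispatch: replay hits never re-fire side effects."""
    cached = cache.lookup(tool_name, args)
    if cached is not None:
        return cached
    return live_fn(**args)  # only reached on a cache miss
```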

What whatif is responsible for

Everything else:

  • Pulling the original trace from your tracer.

  • Constructing the cohort labels (failure vs baseline).

  • Owning the original output from the trace.

  • Constructing the comparison unit (ScoreCase).

  • Running the scorer against original vs replayed.

  • Computing per-trace deltas, aggregate stats, bootstrap CIs.

  • Selecting the representative evidence cases.

  • Producing the verdict and the report.

This separation is deliberate. Your runner does only the part that requires knowledge of your agent; everything that’s the same across projects stays inside whatif.

The score case

Internally, the unit handed to scorers is:

```python
# whatif-internal; users never construct this themselves.
ScoreCase(
    trace_id: str,
    cohort: Literal["failure", "baseline"],
    input: TraceInput,
    original_output: TraceOutput,         # owned by whatif from the trace
    replayed_output: ReplayOutput,        # produced by user runner
    metadata: dict,
)
```

You don’t construct ScoreCase yourself. It’s exposed in the public API only so that custom scorer plugins (v0.2+) have a clear type to consume.
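To make the consumer side concrete, here is a hypothetical scorer plugin. The v0.2+ plugin API is not final, so `exact_match_scorer` and the `reference` metadata key are illustrative assumptions; the `ScoreCase` dataclass below is a stand-in mirroring the fields shown above.

```python
from dataclasses import dataclass, field
from typing import Literal

@dataclass
class ScoreCase:
    """Stand-in mirroring the whatif-internal structure shown above."""
    trace_id: str
    cohort: Literal["failure", "baseline"]
    input: object
    original_output: object
    replayed_output: object
    metadata: dict = field(default_factory=dict)

def exact_match_scorer(case: ScoreCase) -> dict:
    """Score both outputs against a reference answer carried in metadata (assumed key)."""
    ref = case.metadata.get("reference")
    if ref is None:
        return {"original": None, "replayed": None}
    return {
        "original": float(case.original_output == ref),
        "replayed": float(case.replayed_output == ref),
    }
```

A scorer sees both outputs side by side, which is what lets whatif compute the per-trace delta downstream.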

Replay strategies considered

| Option | Trade-off | Decision |
| --- | --- | --- |
| A. User-supplied target (chosen) | Most general. Works for raw SDKs, LangChain, LangGraph, OpenAI Assistants. | v0.1 ships this. |
| B. Framework-specific replay | Easier for whatif to ship; narrows adoption to one stack. | Rejected: too narrow. |
| C. LLM-call replay only | Easier still, but breaks the agent-replay pitch. | Rejected: undersells the product. |

Reference adapters

| Adapter | Status |
| --- | --- |
| Raw SDK / minimal agent (Anthropic) | v0.1: complete |
| LangChain | v0.1.1: stub + docs |
| LangGraph | v0.1.1: stub + docs |

Reference adapters are deliberately small (~30 lines each for the simplest case). They prove the contract without fighting framework abstractions.
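In that ~30-line spirit, a minimal raw-SDK adapter might look like the sketch below. The contract types are stood in by dataclasses, and `client.complete` is an assumed interface: a thin wrapper you would write over the Anthropic SDK's messages API, injected so the same runner shape also works with a stub in tests.

```python
from dataclasses import dataclass

# Stand-ins for the whatif.contract types; a real adapter imports them instead.
@dataclass
class TraceInput:
    user_message: str

@dataclass
class ReplayConfig:
    system_prompt: str

@dataclass
class ReplayOutput:
    text: str

def make_runner(client):
    """client.complete(system, user) -> str is an assumed wrapper interface."""
    def run(trace_input: TraceInput, config: ReplayConfig, tool_cache) -> ReplayOutput:
        text = client.complete(config.system_prompt, trace_input.user_message)
        return ReplayOutput(text=text)
    return run
```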

Cache policies

| Policy | When to use | Notes |
| --- | --- | --- |
| use-original (default, v0.1) | Most workloads. Tools that return stable data given the same args. | Strict: if the cache misses, the trace is recorded as a replay failure, surfacing in the report’s Replay validity section. |
| live | Tools that return time-sensitive data (current weather, current price). Per-tool allowlist. | v0.3, opt-in. |
| mock | Tests; experimentation with deliberately swapped tool outputs. | v0.3+. |

The strict default is intentional: the report’s Replay validity section makes cache misses visible rather than papering over them.
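The strict use-original behavior can be sketched as follows. The class and exception names are illustrative, not the real whatif API; the point is that a miss fails loudly and is recorded for the report rather than silently falling through to a live call.

```python
class ReplayValidityError(Exception):
    """A cache miss under strict use-original; surfaced as a replay failure."""

class StrictToolCache:
    """Illustrative use-original policy: misses raise and are recorded."""
    def __init__(self, entries):
        self._entries = entries  # (tool_name, args_key) -> recorded output
        self.misses = []         # collected for the Replay validity section

    def fetch(self, tool_name, args_key):
        key = (tool_name, args_key)
        if key not in self._entries:
            self.misses.append(key)
            raise ReplayValidityError(f"no recorded output for {key}")
        return self._entries[key]
```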