Inspect AI

Inspect AI is whatifd’s default scorer. It’s a thoughtfully designed evaluation framework from the UK AI Security Institute (formerly the AI Safety Institute); whatifd wraps its scorers rather than reimplementing scoring from scratch. The adapter shipped in v0.1 as a programmatic-only surface; v0.2 added YAML config support via scorer.score_fn, so the entire pipeline is now reachable from whatifd.config.yaml.

How `whatifd` uses Inspect AI

For each (original_output, replayed_output) pair, whatifd:

Constructs an Inspect AI Sample (input + target).
Runs the configured Inspect scorer against the original output, then against the replayed output.
Records both scores plus the judge’s rationale.
Reports the difference between the two scores as the per-trace delta.
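
The steps above can be sketched roughly as follows. This is a minimal illustration rather than whatifd's implementation: `JudgeFn`, `ScoredPair`, `score_pair`, and the toy `overlap_judge` are hypothetical names, the Inspect AI Sample construction is left out, and the sign convention (replayed minus original) is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# Hypothetical judge signature: (output, target) -> (score in [0, 1], rationale).
JudgeFn = Callable[[str, str], Tuple[float, str]]


@dataclass(frozen=True)
class ScoredPair:
    original_score: float
    replayed_score: float
    original_rationale: str
    replayed_rationale: str

    @property
    def delta(self) -> float:
        # Assumed convention: positive means the replayed output scored higher.
        return self.replayed_score - self.original_score


def score_pair(judge: JudgeFn, original: str, replayed: str, target: str) -> ScoredPair:
    """Score both outputs with the same judge and keep both rationales."""
    o_score, o_why = judge(original, target)
    r_score, r_why = judge(replayed, target)
    return ScoredPair(o_score, r_score, o_why, r_why)


def overlap_judge(output: str, target: str) -> Tuple[float, str]:
    """Toy stand-in for an LLM judge: fraction of target words found in the output."""
    wanted = set(target.lower().split())
    found = wanted & set(output.lower().split())
    score = len(found) / len(wanted) if wanted else 0.0
    return score, f"{len(found)} of {len(wanted)} target words present"


pair = score_pair(overlap_judge, "refund denied", "refund approved today", "refund approved")
print(pair.delta)
```

Scoring both outputs with the same judge and rubric keeps the delta attributable to the change in output rather than to a change in the grader.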

Configuration

The whatifd fork command is config-file driven. The scorer block in whatifd.config.yaml looks like this:

scorer:
  adapter: inspect_ai
  score_fn: "python:my_pkg.scorers:faithfulness"   # your Inspect score function
  judge_provider: anthropic
  judge_model_id: claude-haiku-4-5
  rubric_id: faithfulness-v1
  rubric_text: "Score 0-1 by faithfulness to the original output."
  cache_mode: auto                                  # auto | on | off | read_only | refresh
  # optional:
  # judge_model_snapshot: claude-haiku-4-5-20251001
  # scoring_parameters: { temperature: 0.0, max_tokens: 1024 }

scorer.score_fn is a python:<module.path>:<attr> reference, the same shape as target.runner. The resolver imports the module and calls getattr(module, attr); the resulting callable is wired into InspectAIScorer.score_fn. The five fields score_fn, judge_provider, judge_model_id, rubric_id, and rubric_text are all required when adapter: inspect_ai is set; a misconfigured scorer block fails at config load with an error that names the offending field.
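
The resolution and required-field check described above might look roughly like this. It is a sketch under assumptions, not whatifd's code: `resolve_ref` and `check_scorer_block` are hypothetical names, and the error wording is invented for illustration.

```python
import importlib
from typing import Any, Mapping

# Required fields as listed above for adapter: inspect_ai.
REQUIRED_FIELDS = ("score_fn", "judge_provider", "judge_model_id", "rubric_id", "rubric_text")


def resolve_ref(ref: str) -> Any:
    """Resolve 'python:<module.path>:<attr>' to the object it names."""
    parts = ref.split(":")
    if len(parts) != 3 or parts[0] != "python" or not parts[1] or not parts[2]:
        raise ValueError(f"expected 'python:<module.path>:<attr>', got {ref!r}")
    _, module_path, attr = parts
    module = importlib.import_module(module_path)
    return getattr(module, attr)


def check_scorer_block(block: Mapping[str, Any]) -> None:
    """Raise an error naming the first required field that is missing or empty."""
    if block.get("adapter") != "inspect_ai":
        return
    for field in REQUIRED_FIELDS:
        if not block.get(field):
            raise ValueError(f"scorer.{field} is required when adapter is inspect_ai")


# Demonstrated with a stdlib callable so the sketch runs without my_pkg installed.
sqrt = resolve_ref("python:math:sqrt")
print(sqrt(9.0))
```

Validating the block before resolving anything means a typo in a field name surfaces at config load rather than partway through a scoring run.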

Programmatic API (still supported)

If you’d rather construct InspectAIScorer in Python (e.g., for a custom test harness or non-CLI workflow), the programmatic path is still first-class:

from inspect_ai.scorer import faithfulness
from whatifd_inspect_ai import InspectAIScorer
from whatifd.pipeline import run_pipeline

scorer = InspectAIScorer(
    score_fn=faithfulness(),
    judge_provider="anthropic",
    judge_model_id="claude-haiku-4-5",
    rubric_id="faithfulness-v1",
    rubric_text="Score 0-1 by faithfulness to the original output.",
)

report = run_pipeline(
    trace_source=...,
    delta_fn=...,
    floor=...,
    policy=...,
    runtime=...,
    methodology=...,
    cache_summary=...,
)

The CLI YAML form above and the programmatic form here construct identical InspectAIScorer instances; pick whichever fits your workflow.

Built-in scorers (v0.1)

Scorer	What it measures	When to use
`faithfulness`	Whether the response is grounded in the provided context (no hallucination).	RAG systems, agents that cite tool outputs.
`model_graded_qa`	LLM-as-judge: did the response correctly answer the question?	General-purpose Q&A.
`match`	Exact / substring match against a target.	Structured outputs, classification.
`f1`	Token-level F1 against a target.	Extraction, summarization.

Custom Inspect tasks

If your team already has Inspect tasks defined, point whatifd at them:

scorer:
  type: inspect_ai
  task: my_team.tasks.triage_quality   # Python module path
  judge_model: claude-opus-4-7

whatifd resolves the task from your Python path, so anything Inspect can run, whatifd can score against.

Judge rationale

whatifd requires the scorer to return both a numeric score and a rationale string. The rationale flows into the Evidence section of the verdict report.

Inspect AI’s model_graded_qa and faithfulness scorers produce rationales by default. If you write a custom scorer, ensure the rationale is populated: numbers without a rationale are not trustworthy enough to ship from.
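
One way to enforce this in a custom scorer is to reject blank rationales before returning a result. A minimal sketch; `JudgeResult` and `require_rationale` are hypothetical names, not part of whatifd or Inspect AI.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgeResult:
    score: float
    rationale: str


def require_rationale(result: JudgeResult) -> JudgeResult:
    """Return the result unchanged, or raise if the rationale is blank."""
    if not result.rationale.strip():
        raise ValueError(f"score {result.score} has no rationale")
    return result


ok = require_rationale(JudgeResult(0.8, "Cites the retrieved policy text accurately."))
print(ok.score)
```

Inspect AI's Score object has an explanation field for this kind of text; whether whatifd reads the rationale from that field is an assumption here.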

Why Inspect AI rather than X

Alternative	Why not (yet)
RAGAS	On the v0.3 roadmap; covers RAG-specific metrics not in Inspect.
Promptfoo	Designed for golden-set CI; less natural fit for trace-driven workflows.
Custom in-house scorers	Plugin interface on the v0.3 roadmap; current path is a thin Inspect wrapper.
Hand-rolled judges	Possible via custom Inspect tasks.

Cost considerations

LLM-as-judge scoring is billed in judge-model tokens. A typical run:

40 traces (20 failures + 20 baseline) × 2 scorings (original + replayed) = 80 judge calls.
With Claude Haiku 4.5 as judge: ~$0.10–0.30 per whatifd fork invocation.
With Claude Opus as judge: ~$1–4 per invocation.

Use Haiku for high-frequency CI; reserve Opus for edge cases or release-gate runs.
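
The arithmetic above can be reproduced with a small estimator. The token counts and per-million-token prices below are illustrative placeholders chosen to land inside the Haiku range quoted above, not measured values or current pricing; `judge_calls` and `estimate_cost_usd` are hypothetical helper names.

```python
def judge_calls(traces: int, scorings_per_trace: int = 2) -> int:
    """Each trace is scored twice: once for the original, once for the replay."""
    return traces * scorings_per_trace


def estimate_cost_usd(
    calls: int,
    input_tokens_per_call: int,
    output_tokens_per_call: int,
    usd_per_million_input: float,
    usd_per_million_output: float,
) -> float:
    """Token-based cost estimate; every rate here is a caller-supplied assumption."""
    input_cost = calls * input_tokens_per_call * usd_per_million_input / 1_000_000
    output_cost = calls * output_tokens_per_call * usd_per_million_output / 1_000_000
    return input_cost + output_cost


calls = judge_calls(traces=40)
# Placeholder token counts and prices; substitute your provider's current rates.
cost = estimate_cost_usd(calls, 1_500, 200, usd_per_million_input=1.0, usd_per_million_output=5.0)
print(calls, round(cost, 2))
```

Because cost scales linearly with trace count and with tokens per call, long rubrics or verbose outputs move the estimate more than the choice between 20 and 40 traces.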