Inspect AI¶
Inspect AI is whatifd’s default scorer. It’s a thoughtfully designed evaluation framework from the UK AI Security Institute (formerly the AI Safety Institute); whatifd wraps its scorers rather than reimplementing scoring from scratch. The adapter shipped in v0.1 as a programmatic-only surface; v0.2 added YAML config support via scorer.score_fn, so the entire pipeline is reachable from whatifd.config.yaml.
How whatifd uses Inspect AI¶
For each (original_output, replayed_output) pair, whatifd:
Constructs an Inspect AI Sample (input + target).
Runs the configured Inspect scorer against the original output, then against the replayed output.
Records both scores plus the judge’s rationale.
The diff between the two becomes the per-trace delta in the report.
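The steps above can be sketched as a small function. This is a minimal illustration under stated assumptions, not whatifd's actual implementation: `ScoredOutput`, `score_pair`, and the toy `overlap_judge` are hypothetical names, the real adapter delegates scoring to an Inspect scorer rather than a plain callable, and the sign convention (replayed minus original) is assumed.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ScoredOutput:
    score: float      # numeric score from the judge
    rationale: str    # judge's explanation for the score


# Hypothetical judge signature: (input, target, output) -> ScoredOutput
Judge = Callable[[str, str, str], ScoredOutput]


def score_pair(judge: Judge, input_text: str, target: str,
               original_output: str, replayed_output: str) -> dict:
    """Score both outputs against the same sample and return the delta."""
    original = judge(input_text, target, original_output)
    replayed = judge(input_text, target, replayed_output)
    return {
        "original": original,
        "replayed": replayed,
        # Assumed convention: positive delta means the replay scored higher.
        "delta": replayed.score - original.score,
    }


def overlap_judge(input_text: str, target: str, output: str) -> ScoredOutput:
    """Toy stand-in for an LLM judge: fraction of target words found in output."""
    target_words = target.lower().split()
    output_words = set(output.lower().split())
    hits = sum(1 for w in target_words if w in output_words)
    score = hits / len(target_words) if target_words else 0.0
    return ScoredOutput(score, f"{hits}/{len(target_words)} target words present")
```

Calling `score_pair(overlap_judge, question, target, original, replayed)` returns both scored outputs with their rationales alongside the delta, which mirrors what the report records per trace.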
Configuration¶
whatifd fork is config-file driven. The scorer block in whatifd.config.yaml:
    scorer:
      adapter: inspect_ai
      score_fn: "python:my_pkg.scorers:faithfulness"  # your Inspect score function
      judge_provider: anthropic
      judge_model_id: claude-haiku-4-5
      rubric_id: faithfulness-v1
      rubric_text: "Score 0-1 by faithfulness to the original output."
      cache_mode: auto  # auto | on | off | read_only | refresh
      # optional:
      # judge_model_snapshot: claude-haiku-4-5-20251001
      # scoring_parameters: { temperature: 0.0, max_tokens: 1024 }
scorer.score_fn is a python:<module.path>:<attr> reference — same shape as target.runner. The resolver imports the module and calls getattr(module, attr); the resulting callable is wired into InspectAIScorer.score_fn. The five fields score_fn, judge_provider, judge_model_id, rubric_id, rubric_text are all required when adapter: inspect_ai; misconfigured runs fail at config-load with a named-field error.
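The resolution and validation described above might look roughly like the following. This is a sketch, not whatifd's resolver: `resolve_reference` and `check_required` are hypothetical names, and the exact error wording is an assumption. Only the reference shape, the import-then-`getattr` behaviour, and the list of required fields come from the text.

```python
import importlib
from typing import Any, Mapping

# The five fields the text lists as required when adapter is inspect_ai.
REQUIRED_FIELDS = ("score_fn", "judge_provider", "judge_model_id",
                   "rubric_id", "rubric_text")


def resolve_reference(ref: str) -> Any:
    """Resolve a 'python:<module.path>:<attr>' string to the named object."""
    scheme, _, rest = ref.partition(":")
    module_path, _, attr = rest.partition(":")
    if scheme != "python" or not module_path or not attr:
        raise ValueError(f"expected 'python:<module.path>:<attr>', got {ref!r}")
    module = importlib.import_module(module_path)
    return getattr(module, attr)


def check_required(scorer_block: Mapping[str, Any]) -> None:
    """Raise an error naming the first required field that is missing or empty."""
    for field in REQUIRED_FIELDS:
        if not scorer_block.get(field):
            # Hypothetical message format; the real error text may differ.
            raise ValueError(f"scorer.{field} is required when adapter is inspect_ai")
```

Because the reference is resolved with a plain import, the module must be importable from the environment in which whatifd runs.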
Programmatic API (still supported)¶
If you’d rather construct InspectAIScorer in Python (e.g., for a custom test harness or non-CLI workflow), the programmatic path is still first-class:
    from inspect_ai.scorer import faithfulness
    from whatifd_inspect_ai import InspectAIScorer
    from whatifd.pipeline import run_pipeline

    scorer = InspectAIScorer(
        score_fn=faithfulness(),
        judge_provider="anthropic",
        judge_model_id="claude-haiku-4-5",
        rubric_id="faithfulness-v1",
        rubric_text="Score 0-1 by faithfulness to the original output.",
    )

    report = run_pipeline(
        trace_source=...,
        delta_fn=...,
        floor=...,
        policy=...,
        runtime=...,
        methodology=...,
        cache_summary=...,
    )
The CLI YAML form above and the programmatic form here construct identical InspectAIScorer instances; pick whichever fits your workflow.
Built-in scorers (v0.1)¶
| Task | What it measures | When to use |
|---|---|---|
|  | Whether the response is grounded in the provided context (no hallucination). | RAG systems, agents that cite tool outputs. |
|  | LLM-as-judge: did the response correctly answer the question? | General-purpose Q&A. |
|  | Exact / substring match against a target. | Structured outputs, classification. |
|  | Token-level F1 against a target. | Extraction, summarization. |
Custom Inspect tasks¶
If your team already has Inspect tasks defined, point whatifd at them:
    scorer:
      type: inspect_ai
      task: my_team.tasks.triage_quality  # Python module path
      judge_model: claude-opus-4-7
whatifd resolves the task from your Python path, so anything Inspect can run, whatifd can score against.
Judge rationale¶
whatifd requires the scorer to return both a numeric score and a rationale string. The rationale flows into the Evidence section of the verdict report.
Inspect AI’s model_graded_qa and faithfulness scorers produce rationales by default. If you write a custom scorer, ensure the rationale is populated; numbers without rationale are not trustworthy enough to ship from.
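One way to enforce this in a custom scorer is to reject blank rationales before they reach the report. The sketch below is illustrative only: `JudgeResult` and `require_rationale` are hypothetical names, not part of whatifd or Inspect AI. In Inspect itself, a `Score` carries an `explanation` field, which is a natural place for this text; how whatifd maps that field is not specified here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgeResult:
    score: float
    rationale: str


def require_rationale(result: JudgeResult) -> JudgeResult:
    """Return the result unchanged, or raise if the rationale is blank."""
    if not result.rationale or not result.rationale.strip():
        raise ValueError(f"score {result.score} has no rationale")
    return result
```

Wrapping the last step of a custom scorer in a guard like this turns a silent gap in the Evidence section into an immediate, visible failure.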
Why Inspect AI rather than X¶
| Alternative | Why not (yet) |
|---|---|
| RAGAS | On the v0.3 roadmap; covers RAG-specific metrics not in Inspect. |
| Promptfoo | Designed for golden-set CI; less natural fit for trace-driven workflows. |
| Custom in-house scorers | Plugin interface on the v0.3 roadmap; current path is a thin Inspect wrapper. |
| Hand-rolled judges | Possible via custom Inspect tasks. |
Cost considerations¶
LLM-as-judge scoring costs LLM tokens. A typical run:
40 traces (20 failures + 20 baseline) × 2 scorings (original + replayed) = 80 judge calls.
With Claude Haiku 4.5 as judge: ~$0.10–0.30 per whatifd fork invocation.
With Claude Opus as judge: ~$1–4 per invocation.
Use Haiku for high-frequency CI; reserve Opus for edge cases or release-gate runs.
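The arithmetic above can be checked with a few lines of Python. The per-call figures are simply the quoted per-invocation ranges divided by the number of judge calls; they are derived for illustration, not published prices, and the function names are hypothetical.

```python
def judge_calls(failure_traces: int, baseline_traces: int,
                scorings_per_trace: int = 2) -> int:
    """Each trace is scored twice: once for the original, once for the replay."""
    return (failure_traces + baseline_traces) * scorings_per_trace


def implied_cost_per_call(run_cost_low: float, run_cost_high: float,
                          calls: int) -> tuple[float, float]:
    """Back out a per-call cost range (USD) from a per-invocation range."""
    return run_cost_low / calls, run_cost_high / calls


calls = judge_calls(20, 20)                       # 40 traces x 2 scorings
haiku = implied_cost_per_call(0.10, 0.30, calls)  # range quoted for Haiku 4.5
opus = implied_cost_per_call(1.0, 4.0, calls)     # range quoted for Opus
```

Scaling the trace counts in `judge_calls` gives a quick estimate of how the judge bill grows for larger runs, assuming cost per call stays roughly constant.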