Inspect AI

Inspect AI is whatifd’s default scorer. It is a thoughtfully designed evaluation framework from the UK AI Safety Institute; whatifd wraps its scorers rather than reimplementing scoring from scratch. The adapter shipped in v0.1 as a programmatic-only surface; v0.2 added YAML config support via scorer.score_fn, so the entire pipeline is reachable from whatifd.config.yaml.

How whatifd uses Inspect AI

For each (original_output, replayed_output) pair, whatifd:

  1. Constructs an Inspect AI Sample (input + target).

  2. Runs the configured Inspect scorer against the original output, then against the replayed output.

  3. Records both scores plus the judge’s rationale.

  4. Reports the difference between the two scores as the per-trace delta.
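The steps above can be sketched in a few lines. This is an illustration only, not whatifd's implementation: PairResult, score_pair, and toy_judge are hypothetical names, the judge is a string-matching stand-in rather than an Inspect scorer, and the sign convention (replayed minus original) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class PairResult:
    """Scores and rationales for one (original, replayed) pair. Hypothetical type."""
    original_score: float
    replayed_score: float
    original_rationale: str
    replayed_rationale: str

    @property
    def delta(self) -> float:
        # Assumed convention: positive means the replayed output scored higher.
        return self.replayed_score - self.original_score


def score_pair(judge, sample_input, target, original_output, replayed_output):
    """Score the original output, then the replayed output, with the same judge."""
    orig_score, orig_why = judge(sample_input, target, original_output)
    new_score, new_why = judge(sample_input, target, replayed_output)
    return PairResult(orig_score, new_score, orig_why, new_why)


def toy_judge(sample_input, target, output):
    """Stand-in judge: 1.0 if the target string appears in the output, else 0.0."""
    hit = target.lower() in output.lower()
    return (1.0 if hit else 0.0), ("target found" if hit else "target missing")


result = score_pair(
    toy_judge,
    "What is the capital of France?",
    "Paris",
    original_output="I think it is Lyon.",
    replayed_output="The capital is Paris.",
)
print(result.delta)  # 1.0
```

Keeping both rationales next to both scores means a delta can always be traced back to what the judge said about each output.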

Configuration

whatifd fork is config-file driven. The scorer block in whatifd.config.yaml:

scorer:
  adapter: inspect_ai
  score_fn: "python:my_pkg.scorers:faithfulness"   # your Inspect score function
  judge_provider: anthropic
  judge_model_id: claude-haiku-4-5
  rubric_id: faithfulness-v1
  rubric_text: "Score 0-1 by faithfulness to the original output."
  cache_mode: auto                                  # auto | on | off | read_only | refresh
  # optional:
  # judge_model_snapshot: claude-haiku-4-5-20251001
  # scoring_parameters: { temperature: 0.0, max_tokens: 1024 }

scorer.score_fn is a python:<module.path>:<attr> reference — same shape as target.runner. The resolver imports the module and calls getattr(module, attr); the resulting callable is wired into InspectAIScorer.score_fn. The five fields score_fn, judge_provider, judge_model_id, rubric_id, rubric_text are all required when adapter: inspect_ai; misconfigured runs fail at config-load with a named-field error.
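A minimal sketch of that resolution and the required-field check, assuming only what is described above. resolve_python_ref and missing_fields are hypothetical helper names, not whatifd's actual functions, and the demo resolves a standard-library attribute so it runs without any whatifd or Inspect install.

```python
import importlib
from typing import Any

# The five fields the text lists as required when adapter is inspect_ai.
REQUIRED_FIELDS = ("score_fn", "judge_provider", "judge_model_id", "rubric_id", "rubric_text")


def resolve_python_ref(ref: str) -> Any:
    """Resolve 'python:<module.path>:<attr>' to the attribute it names."""
    scheme, _, rest = ref.partition(":")
    module_path, _, attr = rest.partition(":")
    if scheme != "python" or not module_path or not attr:
        raise ValueError(f"expected 'python:<module.path>:<attr>', got {ref!r}")
    module = importlib.import_module(module_path)
    return getattr(module, attr)


def missing_fields(scorer_cfg: dict) -> list:
    """Return required fields that are absent or empty, in declaration order."""
    return [name for name in REQUIRED_FIELDS if not scorer_cfg.get(name)]


# Demonstrated against the standard library so the example runs anywhere.
sqrt = resolve_python_ref("python:math:sqrt")
print(sqrt(9.0))  # 3.0

# A config with only score_fn set is missing the other four fields.
print(missing_fields({"adapter": "inspect_ai", "score_fn": "python:math:sqrt"}))
```

Reporting the missing field names at load time, rather than failing on first use, is what makes a named-field error possible before any judge call is made.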

Programmatic API (still supported)

If you’d rather construct InspectAIScorer in Python (e.g., for a custom test harness or non-CLI workflow), the programmatic path is still first-class:

from inspect_ai.scorer import faithfulness
from whatifd_inspect_ai import InspectAIScorer
from whatifd.pipeline import run_pipeline

# Same five required fields as the YAML scorer block.
scorer = InspectAIScorer(
    score_fn=faithfulness(),
    judge_provider="anthropic",
    judge_model_id="claude-haiku-4-5",
    rubric_id="faithfulness-v1",
    rubric_text="Score 0-1 by faithfulness to the original output.",
)

# Pipeline inputs are elided in this example; supply your own values.
report = run_pipeline(
    trace_source=...,
    delta_fn=...,
    floor=...,
    policy=...,
    runtime=...,
    methodology=...,
    cache_summary=...,
)

The CLI YAML form above and the programmatic form here construct identical InspectAIScorer instances; pick whichever fits your workflow.

Built-in scorers (v0.1)

| Task | What it measures | When to use |
| --- | --- | --- |
| faithfulness | Whether the response is grounded in the provided context (no hallucination). | RAG systems, agents that cite tool outputs. |
| model_graded_qa | LLM-as-judge: did the response correctly answer the question? | General-purpose Q&A. |
| match | Exact / substring match against a target. | Structured outputs, classification. |
| f1 | Token-level F1 against a target. | Extraction, summarization. |

Custom Inspect tasks

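If your team already has Inspect tasks with their own score functions, point whatifd at them through score_fn. The block uses the same keys as the Configuration section, including all five required fields:

scorer:
  adapter: inspect_ai
  score_fn: "python:my_team.tasks:triage_quality"   # python:<module.path>:<attr>
  judge_provider: anthropic
  judge_model_id: claude-opus-4-7
  rubric_id: triage-quality-v1                      # placeholder; use your own rubric ID
  rubric_text: "..."                                # placeholder; use your own rubric text

whatifd resolves the reference from your Python path, so any Inspect score function you can import, whatifd can score against.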

Judge rationale

whatifd requires the scorer to return both a numeric score and a rationale string. The rationale flows into the Evidence section of the verdict report.

Inspect AI’s model_graded_qa and faithfulness scorers produce rationales by default. If you write a custom scorer, ensure the rationale is populated: a number without a rationale is not trustworthy enough to base a ship decision on.
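One way to enforce this in a custom scorer is a small guard that rejects any score with an empty rationale. The sketch below is an assumption, not part of whatifd: JudgedScore and require_rationale are hypothetical names, and a stand-in type is used so the example runs without Inspect installed. In a real Inspect scorer the rationale would most likely be carried in the Score's explanation field.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgedScore:
    """Hypothetical container for what a scorer hands back."""
    value: float
    rationale: str


def require_rationale(score: JudgedScore) -> JudgedScore:
    """Reject scores whose rationale is missing or whitespace-only."""
    if not score.rationale or not score.rationale.strip():
        raise ValueError(f"score {score.value} has no rationale")
    return score


# A score with a real rationale passes through unchanged.
ok = require_rationale(JudgedScore(0.8, "Cites the retrieved passage for every claim."))

# A whitespace-only rationale is rejected.
try:
    require_rationale(JudgedScore(0.8, "   "))
    rejected = False
except ValueError:
    rejected = True

print(ok.value, rejected)
```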

Why Inspect AI rather than X

| Alternative | Why not (yet) |
| --- | --- |
| RAGAS | On the v0.3 roadmap; covers RAG-specific metrics not in Inspect. |
| Promptfoo | Designed for golden-set CI; less natural fit for trace-driven workflows. |
| Custom in-house scorers | Plugin interface on the v0.3 roadmap; current path is a thin Inspect wrapper. |
| Hand-rolled judges | Possible via custom Inspect tasks. |

Cost considerations

LLM-as-judge scoring consumes judge-model tokens on every call. A typical run:

  • 40 traces (20 failures + 20 baseline) × 2 scorings (original + replayed) = 80 judge calls.

  • With Claude Haiku 4.5 as judge: ~$0.10–0.30 per whatifd fork invocation.

  • With Claude Opus as judge: ~$1–4 per invocation.

Use Haiku for high-frequency CI; reserve Opus for edge cases or release-gate runs.
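The arithmetic behind these figures can be reproduced and rescaled for other run sizes. This sketch uses only the numbers quoted above; the function names are hypothetical, and the linear scaling assumes every judge call uses roughly the same number of tokens, which will not hold exactly in practice.

```python
def judge_calls(traces: int, scorings_per_trace: int = 2) -> int:
    """Each trace is scored twice: original output and replayed output."""
    return traces * scorings_per_trace


def per_call_range(run_cost_low: float, run_cost_high: float, calls: int):
    """Implied per-call cost range (USD), given a per-run cost range."""
    return run_cost_low / calls, run_cost_high / calls


def scale_run_cost(cost_range, base_calls: int, new_calls: int):
    """Rescale a per-run cost range linearly with the number of judge calls."""
    low, high = cost_range
    factor = new_calls / base_calls
    return low * factor, high * factor


calls = judge_calls(20 + 20)  # 20 failures + 20 baseline -> 80 calls
haiku_per_call = per_call_range(0.10, 0.30, calls)
opus_per_call = per_call_range(1.00, 4.00, calls)

# Estimated Haiku cost range for a 200-trace run (400 judge calls).
haiku_200 = scale_run_cost((0.10, 0.30), calls, judge_calls(200))

print(calls, haiku_per_call, opus_per_call, haiku_200)
```

With the quoted ranges, a single Opus judge call costs about ten times a Haiku call, which is the basis for reserving Opus for release-gate runs.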