Datadog LLM Observability
Datadog LLM Observability is whatifd’s fourth supported trace source, shipped in v0.3 as the optional whatifd-datadog package. whatifd reads previously-ingested LLM spans from the LLM Observability Export API and projects each trace into the same RawTrace shape every other adapter produces — so you can fork Datadog-traced agent turns, replay them under a proposed change, and gate a PR on the verdict.
whatifd is read-only on Datadog. It never writes spans back.
What whatifd reads from Datadog
For each selected span (via GET /api/v2/llm-obs/v1/spans/events):
- input / output: the agent’s input and original output, returned by the Export API as SearchedIO ({value, messages}). Wrapped as Sensitive[str] at the boundary (cardinal #5); value is preferred, falling back to concatenated message content.
- trace_id + parent_id: group spans into traces and identify the root.
- span_kind (agent / workflow / llm / tool / …): identifies the root (orchestration kinds) and projects child spans, including tool spans, into the trace’s tool-call structure.
- tags / other attributes: pass through to RawTrace.metadata; PII-registered keys are wrapped as Sensitive[str].
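The projection rules above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the adapter's actual code: the helper names (`io_text`, `group_by_trace`, `pick_root`), the `span_id` field, the newline separator, and the choice of `agent` / `workflow` as orchestration kinds are all hypothetical. Only `value`, `messages`, `content`, `trace_id`, `parent_id`, and `span_kind` come from the description above.

```python
from collections import defaultdict

# Assumption: which span kinds count as "orchestration" roots.
ORCHESTRATION_KINDS = {"agent", "workflow"}


def io_text(io):
    """Prefer `value`; otherwise join the `content` of each message."""
    if not io:
        return ""
    value = io.get("value")
    if value:
        return value
    messages = io.get("messages") or []
    # Assumption: messages are concatenated with newlines.
    return "\n".join(m.get("content") or "" for m in messages)


def group_by_trace(spans):
    """Bucket span dicts by their `trace_id`."""
    traces = defaultdict(list)
    for span in spans:
        traces[span["trace_id"]].append(span)
    return dict(traces)


def pick_root(spans):
    """Pick a span whose parent is not in the trace, orchestration kinds first."""
    ids = {s["span_id"] for s in spans}
    candidates = [s for s in spans if s.get("parent_id") not in ids]
    for span in candidates:
        if span.get("span_kind") in ORCHESTRATION_KINDS:
            return span
    return candidates[0] if candidates else None


# Hypothetical sample spans, shaped after the fields listed above.
spans = [
    {"trace_id": "t1", "span_id": "a", "parent_id": None, "span_kind": "agent",
     "input": {"value": "What is 2+2?"}},
    {"trace_id": "t1", "span_id": "b", "parent_id": "a", "span_kind": "tool",
     "input": {"messages": [{"content": "add"}, {"content": "2 2"}]}},
]
traces = group_by_trace(spans)
root = pick_root(traces["t1"])
```

Here `root` is the `agent` span, and the tool span's input falls back to its joined message contents because it carries no `value`.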
Install
uv pip install whatifd whatifd-datadog
The adapter core is dependency-light and offline-testable. The Export API HTTP client needs httpx, which the [live] extra pulls in:
uv pip install "whatifd-datadog[live]"
The official datadog-api-client SDK does not expose a spans-read path, so whatifd-datadog wraps the documented Export API with a thin httpx client.
Credentials
Read from the environment (never from config):
| Variable | Notes |
|---|---|
| DD_API_KEY | Datadog API key. |
| DD_APP_KEY | Datadog Application key. The Export API requires both keys, not just the API key. |
| DD_SITE | Region; defaults to datadoghq.com. |
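Since the Export API needs both keys, it can help to fail fast before building the client. The helper below is a minimal sketch and not part of whatifd-datadog. The name `load_datadog_credentials` and the choice of `RuntimeError` are assumptions made for illustration.

```python
import os


def load_datadog_credentials(env=None):
    """Return (api_key, app_key, site), raising if either key is missing."""
    env = os.environ if env is None else env
    missing = [name for name in ("DD_API_KEY", "DD_APP_KEY") if not env.get(name)]
    if missing:
        # Hypothetical error type; the real CLI may report this differently.
        raise RuntimeError("missing Datadog credentials: " + ", ".join(missing))
    # DD_SITE is optional and falls back to the default region.
    return env["DD_API_KEY"], env["DD_APP_KEY"], env.get("DD_SITE", "datadoghq.com")


# Passing a plain dict instead of os.environ keeps the check testable offline.
creds = load_datadog_credentials({"DD_API_KEY": "api", "DD_APP_KEY": "app"})
```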
Wiring
DatadogTraceSource is span-iterator-shaped (like the Phoenix adapter): it takes a zero-arg spans_provider yielding normalized span dicts. whatifd_datadog.client.make_spans_provider builds one from the Export API:
import os

from whatifd_datadog import DatadogTraceSource
from whatifd_datadog.client import DatadogExportClient, make_spans_provider

# Credentials come from the environment, never from config.
client = DatadogExportClient(
    api_key=os.environ["DD_API_KEY"],
    app_key=os.environ["DD_APP_KEY"],
    site=os.environ.get("DD_SITE", "datadoghq.com"),
)

source = DatadogTraceSource(
    spans_provider=make_spans_provider(
        client,
        ml_app="my-agent",
        from_ts="now-24h",  # ALWAYS set a window (see below)
    ),
    # A trace is a "failure" if any of its spans carries the whatifd:failure tag.
    cohort_classifier=lambda spans: (
        "failure"
        if any("whatifd:failure" in (s.get("tags") or []) for s in spans)
        else "baseline"
    ),
)
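Because spans_provider is only a zero-arg callable that yields span dicts, it can be stubbed with fixtures for offline tests. The sketch below is illustrative. `make_fixture_provider` and `classify_cohort` are hypothetical names, the span dicts are made-up samples, and the sketch assumes the classifier receives the spans of one trace with `tags` as a list of strings, as the lambda above suggests.

```python
def classify_cohort(spans):
    """Same rule as the lambda above, written as a named function."""
    tagged = any("whatifd:failure" in (s.get("tags") or []) for s in spans)
    return "failure" if tagged else "baseline"


def make_fixture_provider(spans):
    """Return a zero-arg callable that yields the given span dicts."""
    def provider():
        yield from spans
    return provider


# Hypothetical fixture spans; a missing or None `tags` value must not crash.
fixture = [
    {"trace_id": "t1", "span_kind": "agent", "tags": ["env:ci", "whatifd:failure"]},
    {"trace_id": "t1", "span_kind": "tool", "tags": None},
]
spans_provider = make_fixture_provider(fixture)
label = classify_cohort(list(spans_provider()))

# Hypothetical wiring, mirroring the example above:
# source = DatadogTraceSource(spans_provider=spans_provider,
#                             cohort_classifier=classify_cohort)
```

A named classifier is easier to unit-test than an inline lambda, and the fixture provider can be called repeatedly because each call creates a fresh generator.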
Warning
Always bound the time window. The Export API defaults to the last 15 minutes if no window is given. make_spans_provider requires an explicit from_ts (e.g. "now-24h") and raises ValueError otherwise — a forgotten window would silently yield a near-empty cohort.
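To make the window requirement concrete, here is a sketch of how one span-search request might be assembled without sending it. The function name `build_export_request`, the `https://api.{site}` host pattern, and the `filter[...]` parameter names are assumptions about the Export API request shape and should be checked against the Datadog documentation. The sketch is not the client's actual implementation.

```python
def build_export_request(site, api_key, app_key, ml_app, from_ts, to_ts="now", query=""):
    """Assemble URL, headers, and query params for one call (nothing is sent)."""
    if not from_ts:
        # Mirrors the behavior described above: refuse to fall back silently
        # to the API's 15-minute default window.
        raise ValueError("from_ts is required, e.g. 'now-24h'")
    url = f"https://api.{site}/api/v2/llm-obs/v1/spans/events"
    headers = {"DD-API-KEY": api_key, "DD-APPLICATION-KEY": app_key}
    # Assumed parameter names; verify against the Export API reference.
    params = {"filter[ml_app]": ml_app, "filter[from]": from_ts, "filter[to]": to_ts}
    if query:
        params["filter[query]"] = query
    return url, headers, params


url, headers, params = build_export_request(
    "datadoghq.com", "api-key", "app-key", ml_app="my-agent", from_ts="now-24h"
)
```

With the [live] extra installed, the returned values could be passed to `httpx.get(url, headers=headers, params=params)`.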
From config (whatifd.config.yaml)
source:
  adapter: datadog
  dd_ml_app: my-agent
  dd_from: now-24h   # required for adapter: datadog
  dd_to: now         # optional, defaults to "now"
  dd_query: ""       # optional span-search filter
whatifd fork reads DD_API_KEY / DD_APP_KEY / DD_SITE from the environment, builds the client, and wires a default tag-based cohort classifier (whatifd:failure).
Statistical honesty
cluster_key_support() returns an empty tuple — whatifd does not mine Datadog session_id / trace_id as cluster keys (cardinal #10: no unannounced inferential commitments).
Sending verdict metrics back to Datadog
whatifd-datadog also ships a CI-side metrics emitter — the inverse direction. After whatifd fork writes its report, push the verdict + cohort metrics to Datadog so dashboards and monitors can track Ship-rate and regression trends:
whatifd fork --config whatifd.config.yaml
whatifd-datadog-emit reports/whatifd-fork-2026-06-04.json --tag service:my-agent
Emits gauges (agentless, POST /api/v1/series, needs DD_API_KEY):
- whatifd.verdict.code: 0 = ship / 1 = dont_ship / 2 = inconclusive (matches the CLI exit code; alert on > 0).
- whatifd.cohort.{selected,replayed,scored,improved,regressed,unchanged,median_delta,ci_lower,ci_upper,floor_passed,regression_ratio,improvement_ratio} (tagged cohort:<name>).
- whatifd.findings.blocking.
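As a rough picture of what such a submission could contain, the sketch below turns a report dict into a v1 series payload without sending anything. The report layout (`verdict`, `cohorts`, `name`) and the function name `build_series` are assumptions made for illustration, since the real report schema is not shown here. Only the metric names and the verdict-code mapping come from the list above.

```python
import time

# Mapping taken from the metric description above.
VERDICT_CODES = {"ship": 0, "dont_ship": 1, "inconclusive": 2}

# A subset of the cohort fields listed above, kept short for the sketch.
COHORT_FIELDS = ("selected", "replayed", "scored", "improved", "regressed", "unchanged")


def build_series(report, extra_tags=(), now=None):
    """Build a gauge payload for POST /api/v1/series (nothing is sent)."""
    ts = int(time.time() if now is None else now)
    base_tags = list(extra_tags)

    def gauge(metric, value, tags):
        return {"metric": metric, "type": "gauge",
                "points": [[ts, float(value)]], "tags": tags}

    series = [gauge("whatifd.verdict.code", VERDICT_CODES[report["verdict"]], base_tags)]
    for cohort in report.get("cohorts", []):
        tags = base_tags + [f"cohort:{cohort['name']}"]
        for field in COHORT_FIELDS:
            if field in cohort:
                series.append(gauge(f"whatifd.cohort.{field}", cohort[field], tags))
    return {"series": series}


# Hypothetical report contents.
report = {"verdict": "dont_ship",
          "cohorts": [{"name": "failure", "selected": 10, "regressed": 3}]}
payload = build_series(report, extra_tags=["service:my-agent"], now=1700000000)
```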
The emitter lives outside whatifd core (it only reads the already-written report) and soft-fails by default (exit 0 on emission error), so a metrics hiccup can never turn a green verdict red in CI. Pass --strict to make emission errors fail the job, or --dry-run to print the metrics without submitting.
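The soft-fail policy can be summarized as a small decision function. This is a sketch of the behavior described above rather than the emitter's source. The name `emit_exit_code` is hypothetical, and using exit code 1 for a strict failure is an assumption (the text only says emission gates under --strict).

```python
def emit_exit_code(submit, strict=False, dry_run=False):
    """Return the process exit code for one emission attempt."""
    if dry_run:
        return 0  # print-only mode never submits
    try:
        submit()
    except Exception:
        # Soft-fail by default; only strict mode lets an emission error gate CI.
        return 1 if strict else 0
    return 0


def failing_submit():
    raise RuntimeError("Datadog intake unreachable")


soft = emit_exit_code(failing_submit)                 # default: swallow the error
hard = emit_exit_code(failing_submit, strict=True)    # --strict: fail the job
```

Keeping the verdict (from whatifd fork) and the emission result as separate exit codes is what prevents a metrics outage from masking or overriding the actual verdict.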
Known limitation
Datadog tool spans carry input as a rendered string, not structured args, so whatifd-datadog leaves ToolSpan.args unpopulated — the use-original tool cache (#108) does not fill from Datadog traces. This matches the other adapters’ status; structured-args extraction is tracer-specific follow-up work.