Agents answer questions over raw FHIR correctly ~50% of the time. We're fixing that, in the open.

The most-cited result in this space is a ceiling, not a feature: LLM agents over raw FHIR cap near 50% answer-correctness. bonfireDB's projection layer is built to break it — and we publish the method and the harness so you can re-run the result yourself.

The scoreboard

The metric that defines the category — tracked in the open

Agent answer-correctness over the clinical record is the number that decides whether AI-native health apps are reliable. We're standing up a public, reproducible scoreboard against it: we reproduce the control first, then publish our delta once it is measured. Whoever owns this metric owns the category; we'd rather earn it in the open than claim it.

Published raw-FHIR ceiling

~50% — FHIR-AgentBench's best-agent result over raw FHIR (published, cited). Our apples-to-apples control adds Medplum-MCP-as-shipped on the same data — that head-to-head is not yet measured.

bonfireDB agent layer

measuring — clean projections + sandboxed SQL + scoped, cited tools. The delta publishes with the harness, not before it's real.

Follow along. The seed script, task adapters, both tool sets, the judge prompt, and per-model logs ship together when the first number lands. A credentialed skeptic who re-runs it becomes a citation.

The ceiling

It's an architecture problem, not a model problem

FHIR-AgentBench (Verily / MIT / KAIST, 2,931 real clinical questions) found frontier models all cluster at 44–50% under the best agent architecture. A bigger model doesn't move it — the data interface does. That's exactly the layer bonfireDB builds.

Raw FHIR + a good agent

  • Doesn't chase references → answers with no drug name
  • Filters on display text, not the LOINC/SNOMED code
  • A single record is millions of tokens → context overflow
  • Cohort questions excluded entirely — retrieval explodes

bonfireDB agent layer

  • Reference-resolution: the agent gets the resolved fact
  • Code-resolution across LOINC/SNOMED/RxNorm
  • Clean projections — clinical context, not raw JSON
  • Aggregates made answerable, ABAC-scoped, under BAA
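
The first two failure modes above can be made concrete with a short sketch. It assumes FHIR R4 resource shapes held in a plain in-memory dict. The helpers `resolve_reference`, `medication_name`, and `has_code` are hypothetical names for illustration, not bonfireDB's API, and the sample resources (including the placeholder drug code) are invented.

```python
# Illustrative sketch only: hypothetical helpers and invented sample data.
LOINC = "http://loinc.org"
RXNORM = "http://www.nlm.nih.gov/research/umls/rxnorm"


def resolve_reference(ref, resources_by_id):
    """Follow a FHIR reference string such as 'Medication/med-1'."""
    return resources_by_id.get(ref)


def medication_name(med_request, resources_by_id):
    """Return a drug name whether the request inlines it or points to it."""
    # FHIR R4 MedicationRequest carries medication[x]: either an inline
    # CodeableConcept or a reference to a separate Medication resource.
    concept = med_request.get("medicationCodeableConcept")
    if concept is None:
        ref = med_request.get("medicationReference", {}).get("reference")
        target = resolve_reference(ref, resources_by_id) or {}
        concept = target.get("code", {})
    if concept.get("text"):
        return concept["text"]
    for coding in concept.get("coding", []):
        if coding.get("display"):
            return coding["display"]
    return None


def has_code(resource, system, code):
    """Match on (system, code), never on free-text display strings."""
    codings = resource.get("code", {}).get("coding", [])
    return any(c.get("system") == system and c.get("code") == code
               for c in codings)


store = {
    "Medication/med-1": {
        "resourceType": "Medication",
        "id": "med-1",
        # "EXAMPLE-1" is a placeholder, not a real RxNorm code.
        "code": {"coding": [{"system": RXNORM, "code": "EXAMPLE-1",
                             "display": "Metformin 500 MG Oral Tablet"}]},
    }
}
request = {
    "resourceType": "MedicationRequest",
    "medicationReference": {"reference": "Medication/med-1"},
}
observation = {
    "resourceType": "Observation",
    # The display text is site-specific; LOINC 2345-7 identifies glucose.
    "code": {"coding": [{"system": LOINC, "code": "2345-7",
                         "display": "GLU (lab panel 7)"}]},
}

print(medication_name(request, store))
print(has_code(observation, LOINC, "2345-7"))
```

An agent that stops at the MedicationRequest sees only the string `Medication/med-1`; one that filters Observations by searching for the word "glucose" misses this record because the display text is a local label.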

The method

Apples-to-apples, or it doesn't count

A benchmark is only proof if a skeptic can re-run it. Here's the protocol we hold ourselves to.

1

Only the tool layer changes

Same model, same tasks, same real MIMIC-IV-FHIR data, same token budget, same judge. The single variable is raw-FHIR access vs the bonfireDB layer.
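
One way to make "only the tool layer changes" checkable rather than asserted is to encode each run as an immutable config and verify that the two arms differ in exactly one field. This is a minimal sketch under assumptions: `RunConfig`, `paired_arms`, `differing_fields`, and every field value are hypothetical and are not taken from the published harness.

```python
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class RunConfig:
    """Everything that defines one benchmark run (hypothetical fields)."""
    model: str
    task_set: str
    dataset: str
    token_budget: int
    judge: str
    tool_layer: str


def paired_arms(control, treatment_layer="bonfiredb-layer"):
    """Derive the treatment arm by swapping only the tool layer."""
    return control, replace(control, tool_layer=treatment_layer)


def differing_fields(a, b):
    """List the field names on which two configs disagree."""
    left, right = asdict(a), asdict(b)
    return [name for name in left if left[name] != right[name]]


control = RunConfig(
    model="model-a",                  # placeholder model name
    task_set="heldout-tasks",         # placeholder task-set label
    dataset="mimic-iv-fhir",
    token_budget=32_000,              # placeholder budget
    judge="judge-prompt-v1",          # placeholder judge label
    tool_layer="raw-fhir-medplum-mcp",
)
control, treatment = paired_arms(control)

# The comparison is only valid if this list is exactly ["tool_layer"].
print(differing_fields(control, treatment))
```

Freezing the dataclass means a run cannot quietly change its token budget or judge after the pair is created, so the single-variable check holds for the whole run.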

2

A strong baseline, not a strawman

The control is raw FHIR + a competent agent + Medplum's MCP as shipped. We publish the baseline's tools. The claim is "beats Medplum-MCP-as-shipped," not "beats a crippled REST loop."

3

Held-out and multi-model

Tools are tuned on a train split and scored on a frozen held-out split, across ≥3 model families, sliced by failure-mode category — so the lift can't be benchmark-tuning.
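
A frozen held-out split and per-category scoring could be sketched as follows. The assumptions are loud ones: the hash-based assignment, the 30% held-out fraction, the function names `assign_split` and `accuracy_by_slice`, and the sample results are all illustrative and do not describe the actual harness or any measured outcome.

```python
import hashlib
from collections import defaultdict


def assign_split(task_id, heldout_fraction=0.3):
    """Deterministically assign a task to 'train' or 'heldout'.

    Hashing the task id means the split never changes between runs,
    so tools tuned on train cannot leak into the held-out score.
    """
    digest = hashlib.sha256(task_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return "heldout" if bucket < heldout_fraction else "train"


def accuracy_by_slice(results):
    """Held-out accuracy keyed by (model, failure-mode category)."""
    totals = defaultdict(lambda: [0, 0])  # key -> [correct, seen]
    for row in results:
        if row["split"] != "heldout":
            continue  # train rows are for tuning, never for scoring
        key = (row["model"], row["category"])
        totals[key][0] += int(row["correct"])
        totals[key][1] += 1
    return {key: correct / seen for key, (correct, seen) in totals.items()}


# Invented rows, only to show the shape of the computation.
sample_results = [
    {"model": "model-a", "category": "reference", "split": "heldout", "correct": True},
    {"model": "model-a", "category": "reference", "split": "heldout", "correct": False},
    {"model": "model-a", "category": "coding", "split": "heldout", "correct": True},
    {"model": "model-a", "category": "coding", "split": "train", "correct": False},
]
print(accuracy_by_slice(sample_results))
```

Slicing by category matters because a lift concentrated in one failure mode reads very differently from a uniform one, and reporting the same slices for each model family shows whether the gain tracks the tool layer rather than a single model.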

4

Reproducible by anyone

Seed script, task adapters, both tool sets, the judge prompt, and per-model logs ship together. A skeptic who re-runs it becomes a citation.

Where this stands today: the ~50% figure is the published, peer-reviewed ceiling we're attacking. The bonfireDB before/after is a capability claim with an open methodology — not a measured number we're asking you to trust yet. When we run it, we publish the harness with it. We'd rather show you the method now than a number you can't reproduce.

Why this matters for what you'd build: the backend for an AI medical scribe is the kind of agent workflow this ceiling caps — and the projection layer is what's designed to lift it.

Follow the benchmark as we build it.

Become a design partner for early access and the eval drop.