The most-cited result in this space is a ceiling, not a feature: LLM agents over raw FHIR cap near 50% answer-correctness. bonfireDB's projection layer is built to break it — and we publish the method and the harness so you can re-run the result yourself.
Agent answer-correctness over the clinical record is the number that decides whether AI-native health apps are reliable. We're standing up a public, reproducible scoreboard against it — the control reproduced first, our delta when it's measured. Whoever owns this metric owns the category; we'd rather earn it in the open than claim it.
~50% — FHIR-AgentBench's best-agent result over raw FHIR (published, cited). Our apples-to-apples control adds Medplum-MCP-as-shipped on the same data — that head-to-head is not yet measured.
measuring — clean projections + sandboxed SQL + scoped, cited tools. The delta publishes with the harness, not before it's real.
Follow along. The seed script, task adapters, both tool sets, the judge prompt, and per-model logs ship together when the first number lands. A credentialed skeptic who re-runs it becomes a citation.
FHIR-AgentBench (Verily / MIT / KAIST, 2,931 real clinical questions) found frontier models all cluster at 44–50% under the best agent architecture. A bigger model doesn't move it — the data interface does. That's exactly the layer bonfireDB builds.
A benchmark is only proof if a skeptic can re-run it. Here's the protocol we hold ourselves to.
Same model, same tasks, same real MIMIC-IV-FHIR data, same token budget, same judge. The single variable is raw-FHIR access vs the bonfireDB layer.
The control is raw FHIR + a competent agent + Medplum's MCP as shipped. We publish the baseline's tools. The claim is "beats Medplum-MCP-as-shipped," not "beats a crippled REST loop."
Tools are tuned on a train split and scored on a frozen held-out split, across ≥3 model families, sliced by failure-mode category — so the lift can't be benchmark-tuning.
Seed script, task adapters, both tool sets, the judge prompt, and per-model logs ship together. A skeptic who re-runs it becomes a citation.
Where this stands today: the ~50% figure is the published, peer-reviewed ceiling we're attacking. The bonfireDB before/after is a capability claim with an open methodology — not a measured number we're asking you to trust yet. When we run it, we publish the harness with it. We'd rather show you the method now than a number you can't reproduce.
Why this matters for what you'd build: the backend for an AI medical scribe is the kind of agent workflow this ceiling caps — and the projection layer is what's designed to lift it.
Become a design partner for early access and the eval drop.