FHIR MCP Server, Compiled

The problem

Agents over raw FHIR cap around 50%

Healthcare-AI demos look magical on a tidy patient. On a real record, the ceiling shows up fast — and it's not a prompt you can engineer your way out of.

~50% is the best agent, not the average

On FHIR-AgentBench, the strongest agent reaches roughly 50% answer correctness over raw FHIR. That's the best-case ceiling on faithfully answering questions about a record — not "fails half the time at random."

A patient is ~3M tokens of JSON

A real longitudinal record is on the order of 3 million tokens of FHIR JSON. You can't paste it into a context window. Naive RAG over it shreds the references that give the data meaning.

Agents skip reference-chasing

FHIR meaning lives in the links — Observation → Encounter → Condition. Agents routinely stop short of following those chains, so they answer from fragments instead of the connected record.
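
To make the chain concrete, here is a minimal sketch of following Observation → Encounter → Condition inside one in-memory record. The trimmed interfaces, the helper names (`indexResources`, `resolve`, `conditionsForObservation`), and the sample data are illustrative assumptions, not bonfireDB code. It only handles relative `Type/id` references, not absolute URLs or contained resources.

```typescript
// Minimal FHIR R4 shapes, trimmed to the fields this sketch touches.
interface Reference { reference?: string }
interface Observation { resourceType: "Observation"; id: string; encounter?: Reference }
interface Encounter { resourceType: "Encounter"; id: string; diagnosis?: { condition: Reference }[] }
interface Condition { resourceType: "Condition"; id: string; code?: { text?: string } }
type AnyResource = Observation | Encounter | Condition;

// Index resources by "Type/id" so relative references resolve with one lookup.
function indexResources(resources: AnyResource[]): Map<string, AnyResource> {
  const index = new Map<string, AnyResource>();
  for (const r of resources) index.set(`${r.resourceType}/${r.id}`, r);
  return index;
}

// Resolve a relative reference, or return undefined if it is missing or unknown.
function resolve(ref: Reference | undefined, index: Map<string, AnyResource>): AnyResource | undefined {
  return ref && ref.reference ? index.get(ref.reference) : undefined;
}

// Follow Observation -> Encounter -> Condition and return the conditions reached.
function conditionsForObservation(obs: Observation, index: Map<string, AnyResource>): Condition[] {
  const encounter = resolve(obs.encounter, index);
  if (!encounter || encounter.resourceType !== "Encounter") return [];
  const found: Condition[] = [];
  for (const dx of encounter.diagnosis ?? []) {
    const target = resolve(dx.condition, index);
    if (target && target.resourceType === "Condition") found.push(target);
  }
  return found;
}

// Hypothetical sample record: one score, the visit it was taken in, the diagnosis on that visit.
const phq9: Observation = { resourceType: "Observation", id: "obs-1", encounter: { reference: "Encounter/enc-1" } };
const visit: Encounter = { resourceType: "Encounter", id: "enc-1", diagnosis: [{ condition: { reference: "Condition/cond-1" } }] };
const diagnosis: Condition = { resourceType: "Condition", id: "cond-1", code: { text: "Depressive disorder" } };

const recordIndex = indexResources([phq9, visit, diagnosis]);
const linkedConditions = conditionsForObservation(phq9, recordIndex);
```

An agent that stops at the Observation sees a number. One that walks both hops sees the number, the visit, and the diagnosis it belongs to.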

Coded-value blindness

Agents filter on display text instead of the underlying codes. "Depression" the string is not the same query as the SNOMED CT / ICD code — and the difference is silent wrong answers.
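
A small sketch of the difference, assuming a hypothetical three-entry problem list. The SNOMED CT system URI is the standard FHIR one; the concept code is intended to be the SNOMED CT concept for depressive disorder and should be checked against your terminology server. The local system, the display strings, and the helper names are made up for illustration.

```typescript
interface Coding { system?: string; code?: string; display?: string }
interface Condition { id: string; code: { coding: Coding[] } }

const SNOMED = "http://snomed.info/sct";
// Assumed to be the SNOMED CT concept for depressive disorder; verify before relying on it.
const DEPRESSIVE_DISORDER = "35489007";

// Hypothetical problem list: two depression entries with different display text,
// plus an unrelated cardiac finding whose display happens to contain "depression".
const problemList: Condition[] = [
  { id: "c1", code: { coding: [{ system: SNOMED, code: DEPRESSIVE_DISORDER, display: "Depression" }] } },
  { id: "c2", code: { coding: [{ system: SNOMED, code: DEPRESSIVE_DISORDER, display: "Depressive d/o" }] } },
  { id: "c3", code: { coding: [{ system: "urn:example:local", code: "st-dep", display: "ST segment depression" }] } },
];

// What agents tend to do: case-insensitive substring match on display text.
function byDisplayText(list: Condition[], needle: string): string[] {
  const n = needle.toLowerCase();
  return list
    .filter((c) => c.code.coding.some((k) => (k.display ?? "").toLowerCase().indexOf(n) !== -1))
    .map((c) => c.id);
}

// What the question actually means: match on (system, code).
function byCode(list: Condition[], system: string, code: string): string[] {
  return list
    .filter((c) => c.code.coding.some((k) => k.system === system && k.code === code))
    .map((c) => c.id);
}

const textHits = byDisplayText(problemList, "depression"); // ["c1", "c3"]
const codeHits = byCode(problemList, SNOMED, DEPRESSIVE_DISORDER); // ["c1", "c2"]
```

The text filter misses `c2` and pulls in `c3`. Neither failure raises an error, which is why the wrong answer is silent.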

On MedAgentBench the asymmetry is just as telling: agents succeed on roughly 85% of reads but only about 54% of writes. Reading a record is hard; writing safely to one is harder. Both numbers are measured over raw FHIR.

The solution

Ship the MCP compiler, not the MCP

Every typed clinical function you write becomes three things automatically — a typed SDK method, an HTTP endpoint, and an MCP tool. One definition, three surfaces, generated schema and safety on all of them.

One function, three surfaces

Write a typed clinical function once. The compiler emits a typed SDK method, an HTTP endpoint, and an MCP tool — with a generated schema so the agent always knows the exact shape of inputs and outputs.
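
One way to picture the idea is a toy compiler that derives all three surfaces from a single definition. Everything here (`compileTool`, `ToolDef`, the `/tools/...` path) is a hypothetical sketch, not the bonfireDB implementation. The only outside convention it borrows is that an MCP tool is described by a `name`, a `description`, and a JSON Schema `inputSchema`.

```typescript
type FieldType = "string" | "number";
type Input = Record<string, string | number>;

interface ToolDef {
  name: string;
  description: string;
  input: Record<string, FieldType>;
  handler: (input: Input) => unknown;
}

interface Surfaces {
  sdk: (input: Input) => unknown;
  http: { method: "POST"; path: string };
  mcp: {
    name: string;
    description: string;
    inputSchema: { type: "object"; properties: Record<string, { type: FieldType }>; required: string[] };
  };
}

// Derive the schema once, then reuse it for validation and for the MCP descriptor.
function compileTool(def: ToolDef): Surfaces {
  const required = Object.keys(def.input);
  const properties: Record<string, { type: FieldType }> = {};
  for (const field of required) {
    const type = def.input[field];
    if (type) properties[field] = { type };
  }

  // The SDK surface validates against the same declared types before calling the handler.
  const sdk = (input: Input): unknown => {
    for (const field of required) {
      const expected = def.input[field];
      if (typeof input[field] !== expected) {
        throw new TypeError(`${def.name}: "${field}" must be a ${expected}`);
      }
    }
    return def.handler(input);
  };

  return {
    sdk,
    http: { method: "POST", path: `/tools/${def.name}` }, // hypothetical route convention
    mcp: { name: def.name, description: def.description, inputSchema: { type: "object", properties, required } },
  };
}

const compiled = compileTool({
  name: "summarizeRecentNotes",
  description: "Summarize a patient's recent notes",
  input: { patientId: "string", windowDays: "number" },
  handler: (i) => `notes for ${i.patientId} over ${i.windowDays} days`,
});
```

Because the schema, the validator, and the MCP descriptor come from the same `input` declaration, they cannot drift apart the way a hand-written second file can.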

You pick the toolset per agent

Different apps expose different tools. Choose exactly which functions a given agent can see and call. A scheduling assistant and a clinical-summary agent don't get the same surface area.
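
A minimal sketch of per-agent selection, assuming a flat registry and an allow-list per agent. `toolsetFor` and `proposeAppointment` are hypothetical names; the other tool names are the defaults described on this page.

```typescript
interface Tool { name: string; mode: "read" | "write" }

const registry: Tool[] = [
  { name: "sessionPrep", mode: "read" },
  { name: "semanticSearch", mode: "read" },
  { name: "listByPatient", mode: "read" },
  { name: "proposeAppointment", mode: "write" }, // hypothetical custom tool
];

// Return only the tools an agent is allowed to see; fail loudly on a typo in the allow-list.
function toolsetFor(allowed: string[], all: Tool[]): Tool[] {
  const known = new Set(all.map((t) => t.name));
  const missing = allowed.filter((n) => !known.has(n));
  if (missing.length > 0) throw new Error(`unknown tools: ${missing.join(", ")}`);
  const wanted = new Set(allowed);
  return all.filter((t) => wanted.has(t.name));
}

const schedulingTools = toolsetFor(["listByPatient", "proposeAppointment"], registry);
const summaryTools = toolsetFor(["sessionPrep", "semanticSearch", "listByPatient"], registry);
```

The scheduling assistant never sees `semanticSearch`, and the summary agent has no write-mode tool to call.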

Agents read clean projections

Tools return committed operational read models — notesByPatient, timeline, latestScores — not raw FHIR. Raw FHIR is an explicit escape hatch, not the default the agent stumbles through.
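
As a rough illustration of what a read model like `latestScores` could contain, the sketch below collapses a patient's raw score observations to the most recent value per instrument. The field names and the `phq9` / `gad7` keys are assumptions for the example, not the bonfireDB schema.

```typescript
// Simplified stand-in for a scored Observation; real FHIR carries codings, units, and references.
interface RawScore { instrument: string; value: number; effectiveDateTime: string }
interface LatestScore { instrument: string; value: number; asOf: string }

// Keep only the newest observation per instrument.
function latestScores(observations: RawScore[]): LatestScore[] {
  const latest = new Map<string, RawScore>();
  for (const o of observations) {
    const prev = latest.get(o.instrument);
    if (!prev || Date.parse(o.effectiveDateTime) > Date.parse(prev.effectiveDateTime)) {
      latest.set(o.instrument, o);
    }
  }
  return Array.from(latest.values()).map((o) => ({
    instrument: o.instrument,
    value: o.value,
    asOf: o.effectiveDateTime,
  }));
}

// Hypothetical history: two PHQ-9 scores and one GAD-7 score.
const projection = latestScores([
  { instrument: "phq9", value: 14, effectiveDateTime: "2024-01-10" },
  { instrument: "gad7", value: 7, effectiveDateTime: "2024-02-01" },
  { instrument: "phq9", value: 9, effectiveDateTime: "2024-03-02" },
]);
```

The agent gets two rows with dates attached instead of paging through every Observation to work out which one is current.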

Citations + freshness baked in

Every result is cited to its source record, and every read carries the freshness lifecycle — the agent knows whether a projection is fresh on commit or still pending.
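
One plausible shape for such a result is an envelope that wraps the data with its citations and a freshness state. The type names, the field names, and the `usableForAnswer` check below are assumptions for illustration; only the fresh-versus-pending distinction comes from the description above.

```typescript
interface Citation { resourceType: string; id: string }

// A projection is either fresh as of a commit, or still catching up.
type Freshness =
  | { state: "fresh"; committedAt: string }
  | { state: "pending"; since: string };

interface Cited<T> { data: T; citations: Citation[]; freshness: Freshness }

// A conservative rule an agent harness might apply before answering from a result.
function usableForAnswer<T>(result: Cited<T>): boolean {
  return result.freshness.state === "fresh" && result.citations.length > 0;
}

const freshNotes: Cited<string[]> = {
  data: ["Follow-up visit: mood improved"],
  citations: [{ resourceType: "DocumentReference", id: "doc-1" }],
  freshness: { state: "fresh", committedAt: "2024-03-02T10:00:00Z" },
};

const pendingNotes: Cited<string[]> = {
  data: [],
  citations: [],
  freshness: { state: "pending", since: "2024-03-02T10:00:05Z" },
};
```

With the state in the type, "the projection has not caught up yet" becomes something the agent can branch on rather than an empty list it might mistake for "no notes".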

Safe by default — not safe by reminder

The dangerous parts of agentic clinical software are the defaults. bonfireDB makes the safe path the only path you have to write code for.

Every tool is patient- and tenant-scoped automatically — the agent can't reach across the boundary even if it tries.
Writes are propose-only. The agent drafts; a human approves. No silent mutation of a clinical record.
Every agent read is audited — who, what, when, on whose behalf.
Every result is cited to the source record it came from. No ungrounded answers.
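
The first three guarantees can be pictured as a single dispatch path that every tool call passes through. This is a hypothetical sketch (`dispatch`, `Session`, `ToolCall`, and the in-memory `auditLog` are invented names), and a real system would enforce scope and auditing in the data layer rather than in application code.

```typescript
interface Session { actor: string; tenantId: string; patientId: string }
interface ToolCall { tool: string; mode: "read" | "write"; tenantId: string; patientId: string; payload: unknown }
interface AuditEntry { actor: string; tool: string; patientId: string; at: string }

type Outcome =
  | { kind: "result"; data: unknown }
  | { kind: "proposal"; status: "pending_approval"; payload: unknown };

const auditLog: AuditEntry[] = [];

function dispatch(session: Session, call: ToolCall, read: (call: ToolCall) => unknown): Outcome {
  // 1. Scope: reject anything outside the session's patient and tenant.
  if (call.tenantId !== session.tenantId || call.patientId !== session.patientId) {
    throw new Error(`${call.tool}: call is outside the session's patient/tenant scope`);
  }
  // 2. Audit: record who called what, when, and for which patient.
  auditLog.push({ actor: session.actor, tool: call.tool, patientId: call.patientId, at: new Date().toISOString() });
  // 3. Propose-only: a write never executes here; it becomes a draft awaiting human approval.
  if (call.mode === "write") {
    return { kind: "proposal", status: "pending_approval", payload: call.payload };
  }
  return { kind: "result", data: read(call) };
}

const session: Session = { actor: "agent:summary", tenantId: "t1", patientId: "p1" };
let readerCalls = 0;
const reader = (_call: ToolCall): unknown => { readerCalls += 1; return ["note-1"]; };

const readOutcome = dispatch(session, { tool: "notesByPatient", mode: "read", tenantId: "t1", patientId: "p1", payload: null }, reader);
const writeOutcome = dispatch(session, { tool: "createNote", mode: "write", tenantId: "t1", patientId: "p1", payload: { text: "draft" } }, reader);
```

The tool author never writes the scope check, the audit entry, or the approval step. They only supply the read function.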

agent.tools.ts

// You write the typed function...
export const summarizeRecentNotes = clinical.defineTool({
  name: "summarizeRecentNotes",
  input: { patientId: id, windowDays: number },
  scope: "patient",        // ABAC, enforced
  mode: "read",            // writes are propose-only
  handler: async (a) => readModels.notesByPatient(a),
})

// ...and it IS an MCP tool. No second file.
// typed SDK method + HTTP endpoint + MCP tool,
// generated schema · scope · citations · freshness.

Side by side

Same question. Different floor.

The agent isn't smarter on bonfireDB. It's working against clean, connected, scoped data instead of a 3M-token federation dump.

Agent over raw FHIR

Crawls a multi-megabyte Bundle, runs out of context
Filters on display strings, misses the coded match
Stops chasing references, answers from fragments
No citation — you can't tell where the answer came from
Can mutate the record directly if you let it write
Ceiling ~50% answer correctness, best case

Agent over bonfireDB

Reads clean projections — timeline, latestScores
Queries by code, not by display text
References pre-resolved in the read model
Every result cited to its source record
Writes are propose-only; a clinician approves
Patient/tenant scope enforced on every call

Reliability is the stack, not the model

Typed tools alone don't move the 50% ceiling. What moves it, in the published measurements, is three levers used together: typed tools, a sandboxed code/SQL tool over the projections, and multi-turn interaction.

Typed tools — the agent calls named, schema-checked functions instead of free-handing FHIR queries.
Sandboxed code/SQL tool — over the clean projections, so the agent can compute, join, and aggregate without touching raw PHI plumbing.
Multi-turn — the agent inspects, refines, and verifies across turns instead of answering in one shot.

Drop any one lever and the gains collapse. The combination is the point.
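
To show how the three levers fit in one loop, here is a toy harness with a turn budget. The model is replaced by a scripted `Policy`, and the sandbox by a stub `runQuery` that executes no SQL. `runAgent`, `Harness`, and the `latest_scores` table name are hypothetical. The point is the control flow, not the components.

```typescript
type Action =
  | { kind: "tool"; name: string; args: Record<string, unknown> } // lever 1: typed tool call
  | { kind: "query"; sql: string }                                // lever 2: sandboxed code/SQL
  | { kind: "answer"; text: string };

interface Step { action: Action; observation: unknown }

interface Harness {
  tools: Record<string, (args: Record<string, unknown>) => unknown>;
  runQuery: (sql: string) => unknown; // stands in for the sandbox
}

// Stands in for the model: picks the next action from what it has seen so far.
type Policy = (steps: Step[]) => Action;

// Lever 3: multi-turn. Act, observe, and decide again until answering or running out of turns.
function runAgent(policy: Policy, harness: Harness, maxTurns: number): { answer: string; steps: Step[] } {
  const steps: Step[] = [];
  for (let turn = 0; turn < maxTurns; turn++) {
    const action = policy(steps);
    if (action.kind === "answer") return { answer: action.text, steps };
    let observation: unknown;
    if (action.kind === "tool") {
      const tool = harness.tools[action.name];
      if (!tool) throw new Error(`unknown tool: ${action.name}`);
      observation = tool(action.args);
    } else {
      observation = harness.runQuery(action.sql);
    }
    steps.push({ action, observation });
  }
  throw new Error("turn budget exhausted without an answer");
}

// Scripted run: inspect with a tool, verify with a query, then answer.
const scripted: Policy = (steps) => {
  if (steps.length === 0) return { kind: "tool", name: "latestScores", args: { patientId: "p1" } };
  if (steps.length === 1) return { kind: "query", sql: "SELECT max(value) FROM latest_scores" };
  return { kind: "answer", text: `answered after ${steps.length} checks` };
};

const stubHarness: Harness = {
  tools: { latestScores: () => [{ instrument: "phq9", value: 9 }] },
  runQuery: () => 9, // stub result; no SQL is executed in this sketch
};

const run = runAgent(scripted, stubHarness, 5);
```

Remove the query branch or set `maxTurns` to 1 and the scripted policy can no longer verify before answering, which is the shape of the "drop one lever" failure.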

session-prep.ts

// Batteries-included default tool
const ctx = await clinical.agent.sessionPrep({
  patientId,
  windowDays: 90,
  include: ["recentNotes", "assessments", "tasks"],
})
// -> cited, permission-aware context
//    ready to hand to the model

Defaults + custom

Batteries included. Then build your own.

Start with the tools every outpatient agent needs. Define new ones the moment your workflow needs them — each one is instantly an MCP tool.

sessionPrep

Cited, permission-aware context for the next visit — recent notes, assessments, open tasks — over a window you choose.

semanticSearch

Meaning-based search across clinical text, patient- and tenant-scoped, with every hit cited.

listByPatient

Clean, queryable projections of a patient's records — the read model your list screens and agents both use.

your custom tools

Define a typed clinical function for your specialty. Schema, scope, citations, freshness, MCP exposure — all generated.

In code

Define a tool. It's already an MCP tool.

No separate server, no hand-written JSON schema, no glue. The canonical SDK is the agent surface.

clinical.ts

// Typed clinical functions — each is an SDK method,
// an HTTP endpoint, AND an MCP tool.

await clinical.notes.create({ patientId, encounterId, text })
// returns the freshness lifecycle object

const ctx = await clinical.agent.sessionPrep({
  patientId, windowDays: 90,
  include: ["recentNotes", "assessments", "tasks"],
}) // cited, permission-aware

await clinical.fhir.export(patientId)
// clean FHIR R4 Bundle on demand — the escape hatch

Want a concrete agent target? The backend for an AI medical scribe shows how these scoped, cited tools wrap a real transcript-to-signed-note workflow end to end.

bonfireDB is early-stage and pre-launch. This page describes the product's design and positioning. The ~50% figure and reliability levers reflect published findings about agents over FHIR — best-agent answer correctness on FHIR-AgentBench, and read/write asymmetry on MedAgentBench — not a benchmark of bonfireDB itself. "FHIR" is a registered trademark of HL7®.

FAQ

Frequently asked questions

What is a FHIR MCP server and why not just use a canned one?

An MCP server exposes tools an LLM agent can call. bonfireDB is designed to compile every typed clinical function you write into an MCP tool — with schema, scope, citations, and freshness baked in — instead of shipping one fixed server. You choose which tools each agent sees, so a scheduling assistant and a clinical-summary agent get different surface areas.

Why do agents over raw FHIR cap around 50%?

Published findings (FHIR-AgentBench) show the strongest agent reaches roughly 50% answer correctness over raw FHIR — the best case, not the average. A real record is ~3M tokens of JSON, agents skip reference-chasing, and they filter on display text instead of codes. bonfireDB is designed to have agents read clean, scoped projections instead of a federation dump.

How does bonfireDB make agent-native clinical data safe by default?

Every tool is designed to be patient- and tenant-scoped automatically, writes are propose-only (the agent drafts, a human approves), every read is audited, and every result is cited to its source record. The safe path is the only path you write code for, rather than safety-by-reminder.

Do I have to write a separate MCP server file?

No. The design is one definition, three surfaces: write a typed clinical function once and the compiler emits a typed SDK method, an HTTP endpoint, and an MCP tool with a generated schema. The canonical SDK is the agent surface — no second file, no hand-written JSON schema, no glue.

What actually moves the reliability ceiling beyond the model?

Reliability is the stack, not the model. bonfireDB's design combines three levers used together: typed tools, a sandboxed code/SQL tool over clean projections, and multi-turn so the agent inspects and verifies. Drop any one and the gains collapse. Note: the ~50% figure reflects published FHIR agent research, not a benchmark of bonfireDB itself, which is pre-launch.

Build your own MCP. Don't ship a canned one.