Most LLM tutorials show structured output as a one-liner: pass a Pydantic model, get back validated JSON, ship it. In production with PHI on the line, that one-liner is the easy 20% of the problem. The other 80% is what happens when the schema validates but the data is still wrong, when the model returns JSON that passes type checks but contains hallucinated content, or when a field that should be redacted slips through because the schema only knew about its shape.
I worked on a clinical documentation pipeline last year where the LLM was extracting structured data from doctor-patient conversations. The schema validated. The integration tests passed. Then a real visit transcript came in where the patient mentioned a family member's medication and the LLM helpfully attributed it to the patient. Schema-valid. Clinically wrong. PHI-adjacent enough to trigger a compliance review.
This post covers the five patterns we ended up adopting. The HIPAA context made these patterns mandatory for us, but they apply to any production LLM pipeline where downstream systems trust the output.
Why "valid JSON" is not enough
Native structured outputs from OpenAI, Anthropic, and Google now guarantee schema conformance at the generation level. The API physically cannot return JSON that violates your schema. This is a real improvement over the prompt-and-pray approach from 2023.
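For concreteness, this is roughly what the one-liner looks like with the OpenAI Python SDK's structured-outputs helper (the other providers have equivalents; the transcript string and model snapshot here are placeholders, and `PatientMedication` is the Pydantic model defined just below):

```python
from openai import OpenAI

client = OpenAI()

completion = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[
        {"role": "system", "content": "Extract the medication discussed."},
        {"role": "user", "content": "Patient transcript goes here..."},
    ],
    response_format=PatientMedication,  # the Pydantic model defined below
)
medication = completion.choices[0].message.parsed  # schema-conformant, every time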
The problem is that schema conformance is a syntactic guarantee, not a semantic one. Take a schema that looks perfectly adequate:
```python
from pydantic import BaseModel, Field
from typing import Optional

class PatientMedication(BaseModel):
    name: str = Field(..., min_length=1)
    dosage: str
    frequency: str
    prescribed_for_patient: bool
```
The model will return JSON that fits this shape every time. What it does not guarantee is that `name` is actually a medication, that `dosage` is parseable, or that `prescribed_for_patient` reflects what was actually said. In healthcare, "valid JSON, wrong field assignment" is a clinical incident, not a parsing bug.
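To make that concrete, here is the failure mode from the visit transcript above, expressed as an output the schema happily accepts (the specific values are invented for illustration):

```python
# Passes every type check. The medication belongs to the patient's
# family member, mentioned in passing during the visit.
wrong_but_valid = PatientMedication(
    name="metformin",
    dosage="500mg",
    frequency="twice daily",
    prescribed_for_patient=True,  # wrong: nothing in the transcript supports this
)
```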
The patterns below assume you already use native structured outputs and Pydantic (or Zod for JS). They go beyond schema validation into the harder problem: making sure the structured output is also trustworthy.
Pattern 1: Constrain enums brutally, never accept freeform strings
The biggest single source of downstream bugs we saw was string fields that should have been enums. The model would return "twice daily", "2x per day", "BID", "every 12 hours" for the same input. All valid strings. All meaning the same thing. All breaking the billing system that expected one of five canonical values.
The fix is to type every categorical field as an enum and let the schema enforce the closed set:
```python
from enum import Enum
from typing import Optional

from pydantic import BaseModel, Field

class Frequency(str, Enum):
    ONCE_DAILY = "once_daily"
    TWICE_DAILY = "twice_daily"
    THREE_TIMES_DAILY = "three_times_daily"
    FOUR_TIMES_DAILY = "four_times_daily"
    AS_NEEDED = "as_needed"
    OTHER = "other"

class PatientMedication(BaseModel):
    name: str = Field(..., min_length=1, max_length=200)
    dosage: str = Field(..., max_length=100)
    frequency: Frequency
    frequency_raw: Optional[str] = Field(
        None,
        description="Original text from the transcript if frequency is OTHER",
    )
```
Two things matter here. First, the enum is the closed set the downstream system knows how to handle. The model has to pick one. Second, there's an `OTHER` escape hatch with a `frequency_raw` field that captures the original text. This is the difference between "schema validation forced the model to lie" and "schema validation surfaced an edge case for human review."
We saw a measurable drop in downstream parsing errors after we converted every categorical field to this pattern. Roughly 8% of medication records had frequency strings the billing system couldn't parse before. After the enum migration, that dropped to under 0.5%, and the remaining cases all landed in OTHER with the raw text preserved.
Pattern 2: Confidence and source-attribution fields
The schema-valid-but-wrong problem comes mostly from the model being confidently wrong. The mitigation is to make the model state its own uncertainty inside the schema. This sounds soft. It is not. When the model has to output a confidence score and an attribution field for every claim, the failure modes change.
```python
from typing import List, Literal

from pydantic import BaseModel, Field

class Citation(BaseModel):
    text_span: str = Field(..., description="Exact quote from transcript")
    speaker: Literal["doctor", "patient", "other"]

class ExtractedFact(BaseModel):
    field_name: str
    value: str
    confidence: float = Field(..., ge=0.0, le=1.0)
    citations: List[Citation] = Field(..., min_length=1)
    requires_human_review: bool

class VisitExtraction(BaseModel):
    medications: List[ExtractedFact]
    diagnoses: List[ExtractedFact]
    follow_up_actions: List[ExtractedFact]
```
Three rules we ended up enforcing:
- Confidence is a float, not a category. Categorical confidence ("high", "medium", "low") gets gamed by the model into all-high.
- Citations are required. Every extracted fact must include the exact span from the source. If the model cannot cite, it should not extract. This is much harder to fake.
- `requires_human_review` is computed by the model, not by you. Adding this field changes how the model reasons about its own outputs. We saw the model start flagging genuinely ambiguous cases that we had not anticipated.
The downstream system then routes anything with `confidence < 0.8` or `requires_human_review == True` to a review queue. This is the same pattern Collin Wilkins describes in Structured outputs for real pipelines, and it generalizes well beyond healthcare.
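A minimal routing sketch using the `ExtractedFact` model above; the queue names are hypothetical stand-ins for whatever sits on your side of the pipeline:

```python
def route_extraction(fact: ExtractedFact) -> str:
    # 0.8 is the threshold we settled on; tune it for your domain
    if fact.requires_human_review or fact.confidence < 0.8:
        return "human_review_queue"  # hypothetical queue name
    return "auto_pipeline"
```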
Pattern 3: Post-validation passes that schemas cannot express
Some validations are impossible to express in JSON Schema. A medication dosage of "500mg" is a valid string. A dosage of "500000mg" is also a valid string, but it would kill someone. The schema cannot tell the difference.
We layered a post-validation pass on top of Pydantic that runs domain-specific checks:
```python
from typing import Literal

from pydantic import BaseModel, field_validator

class PatientMedication(BaseModel):
    name: str
    dosage_amount: float
    dosage_unit: Literal["mg", "g", "mcg", "ml", "units"]
    frequency: Frequency  # the enum from Pattern 1

    @field_validator("dosage_amount")
    @classmethod
    def reasonable_dosage(cls, v: float) -> float:
        # Schema-valid but clinically implausible values fail here
        if v <= 0 or v > 10000:
            raise ValueError(
                f"Dosage {v} outside reasonable range. "
                "Flag for review."
            )
        return v
```
The validator does not just reject bad values; it raises a typed error that the pipeline catches and routes to the review queue with context. The model is not punished for surfacing the edge case; it is rewarded, because the review queue captures the original output and the validator error together.
A second class of validations needs cross-field logic. For example, if `route == "topical"` then `dosage_unit` cannot be `"units"`. These go in `model_validator` decorators rather than `field_validator`.
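A minimal sketch of that cross-field rule under Pydantic v2; the `route` field and its values are illustrative additions to the model, not something the earlier schemas defined:

```python
from typing import Literal

from pydantic import BaseModel, model_validator

class PatientMedication(BaseModel):
    name: str
    route: Literal["oral", "topical", "injection"]  # illustrative field
    dosage_unit: Literal["mg", "g", "mcg", "ml", "units"]

    @model_validator(mode="after")
    def route_matches_unit(self) -> "PatientMedication":
        # Cross-field logic JSON Schema cannot express:
        # "units" only makes sense for injectables
        if self.route == "topical" and self.dosage_unit == "units":
            raise ValueError(
                "Topical route cannot use 'units'. Flag for review."
            )
        return self
```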
Pattern 4: PHI redaction as a separate, schema-enforced step
This is the pattern specific to HIPAA, but the principle generalizes to anything with sensitive data. The instinct is to ask the LLM to extract structured data and redact PHI in the same call. This is a mistake. The model is good at one or the other, not both at once.
We split into two passes with separate schemas:
```python
from typing import List, Literal

from pydantic import BaseModel

# Pass 1: Extract structured clinical data, PHI preserved
class RawExtraction(BaseModel):
    medications: List[PatientMedication]
    diagnoses: List[Diagnosis]  # Diagnosis defined elsewhere in the pipeline
    raw_transcript: str

# Pass 2: De-identify, output PHI-free version
class PHISpan(BaseModel):
    category: Literal[
        "name", "date", "location", "phone",
        "address", "mrn", "other"
    ]
    text: str
    start: int
    end: int

class DeidentifiedOutput(BaseModel):
    redacted_transcript: str
    phi_spans_found: List[PHISpan]
    safe_to_log: bool
```
The first pass runs in a HIPAA-controlled environment with a BAA in place. The second pass is the gate before anything leaves that environment, including before any logs or traces reach observability tools. This separation matters because research on LLM-based PHI annotation shows that combining extraction and de-identification in one prompt degrades both tasks significantly.
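A sketch of the two-pass orchestration, assuming a generic `call_llm_structured(prompt, schema)` helper that wraps your provider's native structured-output API (the helper is hypothetical; substitute your own client wrapper):

```python
def process_visit(transcript: str) -> DeidentifiedOutput:
    # Pass 1: extraction inside the HIPAA-controlled boundary, PHI intact
    raw = call_llm_structured(
        prompt=f"Extract clinical data from this visit:\n{transcript}",
        schema=RawExtraction,
    )

    # Pass 2: de-identification is the only gate out of that boundary
    deidentified = call_llm_structured(
        prompt=f"Find and redact all PHI:\n{raw.raw_transcript}",
        schema=DeidentifiedOutput,
    )

    # Nothing (logs, traces, storage) crosses the boundary until the gate agrees
    if not deidentified.safe_to_log:
        raise RuntimeError("De-identification pass flagged unsafe output")
    return deidentified
```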
For organizations building this kind of pipeline, the AI medical scribe architecture guide covers the broader integration patterns including FHIR write-back and EHR boundaries. The point for this post is narrower: never let a single model call be responsible for both producing structured output and protecting sensitive data. The schemas have to be separate, and the second pass has to be the gate.
Pattern 5: Replay and diff outputs against schema versions
Schemas evolve. A field that was optional becomes required, an enum gains a new value, a confidence threshold tightens. In a regulated environment, you cannot just deploy a new schema version and lose the ability to reproduce older outputs.
We version schemas the way we version database migrations:
```python
from datetime import datetime
from typing import Dict, List, Optional

from pydantic import BaseModel, ValidationError

class SchemaVersion(BaseModel):
    version: str  # semver: "1.2.0"
    deployed_at: datetime
    breaking_changes: List[str]

class StoredExtraction(BaseModel):
    schema_version: str
    raw_llm_output: str  # the JSON string before parsing
    parsed: Dict
    extraction_timestamp: datetime
    model_id: str  # e.g. "gpt-4o-2024-08-06"

    def replay_with_current_schema(
        self, current_schema_class: type[BaseModel]
    ) -> tuple[Optional[BaseModel], List[str]]:
        """Try to parse stored output with the current schema; return errors."""
        try:
            parsed = current_schema_class.model_validate_json(
                self.raw_llm_output
            )
            return parsed, []
        except ValidationError as e:
            return None, [str(err) for err in e.errors()]
```
This sounds like over-engineering. It is, right up until the day the compliance team asks you to produce all extractions that used a specific schema version because a downstream system had a bug. With this pattern, you store the raw output and can re-parse it against any historical schema. Without it, you have to re-run the LLM, which is expensive, non-deterministic, and arguably a compliance violation on its own (because you are now producing new PHI processing events for an audit).
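The audit scenario then becomes a query plus a replay, assuming a hypothetical `load_extractions_by_version` lookup against wherever you store `StoredExtraction` records:

```python
# Compliance: "show every extraction made under schema 1.1.x and
# whether it still parses under the current schema"
failures = []
for stored in load_extractions_by_version("1.1"):
    parsed, errors = stored.replay_with_current_schema(VisitExtraction)
    if errors:
        failures.append((stored.model_id, stored.extraction_timestamp, errors))
```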
Schema versioning also makes A/B testing schemas safe. You can deploy v1.2.0 to 10% of traffic, compare extraction quality against v1.1.0, and roll back without losing data. There's a good Dev.to thread on Pydantic patterns for production that covers some of this, though the regulated-environment angle is less common.
What this looks like in production
Putting the five patterns together, our extraction pipeline became:
- LLM call with native structured outputs and a versioned Pydantic schema (Pattern 1, Pattern 5)
- Pydantic validation catches schema violations (with native structured outputs this should always pass, so a failure here is itself a signal that something upstream changed)
- Post-validation pass with domain-specific range checks and cross-field logic (Pattern 3)
- Confidence and review-flag check routes uncertain extractions to human queue (Pattern 2)
- Separate de-identification pass with PHI schema before anything leaves the controlled environment (Pattern 4)
- Storage with raw output and schema version for replay (Pattern 5)
The cost is roughly 2x the latency and 1.5x the API spend per visit compared to a naive single-call extraction. The benefit is that the compliance team stopped asking "how do you know the output is correct" and started asking smaller, specific questions we could answer with audit trails.
If you are building anything similar, the order I would recommend: enums first (cheapest fix with the biggest quality jump), then confidence and citations (forces the model to reason about its own uncertainty), then post-validation, then schema versioning, then the de-identification split if you have a regulated environment. Doing them in this order means you ship value at every step rather than waiting until everything is in place.
Wrapping up
Strict schemas are necessary but not sufficient for production LLM outputs. The schema makes the JSON valid; the patterns above make the content trustworthy. In a HIPAA environment that distinction is the difference between shipping and getting blocked at compliance review. In other environments it is the difference between an LLM feature that ships once and one that survives a year of production traffic.
The patterns are not specific to healthcare. Replace "PHI" with "customer financial data" or "personal information" and the same five patterns apply. The constraint of working with PHI just made it cheaper for us to do this work because the alternative was much more expensive.