Last month I published a 184M-parameter intent classifier that matches frontier LLMs at 22× lower latency. The story was clean: small specialized model, narrow task, comparable accuracy, much faster, almost free per inference. People liked it.
The second model in the ClarioScope SLM Suite tells a more complicated story. It's a PHI detector — a token classifier that tags spans of protected health information in inbound patient text across all 18 HIPAA Safe Harbor identifier categories. On the macro-F1 headline number, it loses to Claude Sonnet 4.6: 0.63 vs 0.89. On Claude Haiku 4.5: 0.63 vs 0.85. On GPT-4o: 0.63 vs 0.81.
So the click-through headline isn't "matches frontier." It's: on aggregate, frontier wins. But the macro number hides what's actually happening, and the per-entity breakdown reveals something more interesting than either "small model wins" or "small model loses."
Model on Hugging Face: raihan-js/clarioscope-phi-deberta-v1.
The 125M-parameter fine-tune beats or matches every frontier model on linguistic entities — geographic locations, ages, person names, dates, phone numbers, fax numbers, IP addresses. It loses badly on structured-ID entities — MRNs, license numbers, health-plan IDs, device serial numbers. The right production architecture is not one or the other. It's hybrid. This post is the methodology, the benchmark, and the honest interpretation.
The task: span detection across all 18 HIPAA Safe Harbor categories
Patient text comes in messy — "Hi Dr. Okafor, this is Iniko Adeleke, DOB 11/03/1985, MRN OMK-44291, phone 312.555.7820, my partner's email is jordan.holloway@workmail.io. We live in Brookline." A PHI detector has to locate every individually identifying span: two names, a date, an MRN, a phone, an email, a location. Seven spans, six different entity types, in a single short message.
The HIPAA Safe Harbor rule defines 18 categories of PHI. The ClarioScope model tags all of them: NAME, LOC, DATE, PHONE, FAX, EMAIL, SSN, MRN, HEALTH_PLAN, ACCOUNT, LICENSE, VEHICLE, DEVICE, URL, IP, BIOMETRIC, PHOTO_REF, AGE_OVER_89. The architecture is standard: RoBERTa-base encoder, a token-classification head outputting BIO tags across 37 labels (one O plus 18 entity types times {B-, I-}).
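For concreteness, here's how that label space assembles. A sketch only: the authoritative ordering is whatever the published config's `id2label` says; this shows the arithmetic, not the exact index assignment.

```python
# A sketch of the 37-label BIO scheme described above.
ENTITY_TYPES = [
    "NAME", "LOC", "DATE", "PHONE", "FAX", "EMAIL", "SSN", "MRN",
    "HEALTH_PLAN", "ACCOUNT", "LICENSE", "VEHICLE", "DEVICE", "URL",
    "IP", "BIOMETRIC", "PHOTO_REF", "AGE_OVER_89",
]

labels = ["O"] + [f"{p}-{ent}" for ent in ENTITY_TYPES for p in ("B", "I")]
assert len(labels) == 37  # 1 O tag + 18 types x {B-, I-}

id2label = dict(enumerate(labels))
label2id = {lab: i for i, lab in id2label.items()}
```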
A side note before going further: the repo is named clarioscope-phi-deberta-v1 because the original plan was DeBERTa-v3-base. During training, DeBERTa-v3 reproduced a NaN-gradient bug specific to this 37-label token classification setup — forward pass loss healthy, backward pass NaN on the first step, across fp16, bf16, and fp32, with explicit classifier head re-init and gradient clipping. After three afternoons of trying to keep DeBERTa alive, I switched to RoBERTa-base, which trained stably with the same training script. The repo name is kept for URL stability and the model card calls it out at the top.
Why not just call the API
The same three reasons as last time, with slightly different weights:
Privacy. PHI is the canonical "you'd want this to never leave your infrastructure" data class. A frontier API with a Business Associate Agreement is one option, but BAAs aren't free, aren't available at every tier, and add legal complexity. A self-hosted model never sends the patient's address or DOB to a third party.
Latency. The fine-tuned model runs in 28.6 ms on a CPU. Frontier API calls from my Bangladesh ISP run 1,000–2,000 ms. For redaction-before-routing where every inbound message has to be processed before it can be displayed in an inbox, that wall-clock floor matters.
Cost. Claude Sonnet 4.6 with the PHI-extraction prompt costs $2.53 per 1,000 inferences. Haiku is $1.00 per 1K. GPT-4o is $1.64 per 1K. The fine-tuned model is $0 per inference after training. For a practice receiving 10K messages per day, the math is the same as last time: $3,650–$9,234 per year on frontier vs roughly free on the local model.
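A quick sanity check on that range, using the per-1K prices quoted above:

```python
# Back-of-envelope check of the annual numbers (prices per 1K from above).
MSGS_PER_DAY = 10_000
PER_1K = {"haiku": 1.00, "gpt-4o": 1.64, "sonnet": 2.53}

for name, rate in PER_1K.items():
    annual = MSGS_PER_DAY / 1000 * rate * 365
    print(f"{name}: ${annual:,.0f}/year")
# haiku: $3,650/year, gpt-4o: $5,986/year, sonnet: $9,234/year
```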
The benchmark
The headline numbers, on a 548-example held-out test set, with entity-level F1 measured by seqeval (which requires both entity type AND exact span boundary to match for a true positive):
| Model | Macro F1 | Weighted F1 | Latency / example | Cost / 1K inferences |
|---|---|---|---|---|
| raihan-js/clarioscope-phi-deberta-v1 (CPU) | 0.6301 | 0.7639 | 28.6 ms | $0.00 |
| claude-haiku-4-5-20251001 | 0.8492 | 0.9213 | 1294 ms | $1.00 |
| claude-sonnet-4-6 | 0.8946 | 0.9396 | 1980 ms | $2.53 |
| gpt-4o-2024-11-20 | 0.8094 | 0.8912 | 1111 ms | $1.64 |
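For calibration on what these scores mean: seqeval credits an entity only when both the type and the exact boundary match, so a shifted or partial span is a full miss. A toy example:

```python
# Toy illustration of the strict entity-level scoring rule.
from seqeval.metrics import f1_score

y_true = [["O", "B-NAME", "I-NAME", "O", "B-MRN"]]
y_pred = [["O", "B-NAME", "I-NAME", "O", "O"]]  # NAME exact, MRN missed

print(f1_score(y_true, y_pred, average="macro"))
# 0.5: NAME scores 1.0, MRN scores 0.0
```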
If you stop reading here, the takeaway is: "frontier wins; the small model is 45× faster but trails by 18–26 points of macro F1." That's true, and it's the honest aggregate. But it's not the interesting part.
The interesting part: per-entity F1
Same benchmark, broken out by entity type, sorted by the fine-tuned model's F1 (best to worst):
Linguistic entities — small model matches or beats frontier:
| Entity | This model | Haiku | Sonnet | GPT-4o |
|---|---|---|---|---|
| PHONE | 0.983 | 1.000 | 0.994 | 1.000 |
| AGE_OVER_89 | 0.976 | 0.967 | 0.967 | 0.836 |
| NAME | 0.961 | 0.996 | 0.994 | 0.980 |
| IP | 0.949 | 1.000 | 1.000 | 0.967 |
| FAX | 0.949 | 1.000 | 0.984 | 1.000 |
| DATE | 0.945 | 0.949 | 0.970 | 0.909 |
| LOC | 0.818 | 0.328 | 0.289 | 0.301 |
LOC is the standout. The fine-tuned model nearly triples the frontier APIs' F1 on geographic locations. Frontier models systematically under-flag informal location mentions like "she lives in Allston" or "at the Roxbury location" — their pretraining seems to have left them uncertain about whether informal context cues count as PHI. A specialized model trained explicitly to tag these does not hesitate.
AGE_OVER_89 is another quiet win. Frontier models occasionally tag ages 89-and-under as PHI (they aren't, under Safe Harbor) or miss the "over 89" qualifier ("she's 96") that determines whether the age is reportable. The fine-tuned model learned the rule directly from the training distribution.
For names, dates, phone numbers, fax numbers, and IPs, the gap between this model and frontier is 1–5 percentage points. Within margin-of-noise for production use.
Structured-ID entities — frontier wins, often dominantly:
| Entity | This model | Haiku | Sonnet | GPT-4o |
|---|---|---|---|---|
| EMAIL | 0.815 | 1.000 | 1.000 | 1.000 |
| ACCOUNT | 0.759 | 0.985 | 0.969 | 1.000 |
| URL | 0.738 | 0.967 | 0.967 | 0.931 |
| VEHICLE | 0.640 | 1.000 | 1.000 | 0.970 |
| SSN | 0.583 | 0.983 | 0.949 | 0.915 |
| DEVICE | 0.341 | 0.732 | 1.000 | 0.800 |
| MRN | 0.276 | 1.000 | 1.000 | 0.997 |
| HEALTH_PLAN | 0.264 | 0.855 | 0.983 | 0.717 |
| LICENSE | 0.170 | 1.000 | 1.000 | 0.933 |
| BIOMETRIC | 0.095 | 0.410 | 1.000 | 0.314 |
Frontier wins these by enormous margins. The fine-tuned model scores 0.28 F1 on MRN; Haiku and Sonnet score a perfect 1.000.
The reason is straightforward once you stare at the data. Structured-ID entities follow surface conventions that vary wildly between institutions and generators. An MRN might look like OMK-44291, RMR-882034, DENT-12345-A, or just 8472301. The training generator produced one distribution of ID formats; the test set, generated by a different model, uses a different one. The fine-tuned model can only recognize what it saw during training, and when a test-set MRN doesn't match the training conventions, the model either misses it entirely or produces a span boundary that's off by one token (which, under seqeval's strict matching, counts as a miss).
Frontier models win these categories because they've seen a much wider distribution of ID formats during pretraining, and because their attention mechanism is strong enough to anchor an ID span to its context cue — "MRN" or "member ID" or "license #" — regardless of the specific token pattern that follows it.
This is a real limitation. It's also a reasonable one to live with, because there's a much cheaper way to catch structured-ID PHI than running a frontier API on every message.
The bug that wasn't (and the bug that was)
A first version of the model trained on the raw generated annotations and scored 0.57 macro F1 on test — worse than what's shipping now. The expected explanation was distribution shift between train and test. The actual explanation was simpler and more embarrassing.
The training data had systematic label noise: the data-generation LLM was returning entity texts that included the cue word that introduced the entity. The annotation for "MRN 8472301" came back as {"text": "MRN 8472301", "label": "MRN"} instead of {"text": "8472301"}. The literal word "SSN" was annotated as an SSN entity in six different training examples. About 8.6% of all training spans (1,676 of them) had this kind of cue-word contamination. The Claude-generated test set, with a stricter prompt, had two such cases out of 1,632 spans.
So the model wasn't being graded against the distribution it learned from. It learned that the literal token "MRN" was part of an MRN span; the test set said it wasn't.
A clean_data.py script in the repo strips known cue-word prefixes ("MRN ", "SSN ", "phone ", "account #") from entity texts, re-locates the cleaned text in the inquiry, and drops entities that no longer have a valid span. Importantly, it preserves natural prefix characters like the opening ( in (617) 555-1234 — a first version of the cleanup stripped parens too aggressively and tanked PHONE F1 from 0.94 to 0.51 in one run. The fix was to apply punctuation stripping only after a cue word had been detected and removed, not as a generic "trim leading punctuation" pass.
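A minimal sketch of that cleanup logic (the repo's clean_data.py is the authority; the cue-word list here is illustrative):

```python
# Hypothetical cue-word list; the repo's clean_data.py carries the real one.
CUE_PREFIXES = ("MRN ", "SSN ", "phone ", "account #", "fax ", "member ID ")

def clean_entity(text: str, inquiry: str) -> str | None:
    """Strip a leading cue word, then re-locate the cleaned text in the inquiry.

    Returns the cleaned entity text, or None if no valid span remains.
    """
    cleaned = text
    for cue in CUE_PREFIXES:
        if cleaned.lower().startswith(cue.lower()):
            cleaned = cleaned[len(cue):]
            # Punctuation is trimmed ONLY after a cue was removed, so natural
            # prefixes like the "(" in "(617) 555-1234" survive untouched.
            cleaned = cleaned.lstrip(":#- ")
            break
    return cleaned if cleaned and cleaned in inquiry else None

print(clean_entity("MRN 8472301", "My MRN 8472301 please"))   # "8472301"
print(clean_entity("(617) 555-1234", "Call (617) 555-1234"))  # unchanged
```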
The cleanup recovered about 4 percentage points of test macro F1 and (more interestingly) flipped two entity types from "loses badly" to "competitive": EMAIL went from 0.55 to 0.82, ACCOUNT from 0.21 to 0.76.
The deeper lesson is well-known but easy to forget: synthetic data is fast and cheap, and the cost shows up later as systematic noise that nobody else will catch for you. The annotations themselves need a QA pass before training. Real-world data has its own noise problems, but it tends not to label cue words as entities, because human annotators don't.
Preventing benchmark leakage
Same trick as model 1: training set was generated by gpt-4o-mini-2024-07-18; the held-out test set was generated by Claude with a deliberately different prompt style. This cross-generator split mitigates the failure mode where a fine-tuned model just learns one generator's style and the benchmark inflates.
A side effect on this model specifically: the Claude-generated test set uses tighter, more uniform structured-ID formats than the training set. That's part of why the test F1 on structured-ID entities is harsher than the val F1 (val: 0.86 macro; test: 0.63 macro). It's also fair, because real-world MRN formats are at least as varied as the gap between the two generators — and probably more so.
The recommended production architecture: hybrid
Given the per-entity breakdown, the right architecture is not "use this model alone" or "use a frontier API alone." It's a three-stage pipeline (a routing sketch follows the list):

1. Run this model first on every inbound message. ~30 ms on a CPU, $0, never sends text off-host. Captures `NAME`, `LOC`, `DATE`, `PHONE`, `FAX`, `IP`, and `AGE_OVER_89` reliably — that's most of the volume.
2. Add regex matchers for highly structured patterns the model misses: SSN (`\d{3}-\d{2}-\d{4}`), credit-card numbers, basic MRN/account patterns specific to your practice's conventions. Regex is fast, free, and brittle — but correct when it matches.
3. Fall back to a frontier API only when the message contains likely structured-ID content the local pipeline didn't resolve, or when downstream confidence is needed. This pays the latency and dollar cost only on a small fraction of traffic.
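A sketch of the stage-2 and stage-3 gating. The cue list and patterns are assumptions for illustration, not the post's tested configuration; stage 1 is the extraction snippet shown later in this post.

```python
import re

SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
STRUCTURED_CUES = re.compile(
    r"\b(MRN|SSN|member\s*id|license|policy|account|serial)\b", re.IGNORECASE
)
STRUCTURED_LABELS = {"MRN", "SSN", "LICENSE", "HEALTH_PLAN", "ACCOUNT", "DEVICE", "VEHICLE"}

def regex_spans(message: str) -> list[dict]:
    """Stage 2: rigid patterns the local model tends to miss."""
    return [{"text": m.group(), "label": "SSN"} for m in SSN_PATTERN.finditer(message)]

def needs_frontier(message: str, local_spans: list[dict]) -> bool:
    """Stage 3 gate: escalate only when a structured-ID cue appears in the
    text but stages 1-2 resolved no structured-ID span."""
    found = {s["label"] for s in local_spans}
    return bool(STRUCTURED_CUES.search(message)) and not (found & STRUCTURED_LABELS)

# Cue word present, nothing structured resolved locally -> pay for the API call.
print(needs_frontier("my MRN is QZK-5521", local_spans=[]))  # True
print(needs_frontier("see you Tuesday", local_spans=[]))     # False
```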
For a practice receiving 10,000 messages per day where roughly 10% have unresolved structured-ID content after stages 1 and 2, the hybrid sends 1,000 calls per day to Haiku ($1.00/day) instead of 10,000 ($10.00/day). Most messages never leave the host. Latency is bounded by the local model. The frontier model becomes a "structured-ID specialist" rather than a "PHI redaction generalist."
This is the actual cost-effective answer in 2026. The small model is not a frontier replacement; it's a frontier accelerator.
The cost ledger
| Item | Cost |
|---|---|
| 9,500 synthetic training examples via OpenAI (gpt-4o-mini) | ~$1.40 |
| RunPod RTX A4000 pod (two training runs, ~25 min total) | ~$1.50 |
| Benchmark API calls (Haiku + Sonnet + GPT-4o on 548 examples × 2 runs) | ~$5.20 |
| Hugging Face hosting | $0 |
| Total | ~$8.10 |
A bit more expensive than model 1 because the benchmark ran twice — once before the data cleanup, once after. The cleanup story was a four-percentage-point F1 improvement, which seems small but matters for the "where does this model lose" interpretation of the per-entity numbers.
How to use it
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_id = "raihan-js/clarioscope-phi-deberta-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)
model.eval()

text = "Hi Dr. Okafor, this is Iniko Adeleke, DOB 11/03/1985. phone 312.555.7820, email jordan@workmail.io. I live in Brookline."

# Keep character offsets so token predictions map back to spans in the text.
enc = tokenizer(text, return_offsets_mapping=True, return_tensors="pt", truncation=True, max_length=256)
offsets = enc.pop("offset_mapping")[0].tolist()

with torch.no_grad():
    pred_ids = model(**enc).logits.argmax(dim=-1)[0].tolist()

# Decode BIO tags into character-level spans: a B- tag opens an entity and
# consecutive matching I- tags extend it.
id2label = model.config.id2label
spans = []
i = 0
while i < len(pred_ids):
    label = id2label[pred_ids[i]]
    if label.startswith("B-"):
        ent_type = label[2:]
        start = offsets[i][0]
        end = offsets[i][1]
        j = i + 1
        while j < len(pred_ids) and id2label[pred_ids[j]] == f"I-{ent_type}":
            end = offsets[j][1]
            j += 1
        spans.append({"text": text[start:end], "label": ent_type})
        i = j
    else:
        i += 1

for s in spans:
    print(s)
# {'text': 'Dr. Okafor', 'label': 'NAME'}
# {'text': 'Iniko Adeleke', 'label': 'NAME'}
# {'text': '11/03/1985', 'label': 'DATE'}
# {'text': '312.555.7820', 'label': 'PHONE'}
# {'text': 'jordan@workmail.io', 'label': 'EMAIL'}
# {'text': 'Brookline', 'label': 'LOC'}
```
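From there, redaction is a straightforward splice over the detected spans. A sketch, assuming each span dict also carries the character offsets ("start", "end") that the decoding loop above already has in hand:

```python
# Masking sketch; assumes spans carry "start"/"end" character offsets.
def redact(text: str, spans: list[dict]) -> str:
    out = text
    # Splice right-to-left so earlier offsets stay valid after each replacement.
    for s in sorted(spans, key=lambda s: s["start"], reverse=True):
        out = out[: s["start"]] + f"[{s['label']}]" + out[s["end"] :]
    return out

# -> "Hi [NAME], this is [NAME], DOB [DATE]. phone [PHONE], email [EMAIL]. I live in [LOC]."
```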
Limitations
The model card has the full list. The ones worth surfacing here:
- All training and evaluation data is synthetic. No real production validation yet. A real-world calibration pass is required before deployment.
- Structured-ID entities are weak. Per the benchmark, this model is materially worse than frontier APIs on `MRN`, `LICENSE`, `HEALTH_PLAN`, `BIOMETRIC`, and several others. Pair it with regex and/or a frontier fallback.
- Not a HIPAA compliance verdict. This model tags entity types as defined in the Safe Harbor rule. HIPAA compliance is a regulatory determination that a model can't make on its own.
- English only, healthcare practice domain only.
What's next
This is model 2 of three. The third is clarioscope-insurance-v1 — structured JSON extraction of insurance- and billing-relevant fields from inbound text. Probably a small encoder-decoder with constrained decoding. When all three are published, they'll go up as a Hugging Face collection with a single longer post tying the suite together.
The honest takeaway from this one: small specialized models don't always beat frontier on aggregate, but the per-entity breakdown is where the actual production decision lives. Frontier-or-nothing is the wrong frame. Frontier-as-fallback-to-a-cheap-local-model is the right one.
Follow along on Hugging Face or GitHub.