We Tested 25 Local LLMs for Medical Use. Here’s What Shipped.

Over the past year at Meda AI, we’ve been building a medical AI assistant that turns doctor-patient audio into SOAP notes, ICD-10-GM codes, and billing entries. The questions we hear most from doctors are: What about data privacy? Can you make sure our data is secure? On-premise deployment with 100% local AI would answer many of these questions.

Our constraint is simple: a hallucinated patient age, a wrong dosage, or a fabricated allergy is not a funny glitch. It’s a patient safety incident.

So, how do you actually pick a local LLM for clinical work? This post walks through the seven questions we asked ourselves, in the order we asked them, and the answers we landed on after five days of benchmarking. It is not a standardized benchmark: we mostly worked with an AI agent to analyze the results, and checked them against our real data to get a feel for which model works best for our use case.

TL;DR

Yes, current LLMs are good enough to deploy a medical AI assistant locally. An M3 Ultra will give you 100+ tok/s with the right settings. Don’t fixate on MLX models, because a newer GGUF model may give you better quality with an acceptable speed penalty. Test models independently for each pipeline stage. We recommend non-thinking models, because the quality gained from reasoning is not worth the added latency.

I. Is Medical AI Ready for 100% Local Deployment?

Short answer: yes, but there’s no one-size-fits-all solution. Three realities made us confident that a fully on-premise stack is viable today:

Hardware caught up. A Mac Studio M3 Ultra with 96 GB unified memory can comfortably host several production models with room to spare.

Small-to-midsize models got good. In our tests, a small dense model with 8–12B parameters reaches zero hallucinations on structured text extraction, and MoE models can deliver cloud-grade clinical reasoning on a single workstation.

On-premise setup makes GDPR easier.
For a German GP practice, “just send it to OpenAI” is not a legal option, so the question is less “is local ready?” and more “which local stack holds up?”

The caveat: a local LLM is not a drop-in replacement for a clinician’s judgment. We’ll come back to this in section VII when we talk about the two-layer architecture.

II. Which Hardware and Which Models Are Suitable?

For hardware, we tested on two machines:

Mac Studio M3 Ultra, 96 GB unified memory. Our primary target. Great for multi-model serving because memory is shared across models. What we found:

Memory bandwidth is the decode wall, not compute. The M3 Ultra’s ~800 GB/s memory bandwidth caps Gemma 12B decode at ~56–70 tok/s. We tried several combinations of optimization flags (KV-cache quantization, prefill chunk size, GPU memory utilization fraction) and four frameworks (raw MLX, mlx_lm, rapid-mlx, LM Studio). All four land within ±2 tok/s of each other. There is no software fix; the ceiling is silicon.

Concurrency reclaims most of the loss. Three concurrent requests on rapid-mlx push system throughput to 110 tok/s (a 1.83× wall-clock speedup), and 3× model instances at parallel=8 reach 205 tok/s for Orchestrator-8B and 517–667 tok/s for LFM2-8B. So the M3 is slow per request but scales acceptably when multiple doctors use the box at once.

VLM (MLX) models are second-class citizens (for now). LM Studio enforces parallel=1 on every Vision-Language Model, including the otherwise excellent Gemma 4 E4B MLX build. Under 3× concurrent load it succeeds on only 2 out of 6 requests. For real concurrency you need a text-only checkpoint or a GGUF build via llama.cpp.

Threadripper 7980X + NVIDIA 5090 (32 GB VRAM). The non-concurrent setup. Great for single-model comparison, but it lacks the memory for multi-model or concurrent load. What we found:

Performance is genuinely viable for single-doctor deployments.
Gemma 12B at 116 tok/s on llama.cpp matches or beats most cloud APIs on a single-stream basis, with no per-token cost.

The 5090’s 32 GB VRAM caps model size more aggressively than the M3 Ultra’s 96 GB unified pool. A Threadripper deployment can host one quality-critical model on the GPU and offload the rest to DDR5, which is a different tradeoff from the Mac Studio’s “fit everything in unified memory” model.

On cost versus (acceptable) performance, the Mac Studio M3 Ultra wins.

III. Which Medical AI Cases Did We Test?

Before benchmarking anything, we broke the problem into three stages. Mixing these into a single prompt is the main mistake we saw other teams make.

Stage 1: Transcript → SOAP/Summary. The scribe layer. Extract what was said. Nothing more.
Stage 2: SOAP → ICD-10-GM. The clinical reasoning layer.
Stage 3: Medical Context → billing codes. The biller/revenue-optimizing layer.

Each stage demands a different model behavior:

The scribe must be neutral (a transcriptionist, not a doctor).
The coder must reason clinically (a medical coder, not a transcriptionist).
The biller must be fast and structured (a technical form-filler).

For test data, we used four transcript lengths to expose different failure modes:

2K characters: a simple sinusitis visit. Tests baseline extraction.
5K characters: a diabetes follow-up. Tests lab value handling.
10K characters: a disc herniation workup. Tests anatomical vocabulary.
15K characters: a polytrauma case with multiple comorbidities and a drug allergy. Tests long-context recall and drug-interaction flagging.

The polytrauma transcript was our main filter. If a model dropped the medical context at 15K, it didn’t make the next round.

IV. How Many Models Did We Try, and Which Were Disqualified Instantly?

We screened candidates against four hard constraints before any benchmarking:

Fit in under ~25 GB of memory each.
This lets us run 3+ models concurrently in the 96 GB unified pool with headroom for KV cache, the OS, Whisper, and supporting services. It also lets us test on the 5090 GPU.

Strong German-language pretraining. Clinical transcripts are in German, with regional medical vocabulary and compound terms (“Herzinsuffizienz”, “Pankreatitis”). Models trained primarily on English data fabricated German medical words that don’t exist (we saw “Alltagsabfälligkeit”, “Diabtemetabolismus”, “Nasenschnaufe”).

Licensed for commercial on-prem use. Research-only or non-commercial licenses are non-starters for a production clinical deployment.

OpenAI-compatible serving. LM Studio (MLX), llama.cpp (GGUF), and vLLM all expose /v1/chat/completions, which lets us swap models without rewriting service code.

That narrowed the field to candidates spanning Gemma 3 / Gemma 4, Qwen 3 / Qwen 3.5 / Qwen3-30B A3B MoE, LFM2 8B and 24B MoE, OLMo 3, Falcon H1R, Teuken, Apertus, Dolphin 3, GPT-OSS 20B, ERNIE 4.5 21B A3B, Trinity Mini 32B, GroveMOE, and the NVIDIA Orchestrator family.

We tested 25 models in total. Eight were cut in the first pass. Here’s the tour of rejects, worth reading because each one fails in a characteristic way you’ll see in other models too.

Teuken 7B: fast at 95 tok/s, then entered an infinite repetition loop. Also fabricated alcohol history.
OLMo 3 7B: fastest SOAP generation, but produced medical words that don’t exist.
Falcon H1R 7B: spent every token on an internal thinking monologue. Zero SOAP output. Every time.
Apertus 8B: repeated the same sentence five times verbatim in the Subjective section.
Dolphin 3.0 8B: invented patient ages. Omitted allergies at longer contexts.
LFM2 24B-A2B: hallucinated an anatomical structure. Disqualifying, full stop.
Gemma 3 27B QAT: the best raw quality we saw, but OOM-crashed on 96 GB under concurrent load. One doctor works; two doctors crash the box.
Gemma 3 4B VLM: invented ages repeatedly.
Repeatable hallucination.

Pattern to watch for: speed leaders are often quality losers, and “thinking mode” models often produce no visible output unless configured correctly.

Three softer criteria turned out to predict production fitness better than raw benchmark scores:

Architecture without a thinking-mode tax. Models that route output through a reasoning chain (Qwen 3.5 9B, Falcon H1R, NVIDIA Orchestrator) spend 38–82% of their tokens on internal reasoning before emitting any JSON. For temperature-0 structured extraction this is pure overhead: Qwen 3.5 9B in thinking mode takes 68 seconds of wall time to produce ~400 tokens of usable output. Models without thinking, or with a working /nothink switch (Orchestrator, Gemma 3, Gemma 4, GPT-OSS 20B, LFM2), are the only viable choices on the hot path.

Sparse activation (MoE) for memory efficiency. Mixture-of-Experts models like LFM2-8B-A1B (1B active of 8B), Qwen3-30B A3B (~3B active of 30B), and GPT-OSS 20B trade a higher disk footprint for much lower per-token compute. Qwen3-30B A3B fits in ~18 GB and decodes at MoE-class speed despite its 30B nominal size. That’s the difference between “one quality model and you’re full” and “three models concurrent.”

Runtime portability. MLX is fast on Apple Silicon, but the runtime is brittle: mlx_vlm 0.4.0 had no gemma4 module, and LM Studio 0.0.9-1’s daemon was missing the qwen3_moe architecture entirely. Both blockers vanished the moment we switched to GGUF + llama.cpp. A model is only as deployable as its weakest runtime, so pick architectures that have clean GGUF releases.

V. Summary of Initial Findings: How Did We Narrow Further?

After the mass-casualty round, we ranked the survivors on three axes: hallucination rate, tokens per second under concurrency, and memory footprint. Five standouts emerged in the first sweep:

LFM2 8B-A1B: Very fast (136–204 tok/s). A good candidate for the scribe role, since it did not invent anything, but it sometimes skips the Objective section.
Gemma 3 12B QAT: Second-best option after Gemma 3 27B QAT. It still survives concurrency, with some quality degradation but no hallucination.
Qwen family: NVIDIA Orchestrator 8B and Qwen 3.5 9B are great candidates, as long as we can turn off thinking.

But two of the most promising models didn’t show up in this first sweep, because they wouldn’t even load on MLX. We had focused on speed and MLX optimization, and initially discarded models that didn’t have an MLX variant. Once we opened up to other options, the story got interesting.

The Runtime Pivot: MLX → GGUF

Two models we really wanted to test failed at the runtime layer, not the quality layer:

Gemma 4 E4B (MLX 4-bit): Our default model is Gemma 3, so when Gemma 4 was released we wanted to try it right away. However, mlx_vlm 0.4.0 in our LM Studio build had no gemma4 module. The model is also a Vision-Language Model, which meant that even after patching the runtime, LM Studio enforced parallel=1: only 2 out of 6 requests succeeded at 3× concurrency. Unusable in production.

Qwen3-30B A3B (MLX): LM Studio 0.0.9-1’s daemon was missing the qwen3_moe architecture. The model files were on disk but never discovered.

The fix was a tradeoff we were willing to make: drop MLX and use GGUF via llama.cpp. Both models have clean GGUF releases, and we could accept the drop in performance for this test.

Once we switched runtimes, the ranking changed meaningfully:

Gemma 4 E4B: 5.0 GB on disk, 128K context window, ~12s per SOAP at 3× concurrency. Crucially, its medical vocabulary is visibly better than Gemma 3 4B’s, and the 128K window means we can hand it a full transcript and a full billing prompt in the same context, so it can serve both Stage 1 (SOAP) and Stage 3 (billing) from a single loaded model slot.

Qwen3-30B A3B (non-thinking): a ~18 GB MoE with ~3B active parameters per token. On our ICD coding suite it produced the strongest clinical reasoning we’ve seen from anything that fits on one workstation, without any thinking mode.
We can’t afford the latency of a thinking model on the ICD coding pipeline. Qwen 3.5 is brilliant, but the additional wall time (60–70% of the tokens produced are spent on thinking) is not worth the small quality gain for this pipeline.

With those two additional GGUF models, we were ready for a final comparison.

VI. How Do We Measure Quality for Each Pipeline Stage?

This is where we test for hallucination and medical quality, based on our experience. Each stage of the pipeline gets a different metric.

1. Stage 1 (SOAP extraction): zero-hallucination rate

We use an LLM-as-judge grader that inspects every SOAP field against the source transcript and flags any sentence not grounded in the transcript. We measure:

Fabrication rate: % of SOAP items not supported by the transcript.
Omission rate: % of transcript-present clinical facts missing from the SOAP.
Structural completeness: does the note have all four S/O/A/P sections?

Example finding: Gemma 4 E4B GGUF under a “secretarial” system prompt at temperature 0.0 hits 0.0% fabrication on our 21-scenario battery, with 96% keyword recall on complex cases. The same model with a generic “summarize this” prompt fabricates ages in ~12% of runs. The prompt is the primary hallucination lever.

The other nice property: Gemma 4’s 128K context window means we don’t have to chunk the polytrauma transcripts. Every earlier 4B model we tested degraded at ≥10K tokens; Gemma 4 doesn’t.

Stage 1b (summary for patient)

Aside from SOAP, we also provide a short summary that the doctor can send to the patient. The short patient summary doesn’t need 30B of reasoning. Gemma 3 4B text-only MLX (our previous scribe candidate) is perfect for this: it’s fast, it copies facts literally without editorializing, and its 2.6 GB footprint costs almost nothing. We measure:

Fact fidelity: does the summary contradict the SOAP?
Length compliance: is it within the target 3–5 sentences?
Readability: Flesch-score check against a clinic-readable baseline.

We don’t measure medical quality here, because this is meant to be a high-level summary for the patient with no medical jargon. Gemma 3 4B text wins this role cleanly, mainly due to its speed. It’s the rare case where “keep the old model around” is the right answer. Because it has an MLX version, it runs faster than Gemma 4 E4B.

2. Stage 2 (ICD coding): Jaccard + category recall

ICD-10-GM coding gets two metrics because exact-code matching is too strict and category matching is too loose:

Exact Jaccard: size of the intersection of the predicted and reference ICD code sets, divided by the size of their union.
Category recall: recall at the ICD-10 category level (e.g., any E11.* code counts if the reference is E11.9).

Our ICD runs use synthetic data based on our experience with real patient records. For this pipeline, the final contenders are Qwen3-30B A3B, GPT-OSS 20B, Gemma 12B QAT, and Orchestrator 8B. Qwen3-30B A3B’s biggest wins:

No secondary-code hallucinations. GPT-OSS 20B occasionally emitted E87.2 (acidosis) on endocrine cases and R05 (cough) on almost anything. Orchestrator did the same. Qwen3-30B doesn’t.
Correct Z-codes. GPT-OSS 20B systematically missed Z00.0, Z02.1, and Z09.8 and coded symptoms instead. Qwen3-30B handles Z-codes cleanly.
ICD-10-GM, not ICD-10-CM. GPT-OSS 20B would occasionally slip into the US coding set. Qwen3-30B stays in the German catalog.
Performance and concurrency: with its MoE architecture and non-thinking mode, Qwen3-30B is the fastest of the four models.

Stage 2b: the prescribing trap suite

We wrote 10 synthetic clinical scenarios with hidden prescribing errors, for example:

Aspirin monotherapy for atrial fibrillation (insufficient for AF stroke prevention).
Ibuprofen prescribed post-ACS (contraindicated with DAPT).
Metformin prescribed with CKD-4 (contraindicated below GFR 30).

We score models on whether Stage 2 flags these errors.
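Stepping back to the two Stage 2 coding metrics defined earlier in this section, here is a minimal Python sketch of how Exact Jaccard and category recall could be computed. The function names, the rule that a category is everything before the dot, and the handling of empty code sets are illustrative assumptions, not the grader used for the runs in this post.

```python
def category(code: str) -> str:
    """Return the ICD-10 category of a code, e.g. 'E11.9' -> 'E11'.

    Assumption: the category is everything before the first dot.
    """
    return code.strip().upper().split(".")[0]


def exact_jaccard(predicted: list[str], reference: list[str]) -> float:
    """Intersection over union of the exact predicted and reference codes."""
    pred = {c.strip().upper() for c in predicted}
    ref = {c.strip().upper() for c in reference}
    if not pred and not ref:
        return 1.0  # assumption: two empty code sets count as a perfect match
    return len(pred & ref) / len(pred | ref)


def category_recall(predicted: list[str], reference: list[str]) -> float:
    """Share of reference categories that appear among predicted categories."""
    ref_cats = {category(c) for c in reference}
    if not ref_cats:
        return 1.0  # assumption: nothing to recall counts as full recall
    pred_cats = {category(c) for c in predicted}
    return len(ref_cats & pred_cats) / len(ref_cats)


# Hypothetical example: the model picks E11.8 where the reference has E11.9.
predicted = ["E11.8", "I10"]
reference = ["E11.9", "I10"]
jaccard = exact_jaccard(predicted, reference)   # 1 shared code out of 3 distinct
recall = category_recall(predicted, reference)  # both categories are covered
```

In this made-up case the exact score is low while the category score is perfect, which is the gap the two metrics are meant to expose: the model found the right disease family but not the exact code.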
Qwen3-30B A3B catches the full spectrum, including the compound delirium scenario (QTc prolongation + anticholinergic + SSRI-induced hyponatremia + opioid-BZD) that broke simpler coders.

Note: Stage 1 is designed NOT to flag these errors. The scribe’s job is to record what was said. When we gave the safety traps to Orchestrator-as-scribe, it actively endorsed the wrong prescriptions. That’s worse than missing them. Gemma 4, our new scribe, stays neutral: it transcribes the doctor’s prescription as said, without editorializing, and leaves the safety call to Qwen3-30B.

3. Stage 3 (billing): multiple compliance checks

Billing codes either parse against the EBM/GOÄ/HZV grammar or they don’t. We measure:

Schema compliance: % of outputs that validate against the billing JSON schema.
Code legality: % of emitted codes that exist in the current catalog.

Here’s where Gemma 4 earns its second slot: we reuse the same loaded model for billing that serves Stage 1 SOAP, because billing is also a structured-extraction task and Gemma 4 hits full schema compliance on our EBM/GOÄ/HZV templates. One model, two stages, one slot of memory.

VII. What’s the Final Setup, and What Should You Consider?

Here’s what we’re running in production today.
Our setup loads four models at boot, each with the parallelism level that matches its role:

Transcript (speech-to-text output)
│
├──── Stage 1 (parallel) ──────────────────────────
│     Transcript → SOAP: gemma-4-e4b-it
│       parallel=8 · ctx=32768 · 128K architecture
│       Main throughput model, shared with Stage 3
│
│     Summary: gemma-3-text-4b-it / gemma-4-e4b-it
│       parallel=4 · ctx=24576 · literal fact copy
│       Small, fast, non-editorializing
│
SOAP JSON
│
├──── Stage 2 ─────────────────────────────────────
│     SOAP → ICD: Qwen3-30B-A3B
│       parallel=4 · ctx=16384 · MoE (≈3B active)
│       Strong Z-codes, no secondary hallucinations
│
ICD codes + klinische_warnhinweise
│
├──── Stage 3 ─────────────────────────────────────
│     Billing: gemma-4-e4b-it (shared with Stage 1)
│       Same loaded slot, structured extraction reuse
│
└──── Document/OCR processing ─────────────────────
      gemma-3-12b-it-qat
        parallel=1 · ctx=16384

Four models, one slot each, ~33 GB of unified memory in total (roughly 35% of the 96 GB M3 Ultra). That still leaves comfortable headroom for Whisper, the NER service, and other small models.

Two design choices worth calling out:

Gemma 4 serves two stages from one slot. Both SOAP extraction and billing are structured-JSON tasks. Sharing the same loaded model between Stage 1 and Stage 3 saves ~5 GB and avoids KV-cache thrash.

Gemma 3 12B QAT stays loaded at parallel=1. We use it for document understanding with OCR, and it also acts as a fallback in some edge cases.

Considerations if you’re building something similar:

The multi-layer split is mandatory. Not every step needs the same model.

Don’t marry yourself to one runtime. MLX is fast on Apple Silicon, but we only got to our final stack by leaving MLX for both production models. GGUF + llama.cpp is the universal fallback and often the only working path when a new architecture ships. Budget time to test both.

Not all hallucinations are equal.
A word-level quantization artifact (an invented age, wrong numbers) is different from a fabricated anatomical structure. Grade by type, not just rate.

Fastest isn’t best for clinical use. LFM2 at 190+ tok/s is impressive, but it also omits the Objective section. Speed wins for draft summaries; it loses for clinical notes and ICD codes.

Thinking mode is a tradeoff, not an upgrade. For our current use case, thinking mode is pure overhead. One of the quiet wins of Qwen3-30B A3B is that it doesn’t need extra time for a thinking process.

Share model slots where you can. Gemma 4 serves both SOAP and billing because both are structured-extraction tasks with similar prompt shapes. One model loaded, two stages served, less fragmentation.

That’s the whole stack. If you’re running local LLMs for medical documentation, hopefully this saves you a few weeks of benchmarking, and a few days of wondering why your shiny new MLX model refuses to load. Have fun!

We Tested 25 Local LLMs for Medical Use. Here’s What Shipped. was originally published in Towards AI on Medium.
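As a recap of the final setup in section VII, the stage-to-model mapping can be sketched as a small routing table in Python. The model names, parallel values, and context sizes are copied from the setup described above. The StageConfig class, the role keys, and the two helper functions are hypothetical names introduced for illustration, and picking gemma-3-text-4b-it for the summary role is an assumption, since the setup lists two options there.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    model: str     # model identifier as named in the setup above
    parallel: int  # concurrent request slots for this role
    ctx: int       # context length in tokens


# Stage 3 reuses the Stage 1 slot, so it carries the same settings.
PIPELINE = {
    "stage1_soap": StageConfig("gemma-4-e4b-it", 8, 32768),
    "stage1b_summary": StageConfig("gemma-3-text-4b-it", 4, 24576),  # assumed choice
    "stage2_icd": StageConfig("Qwen3-30B-A3B", 4, 16384),
    "stage3_billing": StageConfig("gemma-4-e4b-it", 8, 32768),
    "document_ocr": StageConfig("gemma-3-12b-it-qat", 1, 16384),
}


def loaded_models(pipeline: dict[str, StageConfig]) -> list[str]:
    """Distinct models that have to be resident in memory."""
    return sorted({cfg.model for cfg in pipeline.values()})


def shared_slots(pipeline: dict[str, StageConfig]) -> dict[str, list[str]]:
    """Models that serve more than one role, with the roles sharing them."""
    by_model: dict[str, list[str]] = {}
    for role, cfg in pipeline.items():
        by_model.setdefault(cfg.model, []).append(role)
    return {m: roles for m, roles in by_model.items() if len(roles) > 1}
```

With this table, five pipeline roles resolve to four loaded models, and shared_slots reports gemma-4-e4b-it as the one model serving both the SOAP and billing roles.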


Source

https://pub.towardsai.net/we-tested-25-local-llms-for-medical-use-heres-what-shipped-cbad528f924b?source=rss----98111c9905da---4