
I've hit a wall, and any help would be appreciated! Vet ER scribe: frontier models nail it, local LLMs are inconsistent. Is this a model problem, a methodology problem, or a training problem?
First, my apologies if this is the wrong sub for this. I'm a long-time lurker, and the truth is a lot of this is over my head, but I'm trying and learning. If it helps, this is a picture of my front end, with an explanation to follow. Yes, the vast majority of this is vibe coded. Please limit the hate 😉. I'm proud of it; I created something I actually use every night.
I'm an emergency vet who built a custom dictation/SOAP scribe for my own use. Workflow:
- Record dictation on my phone (PWA in the browser)
- Audio uploads to Firebase Storage; Whisper transcribes
- Transcript + a system prompt loaded from a single markdown file get sent to the model
- Model returns structured JSON → app renders five SOAP sections (History / PE / Assessment / Plan / Discharge); rough shape sketched just below this list
- Output is saved to Drive as markdown, copy-pasted into our PIM as either rich text (one hospital) or raw markdown (the other), and printed for paper records
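For reference, the structured output I'm asking the model for is basically five string fields. Something like this (field names are illustrative, not my exact schema):

```typescript
// Rough shape of the JSON the model is asked to return.
// Field names are illustrative; the real prompt spells out more formatting rules per section.
interface SoapNote {
  history: string;      // presenting complaint + pertinent history
  physicalExam: string; // PE findings
  assessment: string;   // problem list / differentials
  plan: string;         // diagnostics, fluids, meds, dosing per the rules doc
  discharge: string;    // owner-facing discharge instructions
}
```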
The load-bearing piece is the markdown file. It lives in Obsidian (my "second brain," or whatever you want to call it) and contains everything that matters:
- SOAP templates
- fluid calculations (BER, dehydration correction, FLK CRI recipe)
- drug dosing list
- dispensing instruction templates
- safety flags (NSAID + steroid → flag, acetaminophen in cats → flag, enrofloxacin > 5 mg/kg in cats → flag, etc.)
- narration style
- output format rules...
I edit it in Obsidian, sync to Drive, and a Cloud Function pulls it into the prompt at request time. So technically not RAG — it's a static system prompt that's loaded fresh per session, with the entire ruleset in context every call.
The Obsidian doc IS the product. The frontend is just a recorder and a paste target. The intelligence is whatever the LLM does with that markdown.
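To make "loaded fresh per session" concrete, the Cloud Function step amounts to something like the sketch below. This is not my actual code; fetchRulesDoc and callModel are stand-ins for the Drive fetch and the model API call.

```typescript
// Simplified sketch of the request-time prompt assembly (not my real code).
// fetchRulesDoc() and callModel() are placeholders for the Drive fetch and the LLM provider call.
declare function fetchRulesDoc(): Promise<string>;
declare function callModel(messages: { role: "system" | "user"; content: string }[]): Promise<string>;

async function buildSoapNote(transcript: string): Promise<string> {
  const rulesMarkdown = await fetchRulesDoc(); // the whole Obsidian ruleset, pulled fresh every request
  const messages = [
    { role: "system" as const, content: rulesMarkdown }, // ~25-30k tokens of rules in context
    { role: "user" as const, content: `Dictation transcript:\n${transcript}` },
  ];
  return callModel(messages); // expected to return the structured JSON described above
}
```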
What works: Gemini via Gems is the most consistent of the frontier models I've tried. Claude is great when it doesn't truncate. ChatGPT is fine but sometimes ignores the formatting rules.
What doesn't: I cannot get consistent output from local models. Same prompt, same input — some runs are clinical-grade, others miss whole sections, ignore the safety flags, or hallucinate medications. Hard to put into actual clinical use when output quality is a coin flip.
My setup: Core Ultra 9, 128GB RAM, RTX 5090, Proxmox host, running AnythingLLM + Ollama (llama.cpp). Happy to swap either layer if there's a reason to.
I've tried several: Gemma 3 (all sizes, though the largest dense one doesn't fit on my system), Qwen3 30B-A3B, and multiple others.
Questions:
- Am I just picking the wrong models? What's been most reliable for following long, structured system prompts with strict output formats — particularly anything that fits comfortably on 32GB VRAM?
- Is fine-tuning a real option here, or am I underestimating sampling parameters / context-window discipline? The temperature is already low (I've put a sketch of the knobs I mean after this list).
- With that said, I have no idea how to fine-tune a model and it may well be outside my skill set, but if it's feasible and the right direction, I'll put in the time to learn.
- Is the methodology wrong? Should I be doing actual RAG, i.e. chunking the rules doc and retrieving per-section rather than dumping the whole file into the system prompt every call? (Rough sketch of what I mean after this list.)
- Does the inference layer matter for this? AnythingLLM vs raw llama.cpp vs vLLM vs something else?
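To make the sampling / context-window question concrete, this is the kind of direct Ollama call I'm talking about. The values are guesses on my part, not a known-good config, and I genuinely don't know what AnythingLLM sets num_ctx to under the hood; as I understand it, if the context window is smaller than the rules doc, part of the prompt gets silently truncated.

```typescript
// Minimal direct call to Ollama's /api/chat, just to name the knobs I'm asking about.
// Values are illustrative, not a recommendation.
async function localSoapNote(rules: string, transcript: string): Promise<string> {
  const res = await fetch("http://localhost:11434/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "qwen3:30b-a3b", // whichever local model is being tested; tag is illustrative
      stream: false,
      format: "json",         // ask Ollama to constrain the reply to valid JSON
      options: {
        num_ctx: 40960,       // needs to cover the ~25-30k-token rules doc + transcript + output
        temperature: 0.2,     // already running low temperature
        seed: 42,             // fixed seed when comparing runs
      },
      messages: [
        { role: "system", content: rules },
        { role: "user", content: transcript },
      ],
    }),
  });
  const data = await res.json();
  return data.message?.content ?? "";
}
```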
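And for the RAG question, the naive version I have in mind is just splitting the rules doc on its headings and only sending sections that look relevant to the current dictation. A sketch (keyword matching here is a stand-in for real retrieval with embeddings):

```typescript
// Naive per-section retrieval over the rules markdown (sketch only).
// A real version would use embeddings; keyword overlap is just a placeholder.
function selectRelevantSections(rulesMarkdown: string, transcript: string): string {
  const sections = rulesMarkdown.split(/\n(?=## )/); // assumes one "## " heading per rule section
  const dictation = transcript.toLowerCase();
  const relevant = sections.filter((section) => {
    const heading = (section.split("\n")[0] ?? "").replace(/^#+\s*/, "").toLowerCase();
    // keep a section if any meaningful word from its heading appears in the dictation
    return heading.split(/\W+/).some((word) => word.length > 3 && dictation.includes(word));
  });
  return relevant.join("\n");
}
```

My instinct is that the safety flags and the output-format rules would have to be included on every call regardless, which is part of why I'm not sure RAG is actually the right shape for this.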
Happy to share the markdown file structure if it helps. Mostly I want to understand whether local-LLM inconsistency is a "find the right model" problem, a "you're prompting wrong" problem, or a "you actually need to train this" problem.
I am not a 'coder'. I like to think I'm pretty tech savvy, and I've been working with computers for 30 years, but in the end, "I'm a vet, not an engineer".
Thank you for reading, and any direction would be appreciated.
Edit: The Markdown is roughly 25–30k tokens