Got local RAG to surface the right schematic without a vision model — here's how
Been building a local RAG stack for aviation technical manuals (the kind you legally can't upload to ChatGPT). Hit a wall that I think a lot of people hit: the model would cite "see Figure 9-02-40" but the user was left hunting through a 600-page PDF manually.
Solved it without a VLM. Here's the approach:
The problem: in these manuals, the schematics live *near* the text that references them, but they aren't embedded as extractable image objects. They're rendered geometry on the page, so standard image extraction finds nothing.
The fix: pdfplumber gives you word coordinates. When a RAG chunk contains a figure reference ("Fig 4-12", "HYDRAULIC SYSTEM SCHEMATIC", "refer to the following diagram"), you can:
Parse the reference from the retrieved chunk
Look up which page it came from (already in metadata)
Use pdfplumber to crop a bounding box around the figure label coordinates
Render and return it inline
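The parse step looks roughly like this (the pattern is illustrative, not the exact regex I use; real manuals vary in numbering scheme, so tune it to yours):

```python
import re

# Illustrative pattern -- catches "Figure 9-02-40", "Fig 4-12",
# "Fig. 12-3", case-insensitively. Adjust for your manual's scheme.
FIGURE_REF = re.compile(r"\bFig(?:ure)?\.?\s*(\d+(?:-\d+){1,3})", re.IGNORECASE)

def extract_figure_refs(chunk_text: str) -> list[str]:
    """Pull every referenced figure number out of a retrieved chunk."""
    return FIGURE_REF.findall(chunk_text)

extract_figure_refs("Pressure limits are shown in Figure 9-02-40; see also Fig 4-12.")
# -> ['9-02-40', '4-12']
```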
No VLM. No vision API call. Sub-second. Runs entirely on local hardware.
The coordinate precision is what makes it work — you're not guessing, you're reading the PDF's native geometry to find exactly where the schematic sits relative to its caption.
Stack: pdfplumber + ChromaDB + Ollama (Gemma 3 / whatever fits your GPU). Works on an RTX 3080 Ti with a 3,500-chunk corpus no problem.
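For the "page is already in metadata" lookup to be O(1), the page number has to travel with the chunk at index time. A minimal sketch of the record shape I mean (field names here are my illustration, matching the parallel ids/documents/metadatas lists ChromaDB's `add()` expects):

```python
def chunk_records(pdf_name, pages):
    """Turn (page_number, page_text) pairs into vector-DB records whose
    metadata carries the source page, so a retrieved chunk maps straight
    back to its page with no searching at answer time."""
    return [
        {
            "id": f"{pdf_name}-p{num:04d}",
            "document": text,
            "metadata": {"source": pdf_name, "page": num},
        }
        for num, text in pages
    ]

recs = chunk_records("amm.pdf", [(412, "See Figure 9-02-40 for the hydraulic schematic.")])
# recs[0]["metadata"]["page"] -> 412
```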
Happy to share more detail on the figure detection regex or the crop logic if anyone's building something similar.