
has anyone tried local VLMs for desktop GUI automation?

Trying to use a quantized VLM on Apple Silicon to do desktop GUI automation from screenshots. It works OK for basic stuff, but small icons and dense UIs are rough. Also, the visual token count per screenshot is way higher than I expected, which kills prefill speed.

Anyone else working on this locally? Curious what models/approaches people have tried.

u/Enough-Astronaut9278 — 2 days ago

Anyone running quantized VLMs on Apple Silicon for non-chat tasks?

Most of what I see discussed here is text-only LLMs, but I've been getting into running vision-language models locally and I'm curious what others are doing.

My use case is GUI automation — feeding screenshots into a VLM and having it output actions (click coordinates, typing, etc). It's a weird workload compared to chatbots because the model runs in a tight loop: capture screen → infer → act → capture again. So latency per call matters way more than throughput.

A few things I've noticed with quantized VLMs on M-series chips:

  • Memory bandwidth seems to be the actual bottleneck, not compute. The vision encoder produces a ton of tokens from a single screenshot, and prefill dominates the workload (rough numbers in the sketch after this list).
  • w4a16 quantization (4-bit weights, 16-bit activations) seems to hold up better for visual tasks than I expected. My theory is that structured visual reasoning (identifying UI elements, reading text on screen) is more tolerant of weight quantization than open-ended text generation.
  • Unified memory is a huge advantage here since the full image doesn't need to be copied between CPU and GPU memory.
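
Rough numbers behind the first two bullets — back-of-envelope only, assuming a ViT-style encoder with 14 px patches, optional 2x2 token merging, and a 7B-parameter decoder; real models vary:

```python
# Back-of-envelope math, not measured numbers.
width, height = 1920, 1080           # a typical full-screen screenshot
patch = 14                           # common ViT patch size (assumption)
raw_tokens = (width // patch) * (height // patch)
merged_tokens = raw_tokens // 4      # many VLMs merge 2x2 patches into one token

params = 7e9                         # assumed 7B decoder
fp16_gb = params * 2 / 1e9           # weight bytes streamed per decode step at 16-bit
w4_gb = params * 0.5 / 1e9           # ~4x less to stream with 4-bit weights

print(f"visual tokens (no merging): {raw_tokens}")     # ~10,500
print(f"visual tokens (2x2 merge):  {merged_tokens}")  # ~2,600
print(f"weights per decode step: {fp16_gb:.1f} GB (fp16) vs {w4_gb:.1f} GB (w4)")
```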

Questions for the community:

  1. Is anyone else running multimodal models locally for tasks other than chat/image description?
  2. For those on Apple Silicon — what VLMs are you running and at what quantization level?
  3. Any tips on reducing prefill latency for high-resolution image inputs?
u/Enough-Astronaut9278 — 6 days ago