
has anyone tried local VLMs for desktop GUI automation?

Trying to use a quantized VLM on Apple Silicon to do desktop GUI automation from screenshots. It works OK for basic stuff, but small icons and dense UIs are rough. Also, the visual token count per screenshot is way higher than I expected, which kills prefill speed.

Anyone else working on this locally? Curious what models/approaches people have tried.

u/Enough-Astronaut9278 — 2 days ago

Anyone running quantized VLMs on Apple Silicon for non-chat tasks?

Most of what I see discussed here is text-only LLMs, but I've been getting into running vision-language models locally and I'm curious what others are doing.

My use case is GUI automation — feeding screenshots into a VLM and having it output actions (click coordinates, typing, etc). It's a weird workload compared to chatbots because the model runs in a tight loop: capture screen → infer → act → capture again. So latency per call matters way more than throughput.

A few things I've noticed with quantized VLMs on M-series chips:

  • Memory bandwidth seems to be the actual bottleneck, not compute. The vision encoder produces a ton of tokens from a single screenshot, and prefill dominates the workload (rough numbers in the sketch after this list).
  • w4a16 quantization (4-bit weights, 16-bit activations) seems to hold up better for visual tasks than I expected. My theory is that structured visual reasoning (identifying UI elements, reading text on screen) is more tolerant of weight quantization than open-ended text generation.
  • Unified memory is a huge advantage here since the full image doesn't need to be copied between CPU and GPU memory.
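
Rough numbers behind the first two bullets — back-of-envelope only, assuming a ViT-style encoder with 14 px patches, optional 2x2 token merging, and a 7B-parameter decoder; real models vary:

```python
# Back-of-envelope math, not measured numbers.
width, height = 1920, 1080           # a typical full-screen screenshot
patch = 14                           # common ViT patch size (assumption)
raw_tokens = (width // patch) * (height // patch)
merged_tokens = raw_tokens // 4      # many VLMs merge 2x2 patches into one token

params = 7e9                         # assumed 7B decoder
fp16_gb = params * 2 / 1e9           # weight bytes streamed per decode step at 16-bit
w4_gb = params * 0.5 / 1e9           # ~4x less to stream with 4-bit weights

print(f"visual tokens (no merging): {raw_tokens}")     # ~10,500
print(f"visual tokens (2x2 merge):  {merged_tokens}")  # ~2,600
print(f"weights per decode step: {fp16_gb:.1f} GB (fp16) vs {w4_gb:.1f} GB (w4)")
```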

Questions for the community:

  1. Is anyone else running multimodal models locally for tasks other than chat/image description?
  2. For those on Apple Silicon — what VLMs are you running and at what quantization level?
  3. Any tips on reducing prefill latency for high-resolution image inputs?
u/Enough-Astronaut9278 — 6 days ago