u/Euphon1um

I tried OpenGUI recently. The idea is pretty fun: instead of only using a browser or API, the agent works with a real Android phone.

It screenshots the screen, uses a VLM to understand the UI, plans one step, performs an Android accessibility action, then checks the screen again.
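Just to make "plans one step" concrete: the model emits one structured action at a time. Roughly something like this (my own sketch of the shape, not necessarily how OpenGUI actually encodes it):

```kotlin
// Illustrative shape of one planned step; not necessarily OpenGUI's real schema.
sealed interface UiAction {
    data class Tap(val x: Int, val y: Int) : UiAction
    data class Swipe(val fromX: Int, val fromY: Int, val toX: Int, val toY: Int) : UiAction
    data class TypeText(val text: String) : UiAction
    object Back : UiAction
}

// One planning call might decode to, e.g., tapping the search box:
val next: UiAction = UiAction.Tap(x = 540, y = 1730)
```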

What makes this interesting to me is not the clicking itself. It is the longer loop: observe, plan, act, verify, recover.

Mobile UI is messy. Search boxes, tabs, popups, loading states, disabled buttons, ambiguous icons, and result cards all need to be understood correctly. The model also needs to know whether the last action actually worked.

I can imagine using this for normal mobile workflows: searching movie or anime info, checking game updates, collecting learning materials, searching Reddit/X discussions, or doing repetitive app tasks.

Setup is still developer-facing, but Codex made it much easier to run locally.

Curious what VLM people would try first for this kind of mobile GUI automation.

u/Euphon1um — 12 days ago

I’m involved with OpenGUI, an open-source Android GUI agent project. I’ve been helping test the Android client, backend setup, and release flow.

I kept running into the same limitation with AI agents.

They can browse websites. They can call APIs. They can write code. But the moment the workflow moves into a mobile app, everything gets awkward.

A lot of real tasks still happen on phones: checking app content, searching social platforms, collecting posts, looking through course apps, reading comments, comparing information across apps. For these tasks, browser automation is not always enough.

So OpenGUI is trying a different path: let an AI agent operate a real Android device.

Not a mock UI. Not just a web page. A real phone screen.

The loop is simple in theory:

  1. capture the Android screen

  2. let a VLM understand the current UI

  3. plan the next step

  4. execute tap / swipe / type through Android AccessibilityService (see the sketch after this list)

  5. observe again

  6. continue, retry, or recover when the UI changes
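For step 4, the stock Android mechanism is AccessibilityService.dispatchGesture (API 24+, and the service has to declare android:canPerformGestures="true"). A minimal tap at screen coordinates looks roughly like this; I'm not claiming this is exactly how OpenGUI's client wires it up:

```kotlin
import android.accessibilityservice.AccessibilityService
import android.accessibilityservice.GestureDescription
import android.graphics.Path
import android.view.accessibility.AccessibilityEvent

class AgentAccessibilityService : AccessibilityService() {

    // Tap at absolute screen coordinates by dispatching a short gesture stroke.
    fun tap(x: Float, y: Float) {
        val path = Path().apply { moveTo(x, y) }
        val gesture = GestureDescription.Builder()
            .addStroke(GestureDescription.StrokeDescription(path, 0L, 50L))
            .build()
        dispatchGesture(gesture, object : AccessibilityService.GestureResultCallback() {
            override fun onCompleted(gestureDescription: GestureDescription?) {
                // gesture delivered: the agent can go back to the observe step
            }
            override fun onCancelled(gestureDescription: GestureDescription?) {
                // gesture interrupted (e.g. the screen changed): report failure so the planner can retry
            }
        }, null)
    }

    override fun onAccessibilityEvent(event: AccessibilityEvent?) {}
    override fun onInterrupt() {}
}
```

Swipes are the same idea with a longer stroke path, and typing usually goes through ACTION_SET_TEXT on a focused node rather than a gesture.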

The hard part is long-horizon reliability.

The model needs to understand mobile UI intent: search boxes, tabs, modal dialogs, information cards, disabled buttons, ambiguous icons, and loading states. It also needs to judge whether previous actions actually took effect.
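In other words, the interesting part is the outer loop: act, look again, decide whether the action took effect, and retry or recover if it didn't. Roughly like this (every helper and type below is a placeholder, not OpenGUI's actual API):

```kotlin
// Hypothetical outer loop; all helpers here are placeholders, not OpenGUI's real API.
class ScreenState                         // screenshot + accessibility snapshot
class PlannedStep                         // one action proposed by the VLM

fun captureScreen(): ScreenState = TODO("grab screenshot / UI tree")
fun planNextStep(goal: String, s: ScreenState): PlannedStep = TODO("ask the VLM for the next action")
fun execute(step: PlannedStep): Unit = TODO("tap / swipe / type via AccessibilityService")
fun goalReached(goal: String, s: ScreenState): Boolean = TODO("ask the VLM to judge completion")
fun tookEffect(step: PlannedStep, before: ScreenState, after: ScreenState): Boolean =
    TODO("did the screen change as expected?")
fun recover(s: ScreenState): Unit = TODO("go back / dismiss popup / re-plan")

fun runTask(goal: String, maxSteps: Int = 30) {
    var retries = 0
    repeat(maxSteps) {
        val before = captureScreen()
        val step = planNextStep(goal, before)
        execute(step)
        val after = captureScreen()
        when {
            goalReached(goal, after) -> return              // task done
            tookEffect(step, before, after) -> retries = 0  // progress: reset retry budget
            retries++ < 2 -> Unit                           // no visible effect: retry / re-plan
            else -> recover(after)                          // stuck: take a recovery action
        }
    }
}
```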

That is why I’m especially interested in local VLMs for this.

For long-running Android GUI tasks, which model direction would you try first?

Qwen-VL? InternVL? UI-TARS-style models? AgentCPM-GUI? Something else?

The things I care about most:

- mobile UI understanding

- grounding actions to screen coordinates or UI elements

- multi-step reliability

- recovery after failed or ambiguous actions

- not needing a custom adapter for every single app

I don’t think this problem is solved yet. But watching an agent control a real Android phone makes the “mobile AI agent” idea feel much more concrete than another browser demo.

Repo for context:

https://github.com/Core-Mate/open-gui
