I tried OpenGUI recently. The idea is pretty fun: instead of only using a browser or API, the agent works with a real Android phone.
It takes a screenshot, uses a VLM to understand the UI, plans a single step, performs an Android accessibility action, and then checks the screen again.
What makes this interesting to me is not the clicking itself. It is the longer loop: observe, plan, act, verify, recover.
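For the loop itself, here is roughly how I picture it: a minimal sketch, not OpenGUI's actual code, assuming a phone reachable over adb and a hypothetical `query_vlm()` helper that sends a screenshot plus a prompt to whatever VLM you use and returns a JSON action.

```python
# Minimal observe -> plan -> act -> verify loop (a sketch, not OpenGUI's code).
# Assumes: adb on PATH with a connected device, and a query_vlm() you supply.
import subprocess
import time

def screenshot() -> bytes:
    # Observe: grab the current screen as a PNG via adb.
    return subprocess.run(
        ["adb", "exec-out", "screencap", "-p"],
        capture_output=True, check=True,
    ).stdout

def tap(x: int, y: int) -> None:
    # Act: perform a single tap through adb's input shell command.
    subprocess.run(["adb", "shell", "input", "tap", str(x), str(y)], check=True)

def query_vlm(image: bytes, prompt: str) -> dict:
    # Placeholder: send the screenshot and prompt to your VLM and parse its
    # reply, e.g. {"action": "tap", "x": 540, "y": 960, "done": false}.
    raise NotImplementedError

def run_task(goal: str, max_steps: int = 20) -> bool:
    for _ in range(max_steps):
        before = screenshot()                           # observe
        plan = query_vlm(before, f"Goal: {goal}. What single action comes next?")
        if plan.get("done"):
            return True
        if plan.get("action") == "tap":                 # act
            tap(plan["x"], plan["y"])
        time.sleep(1.5)                                 # let the UI settle
        after = screenshot()                            # verify
        check = query_vlm(after, f"Goal: {goal}. Did the last action work?")
        if not check.get("progressed", True):           # recover: replan from here
            continue
    return False
```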
Mobile UI is messy. Search boxes, tabs, popups, loading states, disabled buttons, ambiguous icons, and result cards all need to be understood correctly. The model also needs to know whether the last action actually worked.
I can imagine using this for everyday mobile workflows: looking up movie or anime info, checking game updates, collecting learning materials, searching Reddit/X discussions, or doing repetitive in-app tasks.
Setup is still developer-facing, but Codex made it much easier to run locally.
Curious which VLM people would try first for this kind of mobile GUI automation.