u/EconomySalamander706

Ran a side-by-side test today and I'm genuinely confused about how this model gets called "good at coding."

Setup: built the same custom assistant in both Gemini (as a Gem, 2.5 Pro) and Claude (Opus 4.7). Same custom instructions, same two reference markdown files, fresh chats. The assistant's job is dead simple: I show it a screenshot of a UI, it writes me a prompt I can paste into Figma Make to recreate that screen. That's it. Translate image → text prompt. The downstream tool (Make) doesn't see the screenshot, only the text I paste.
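To make the information flow concrete, here's a tiny toy sketch (Python, every name in it is made up, it's not the real Gemini/Claude/Figma Make API) of what the translator step has to do:

```python
# Toy sketch of the pipeline, assuming nothing about the real APIs involved.
# The whole point: the downstream tool only ever receives the pasted text, never the image.

def translator_reply(spells_it_out: bool) -> str:
    """Stand-in for the assistant's answer after looking at the screenshot."""
    if spells_it_out:
        # what a good reply looks like: every label/value stated verbatim
        return "Card 'Weekly Sales', x-axis Mon..Sun, primary button 'Export CSV', ..."
    # the failure mode: pointing at an image the reader will never see
    return "Replicate the layout from the screenshot."

def figma_make_stub(prompt_text: str) -> str:
    """Stand-in for Figma Make: all it gets is the string you paste."""
    return f"building screen from: {prompt_text!r}"

print(figma_make_stub(translator_reply(spells_it_out=True)))   # self-contained prompt, rebuildable
print(figma_make_stub(translator_reply(spells_it_out=False)))  # references an image that isn't there
```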

Claude got it on the first try. Looked at my screenshot, wrote a detailed prompt with all the actual labels, IDs, card titles, indication strings, x-axis values verbatim. Pasted into Make, got back something recognizably my reference screen.

Gemini wrote "replicate the layout from the screenshot" into the prompt. Bro. Make can't see the screenshot. You're the translator. That's literally the whole job, and it's spelled out in the custom instructions.

I corrected it, it apologized and tried again, this time descriptive. Cool. Pasted Prompt 2 into a new Make file. Then we move to the next prompt in the chain, and Gemini just… forgets what we were building and proposes designs and navigation I never asked for. A completely new interface (meanwhile, Claude's chain stayed locked on my actual reference screen the whole time).

So here's what bugs me. Everyone says "Gemini 2.5 Pro is great at coding" and points to benchmarks. But this isn't even a coding task. It's "look at this thing, describe it for someone who can't see it." If a model can't track what its own downstream reader can see, how does anyone trust it on agentic stuff, multi-file refactors, or anything where output from step 1 feeds into step 2?

Ofc, I'm still new to this field, but I can't find any legit source that explains why the difference is this HUGE. Why do most benchmarks show 2.5 Pro as a competitive tool when it acts this brain-rotted?

I'll be grateful for answers!
